privefl / bigsnpr

R package for the analysis of massive SNP arrays.
https://privefl.github.io/bigsnpr/

snp_readBGEN() could not generate .rds file #211

Closed xscapex closed 3 years ago

xscapex commented 3 years ago

Hi Florian,

Thank you so much for creating such a convenient package. I have two questions about the snp_readBGEN() function, and I'm wondering if you could give me some advice.

  1. I got an "R session aborted" error when I tried to read a .bgen file with snp_readBGEN(). The function generated the .bk file, but there is no .rds file. My R version is 4.0.5. Below is a screenshot of my code. Is there anything I can do to import the data?

image

  2. I've read your sample code, but I'm not sure whether I can use "ukb_mfi_chrxx_v3.txt" as list_snp_id, since UK Biobank Data Showcase Resource 531 indicates that the order of markers in these files is not guaranteed to be the same as in the BGEN files.

image

Thank you, and I look forward to hearing from you. Kind regards, Monica

privefl commented 3 years ago
  1. I think other people may have had the same problem and reported it in other issues. Things that come to mind are "do you have enough disk space where you want to write?" (this should be checked in the latest version of {bigstatsr}) and "are these the UKBB BGEN files as provided, or some that you made yourself?". And yes, it is normal that you get the .bk file, because it is the first thing that is produced (though it is probably not filled with any data yet); the .rds file is the last thing that is produced.

  2. The BGI files are used to find where the variants are stored in the BGEN files, so we don't really care about the order here. This is also why this function needs SNP IDs instead of just position indices (as for the individuals).
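To make point 2 concrete, here is a minimal sketch of building IDs in the `<chr>_<pos>_<a1>_<a2>` form from a variant table, in whatever row order the table happens to have. The data frame, its column names (`chr`, `pos`, `a1`, `a2`) and the helper `make_snp_id` are made up for illustration; the actual columns of the UKBB MFI files should be checked before adapting this.

```r
# Hypothetical variant table; column names and row order are illustrative only.
variants <- data.frame(
  chr = c("21", "21", "21"),
  pos = c(19467441L, 9411239L, 9411245L),
  a1  = c("CAA", "G", "C"),
  a2  = c("C", "A", "A"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build one "<chr>_<pos>_<a1>_<a2>" ID per row.
# format(..., scientific = FALSE) keeps positions in plain digits.
make_snp_id <- function(df) {
  pos_chr <- format(df$pos, scientific = FALSE, trim = TRUE)
  paste(df$chr, pos_chr, df$a1, df$a2, sep = "_")
}

snp_ids <- make_snp_id(variants)

# snp_readBGEN() expects a list with one character vector per BGEN file.
list_snp_id <- list(snp_ids)
```

Formatting the position explicitly is a precaution: a plain `paste()` can render round positions such as 100000 in scientific notation, which would no longer match any variant ID.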

xscapex commented 3 years ago

Hi Florian,

Thank you so much for the prompt reply, I’ll try it and update here!

xscapex commented 3 years ago

I have checked my disk space, and there seems to be enough of it to write the files.

```r
setwd("E:\\readBGEN_test1")

# Load packages
library(data.table)
library(bigsnpr)

###############################################################
# Check space
###############################################################

# Change R temporary directory
write("TMPDIR = 'E:\\Rtmp'", file = file.path(Sys.getenv('R_USER'), '.Renviron'))
tempdir()  # There are 9 TB of free space

# Check whether we have enough disk space
FBM(500000, 1261158, backingfile = "E:\\readBGEN_test1\\test0")  # We need 4.58 TB of space, enough!
FBM(10417, 1261158, backingfile = "E:\\readBGEN_test1\\test1")   # We need 98 GB of space, enough!

file.remove("test0.bk")
file.remove("test1.bk")
```
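A rough size estimate can also be computed without allocating anything. The sketch below assumes that a default FBM() stores 8-byte doubles and that snp_readBGEN() writes one byte per genotype (an FBM.code256); both are assumptions to verify against the {bigstatsr} and {bigsnpr} documentation. `backing_size_gib` is a hypothetical helper, not a package function.

```r
# Hypothetical helper: size in GiB of an n x m matrix
# stored with a given number of bytes per cell.
backing_size_gib <- function(n, m, bytes_per_cell) {
  n * m * bytes_per_cell / 2^30
}

n_ind <- 10417     # rows used in the second FBM() check
n_snp <- 1261158   # columns used in both FBM() checks

size_double <- backing_size_gib(n_ind, n_snp, 8)  # default FBM() of doubles (assumption)
size_raw    <- backing_size_gib(n_ind, n_snp, 1)  # one byte per genotype (assumption)

cat(sprintf("double: %.1f GiB, one byte: %.1f GiB\n", size_double, size_raw))
```

If the one-byte assumption holds, the FBM() check overestimates the space that snp_readBGEN() needs by a factor of 8, so passing it is a conservative test.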

bigsnpr requires the Rcpp package, so I installed it and made sure that Rtools works.

```r
###############################################################
# Check Rcpp/Rtools
###############################################################
Rcpp::evalCpp("2+2")  # this outputs 4
```

snp_readBGEN() requires list_snp_id, and I made sure that the IDs are in the form `<chr>_<pos>_<a1>_<a2>`. I created list_snp_id from ukb_mfi_chr21_v3.txt, which was provided by UK Biobank.

```r
################################################################
# list_snp_id
################################################################
list_snp_id <- fread("snpid_chr21.txt", header = FALSE)
list_snp_id <- as.list(list_snp_id)
```

This is what my list_snp_id looks like:

image
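Since the screenshot is not reproduced here, the sketch below shows one way to check that every ID follows the `<chr>_<pos>_<a1>_<a2>` form. The regular expression and the helper `malformed_ids` are illustrative assumptions, not part of {bigsnpr}; the allele part is restricted to A/C/G/T, which may be too strict for some datasets.

```r
# Pattern for "<chr>_<pos>_<a1>_<a2>"; alleles may span several bases (indels).
id_pattern <- "^[0-9A-Za-z]+_[0-9]+_[ACGT]+_[ACGT]+$"

# Hypothetical helper: return the IDs that do not match the pattern.
malformed_ids <- function(ids) ids[!grepl(id_pattern, ids)]

# Two well-formed IDs (a SNP and an indel) and two malformed ones.
ids <- c("21_9411239_G_A", "21_19467441_CAA_C", "rs12345", "21:9411239_G_A")
malformed_ids(ids)
```

Under this pattern, a multi-base allele such as `CAA` counts as well formed, since it simply denotes an indel.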

I first tried to read the .bgen file provided by UK Biobank, but I got the session error.

```r
###############################################################
# Trial 1: read BGEN without subset
###############################################################
rds <- snp_readBGEN(
  bgenfiles   = "ukb_chr21_v3.bgen",
  backingfile = "test2",
  list_snp_id = list_snp_id,
  ncores      = nb_cores()
)
# this got a "session aborted" error
```

Then I created a subset file containing 10,417 cases, but got the same error.

```r
################################################################
# Trial 2: read BGEN subset
################################################################

# Create a BGEN subset
cmd <- paste0(
  "plink2 --bgen ukb_chr21_v3.bgen ref-first",
  " --sample ukb22828_c21_b0_v3_s487268.sample",
  " --keep positive_list_10417.txt",
  " --export bgen-1.2",
  " --out ukb_chr21_v3_10417"
)
system(cmd)

rds <- snp_readBGEN(
  bgenfiles   = "ukb_chr21_v3_10417.bgen",
  backingfile = "test3",
  list_snp_id = list_snp_id,
  ncores      = nb_cores()
)
# this got a "session aborted" error
```
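One thing worth ruling out for a re-exported file: snp_readBGEN() looks for an index file named `<bgenfile>.bgi` in `bgi_dir` (by default, the directory of the BGEN file), and a BGEN file exported by another tool may not come with one. The sketch below only checks that the index is present; `has_bgi` is a hypothetical helper, not part of {bigsnpr}, and a missing index is not confirmed as the cause of the crash here.

```r
# Hypothetical helper: TRUE when "<bgenfile>.bgi" exists in bgi_dir.
has_bgi <- function(bgenfile, bgi_dir = dirname(bgenfile)) {
  file.exists(file.path(bgi_dir, paste0(basename(bgenfile), ".bgi")))
}

# Demonstration with temporary files instead of real UKBB data.
tmp_bgen <- tempfile(fileext = ".bgen")
file.create(tmp_bgen)
has_bgi(tmp_bgen)                      # no index yet

file.create(paste0(tmp_bgen, ".bgi"))
has_bgi(tmp_bgen)                      # index now present
```

For the subset above, the call would be `has_bgi("ukb_chr21_v3_10417.bgen")`.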

I found that some of the list_snp_id entries looked strange, such as 21_19467441_CAA_C. So I used only one normal-looking SNP ID to read the BGEN file, but I still got the error.

```r
###############################################################
# Trial 3: only include 1 SNP
###############################################################
snp1 <- list_snp_id[[1]][1]  # snp1 = "21_9411239_G_A"
snp1 <- as.list(snp1)

rds <- snp_readBGEN(
  bgenfiles   = "ukb_chr21_v3_10417.bgen",
  backingfile = "test4",
  list_snp_id = snp1,
  ncores      = nb_cores()
)
# this got a "session aborted" error
```

I have no idea how to fix the problem. Was my list_snp_id wrong?

Thanks, Monica

privefl commented 3 years ago

Can you try specifying a subset of individuals? And also with ncores = 1. Are you using some specific environment, like conda?

xscapex commented 3 years ago
  1. I tried ind_row = 1 and ncores = 1; it still doesn't work.

rds <- snp_readBGEN( bgenfiles="ukb_chr21_v3_10417.bgen",backingfile = "F:\\readBGEN\\test4",ncores = 1,list_snp_id=snp1,ind_row = 1)

  2. No, I'm not using conda or any other specific environment. Below are my R and package versions.

image

privefl commented 3 years ago

Then, last thing, can you try with a relative path, e.g. "test4" instead of "F:\\readBGEN\\test4"?

xscapex commented 3 years ago

rds <- snp_readBGEN( bgenfiles="ukb_chr21_v3_10417.bgen",backingfile = "test4",ncores = 1,list_snp_id=snp1,ind_row = 1)

I still got the session error. Maybe I should use an R 3.x.x version?

privefl commented 3 years ago

You can try, but the version that you're using is the CRAN version, right? This is checked on all kinds of architectures and the newest versions of R.

privefl commented 3 years ago

Could it be due to a corrupted file like in https://github.com/privefl/bigsnpr/issues/212?

privefl commented 3 years ago

Any update on this?

xscapex commented 3 years ago

Hi Florian,

Thank you so much for the reply, I’ll try it next week and update here!

privefl commented 3 years ago

Any update on this?

xscapex commented 3 years ago

Hi Florian,

Sorry for the late reply. I have checked the .bgen files, and their md5 checksums are the same as the ones from UKBB (see the attached file). I still can't figure out why snp_readBGEN() could not generate the .rds file. Maybe I should remove the missing value before I use snp_readBGEN().

image

privefl commented 3 years ago

Can you remind me of the exact issue you have?

What do you mean exactly by "Maybe I should remove the missing value before I use snp_readBGEN()."?

privefl commented 3 years ago

Looking at the md5sums, it seems the one for chromosome 5 is different; it should be 45f95365b17d4530a42faf95a70deddd.
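The checksum comparison can be scripted rather than read off a screenshot. This is a minimal sketch using `tools::md5sum()` from base R; `md5_matches` is a hypothetical helper, and the file name in the final comment is an assumption that follows the naming of the chromosome 21 file.

```r
# Hypothetical helper: compare a file's MD5 checksum with the expected value.
md5_matches <- function(path, expected) {
  unname(tools::md5sum(path)) == tolower(expected)
}

# Demonstration on an empty temporary file, whose MD5 is a well-known constant.
tmp <- tempfile()
file.create(tmp)
md5_matches(tmp, "d41d8cd98f00b204e9800998ecf8427e")

# For the real data, something like (file name is an assumption):
# md5_matches("ukb_chr5_v3.bgen", "45f95365b17d4530a42faf95a70deddd")
```

Hashing a full BGEN file reads the whole file, so this can take a while on UKBB-sized data.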