pcarbo / varbvs

Large-scale Bayesian variable selection for R and MATLAB.
http://pcarbo.github.io/varbvs
GNU General Public License v3.0

Can you support .bed or .bgen files? #29

Closed: garyzhubc closed this issue 3 years ago

garyzhubc commented 3 years ago

Can you support .bed or .bgen files? It is not possible to load the SNP data for an entire chromosome into an R matrix, and the program doesn't recognize a BEDMatrix object either; for example, is.na() or scale() don't work in this case. Would you be interested in supporting these file formats in the future?

pcarbo commented 3 years ago

@garyzhubc When you say it is not possible to load the SNP data for an entire chromosome into an R matrix, could you please clarify? Why is this not possible?

garyzhubc commented 3 years ago

The matrix takes up a lot of space. I've tried seqminer, and it takes 4 GB to store 1/250 of the dataset as a numeric matrix in R, even though the whole file on disk is much smaller, about 23 GB. A BEDMatrix object isn't accepted as input to varbvs. Other programs I use, like PLINK, support .bgen and .bed files, which makes the computation very easy. A feature supporting these file formats would be helpful, although I would still like to see whether there's another way to load the matrix into R efficiently. I have tried packages such as rbgen, bgen-reader, seqminer, BEDMatrix, bigsnpr, and gaston, but none worked.
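For reference, subsetting a BEDMatrix object returns an ordinary dense matrix, which is one way to materialize a manageable slice of the data for varbvs. A minimal sketch, assuming a hypothetical file chr1.bed and an arbitrary SNP range:

```r
library(BEDMatrix)

# Memory-map the .bed file; this does not load the genotypes into RAM.
geno <- BEDMatrix("chr1.bed")   # hypothetical path

# Subsetting a BEDMatrix returns an ordinary (dense) base R matrix,
# which is the kind of input varbvs expects for X.
X <- geno[, 1:1000]             # e.g., the first 1,000 SNPs
storage.mode(X) <- "double"     # convert from integer to numeric
```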

pcarbo commented 3 years ago

How many SNPs and samples are there in total?

garyzhubc commented 3 years ago

I have 92,546 samples and 571,622 SNPs. I requested 250 GB of memory on a full node, but that didn't work.

pcarbo commented 3 years ago

That is a large data set. However, I would expect 250 GB to be sufficient. Could you see if it works on a subset of the data, say, a subset of the samples? To reduce memory needs, make sure that you only have one copy of the genotype data in R at any one time.

Note that PLINK does SNP-by-SNP association analysis so the memory requirements for PLINK are not comparable.
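As an illustration of running varbvs on a subset of samples while keeping only one copy of the genotypes in memory, here is a minimal sketch; the objects X (genotype matrix) and y (phenotype vector) and the subset size are hypothetical:

```r
library(varbvs)

# Assume X is an n x p genotype matrix and y the phenotype vector.
set.seed(1)
idx <- sample(nrow(X), 10000)   # try, say, 10,000 of the 92,546 samples

# Reassign over X so the full matrix can be garbage-collected,
# leaving a single copy of the genotype data in memory.
X <- X[idx, ]
y <- y[idx]
gc()

fit <- varbvs(X, Z = NULL, y, family = "gaussian")
```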

garyzhubc commented 3 years ago

You're right. But I think there should theoretically be enough space for this. If we store each SNP dosage as a 4-byte float, the whole matrix only needs 92546 * 571622 * 4 bytes, which is about 211.6 GB in total. I also tried seqminer, but it uses more memory than it should. Is there a tool that people usually use to load a big SNP matrix into R?

pcarbo commented 3 years ago

Floating-point numbers in R take up 8 bytes, not 4; you can verify this with object.size(). So it looks like you will need closer to 400 GB of memory. Alternatively, you could consider taking a summary-statistics approach and apply a method such as susieR to the summary statistics.
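A quick check of the 8-byte storage and the resulting memory estimate, using the sample and SNP counts quoted above:

```r
# Each element of a double vector takes 8 bytes (plus a small object header).
x <- numeric(1e6)
object.size(x)        # about 8 MB for one million doubles

n <- 92546            # samples
p <- 571622           # SNPs
n * p * 8 / 1e9       # ~423 GB for a dense double-precision matrix
n * p * 4 / 1e9       # ~212 GB if single precision could be used
```

For the summary-statistics route, susieR's susie_rss() works from z-scores and an LD (correlation) matrix rather than the individual-level genotype matrix, so it avoids loading X entirely.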

garyzhubc commented 3 years ago

Thanks for the recommendation. I tried seqminer with 500 GB of memory, but that didn't work either. Also, it seems to me that seqminer is dropping some SNPs, which it shouldn't do. If I still want to try varbvs, what tool would you recommend for loading the data?

pcarbo commented 3 years ago

varbvs requires a matrix as input; if you load your data in another format, you will have to convert it to a matrix to use varbvs. My only other suggestion is that the MATLAB implementation of varbvs allows single-precision floating-point numbers (in R, only double precision is used), which would cut the memory requirements in half.