stephenslab / susieR

R package for "sum of single effects" regression.
https://stephenslab.github.io/susieR

How to analyze a 2500G dataset that cannot fit into R all at once? #120

Closed garyzhubc closed 3 years ago

garyzhubc commented 3 years ago

What should I do with a 2500G pgen file that cannot fit into R all at once? The experiment in the paper ran the analysis on a much smaller number of SNPs. susieR asks for an R matrix, but the full genotype matrix is too big to load.

stephens999 commented 3 years ago

For fine-mapping you should be looking at a small region of the genome; e.g., thousands up to ~10k SNPs would be typical. So the short answer is that you need to break the file up into smaller regions.
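A minimal sketch of the "break it into regions" idea, in Python under a few assumptions: SNP base-pair positions are already available (e.g., parsed from the fileset's `.pvar` file), and the positions and window size below are made up for illustration. The actual extraction of each window from a pgen fileset would typically be done with plink2 (`--chr`, `--from-bp`, `--to-bp`), one region at a time.

```python
# Hypothetical sketch: group sorted SNP base-pair positions on one
# chromosome into fixed-size, non-overlapping windows, so each window
# can be extracted and loaded into R separately.

def chunk_positions(positions, window_bp=2_000_000):
    """Group sorted bp positions into non-overlapping windows.

    Returns a list of (start_bp, end_bp, [positions]) tuples.
    """
    chunks = []
    current = []
    window_start = positions[0] if positions else 0
    for pos in positions:
        if pos >= window_start + window_bp:
            # close the current window and jump to the one containing pos
            chunks.append((window_start, window_start + window_bp, current))
            window_start += window_bp * ((pos - window_start) // window_bp)
            current = []
        current.append(pos)
    if current:
        chunks.append((window_start, window_start + window_bp, current))
    return chunks

# Toy positions (bp) on a single chromosome:
snps = [100, 500_000, 1_900_000, 2_100_000, 4_500_000]
for start, end, members in chunk_positions(snps):
    print(start, end, members)
```

Fixed windows are the simplest scheme; as discussed below, partitioning at approximately independent LD-block boundaries is usually preferable to arbitrary cut points.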

probalica19 commented 3 years ago

Hi @stephens999 ,

I am having the same problem as @garyzhubc: I am interested in a region that is a couple of megabases long, and I am working with a cohort of 500k individuals. How should I proceed with loading the data into R?

How do you recommend chunking the genotype data? Should I go by number of SNPs or by fixed ranges? Should my ranges overlap? What happens if SNPs from overlapping regions end up in different credible sets?

Thank you!

Best, probalica

gaow commented 3 years ago

@probalica19 you can partition the genome into non-overlapping chunks with negligible LD between them. See for example https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4731402/, which partitions the European genome into roughly 1700 such chunks.
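Once block boundaries are in hand, assigning each SNP to its LD block is a simple lookup. A hedged sketch (the boundary positions below are made-up placeholders, standing in for the published block coordinates from the linked paper):

```python
import bisect

# Hypothetical sketch: map SNP positions to precomputed, approximately
# independent LD blocks. The start positions below are placeholders,
# not real block boundaries.
block_starts = [1, 2_500_000, 6_100_000, 9_800_000]  # sorted block starts (bp)

def ld_block_index(pos):
    """Index of the LD block whose interval contains a bp position."""
    return bisect.bisect_right(block_starts, pos) - 1

snps = [500_000, 2_500_000, 7_000_000]
print([ld_block_index(p) for p in snps])  # one block index per SNP
```

Because the blocks are non-overlapping, each SNP lands in exactly one block, which sidesteps the question of a SNP appearing in credible sets from two overlapping regions.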

For large cohorts we suggest using the sufficient-statistics interface, susieR::susie_suff_stat, to reduce the complexity from a factor of N (500k) down to P. P should generally be smaller than N when you focus on fine-mapping common variants.