slowkow / proxysnps

:bookmark: Get SNP proxies from the 1000 Genomes Project.
MIT License

Memory usage #4

Open · TaruMuranen opened this issue 6 years ago

TaruMuranen commented 6 years ago

Hi,

I need to find proxies (r2 > 0.8) for about 8,000 SNPs in about 300 genomic regions (a region is defined so that the distance between consecutive SNPs is less than 1,000,000 bases). Calling get_proxies per SNP in a for loop or with apply uses an enormous amount of memory (10 GB is reached after about 50 SNPs). Apparently get_proxies calls get_vcf, which downloads huge data files from the web.
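For concreteness, the pattern I'm running looks roughly like this (a sketch; `my_snps` is a placeholder data frame with one row per index SNP, and `pop = "EUR"` is just an example):

```r
library(proxysnps)

# my_snps is a placeholder data frame with columns chrom and pos,
# one row per index SNP (~8000 rows total).
results <- vector("list", nrow(my_snps))
for (i in seq_len(nrow(my_snps))) {
  p <- get_proxies(chrom = my_snps$chrom[i], pos = my_snps$pos[i], pop = "EUR")
  # Keep only proxies with r2 > 0.8.
  results[[i]] <- p[p$R.squared > 0.8, ]
}
```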

Is there any way to free memory after each SNP? Or should I download all of the required data in advance and store it locally? How would I then run get_proxies?

Or would you suggest a better way of finding the proxies? The SNAP proxy search only has the 1000 Genomes pilot data, and LDlink does not appear suitable for this many SNPs. Both also restrict the width of the search region.

Best wishes

/tm

slowkow commented 6 years ago

Since you need proxies for 8000 SNPs, I would not recommend using proxysnps. It will download the same data and recompute the same statistics multiple times without caching any intermediate results.
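If you do end up sticking with proxysnps for a smaller batch, one possible stopgap (untested here, and it only helps when the exact same query recurs; overlapping regions are still re-downloaded) is to wrap get_proxies with the memoise package:

```r
library(proxysnps)
library(memoise)

# Repeated calls with identical arguments return the cached result
# instead of re-downloading and re-processing the VCF data.
get_proxies_cached <- memoise(get_proxies)

p <- get_proxies_cached(chrom = "12", pos = 583090, pop = "EUR")  # downloads
p <- get_proxies_cached(chrom = "12", pos = 583090, pop = "EUR")  # from cache
```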

As you suggested, I would recommend downloading all of the genotype data and storing it locally. Right now, get_proxies() does not support querying local files, but this feature should be easy to add. If I find the time to add this feature, I'll reply to this issue and let you know.
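Until then, here is a rough sketch of that local workflow in base R. The FTP path and file name are from memory of the phase 3 release (please verify them on the 1000 Genomes FTP site), the coordinates reuse the example from the proxysnps README, and the parsing assumes biallelic phased genotypes and tabix on your PATH:

```r
base <- "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502"
f    <- "ALL.chr12.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"

# 1. Download each chromosome and its tabix index once.
download.file(file.path(base, f), f, mode = "wb")
download.file(file.path(base, paste0(f, ".tbi")), paste0(f, ".tbi"), mode = "wb")

# 2. Extract one region; tabix reads only the needed slice of the file.
region <- "12:483090-683090"
lines  <- system2("tabix", c(f, region), stdout = TRUE)

# 3. Convert phased genotypes ("0|0", "0|1", "1|1") to alt-allele
#    dosages. Columns 1-9 are the fixed VCF fields; the rest are samples.
#    This counts only allele "1", so multi-allelic sites are not handled.
fields <- strsplit(lines, "\t", fixed = TRUE)
dosage <- t(sapply(fields, function(x) {
  gt <- substr(x[-(1:9)], 1, 3)  # the genotype is the first subfield
  as.integer(substr(gt, 1, 1) == "1") + as.integer(substr(gt, 3, 3) == "1")
}))

# 4. Compute r^2 between the index SNP and every variant in the region.
#    Monomorphic variants produce NA (zero variance) with a warning.
pos   <- as.integer(sapply(fields, `[`, 2))
index <- which(pos == 583090)[1]
r2    <- as.vector(cor(t(dosage), dosage[index, ]))^2
```

Since the per-chromosome files are downloaded once up front, each of your ~300 regions then only costs a local tabix query plus one call to cor().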

For now, here's another approach that you might consider:

https://gist.github.com/slowkow/3d13aa44cf4f65ca9ad2a0570346ba05

TaruMuranen commented 6 years ago

Thanks for sharing your code. I'll try this.

slowkow commented 6 years ago

You're very welcome! Please let me know if you run into any issues.