Closed · rmporsch closed this issue 6 years ago
Thanks for the interest and the usage example!
The reason is the chunk size. In your use case, dask was reading all the chunks, which in practice meant the whole file was being read. I decreased the chunk size to accommodate your use case: peak memory usage went from 20G to 7G across dask versions 0.12 through 0.16. The pandas-plink release 1.2.17 containing the fix should be on PyPI in an hour or so. Please, let me know how it goes.
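The effect can be sketched with a toy model of chunked reads. Everything below is illustrative (the helper function, the row counts, and the chunk sizes are made up for the example) and is not pandas-plink's actual chunking logic:

```python
# Toy model: a chunked reader must materialize every chunk that
# overlaps the requested rows, so chunk size bounds peak memory.

def rows_materialized(n_rows, chunk_rows, wanted_rows):
    """Rows a chunked reader loads: all rows of every touched chunk."""
    touched_chunks = {r // chunk_rows for r in wanted_rows}
    return len(touched_chunks) * chunk_rows

n_rows = 1_000_000            # variants in the (hypothetical) file
wanted = range(100)           # the small subset actually requested

# One file-sized chunk: any access reads the whole file.
print(rows_materialized(n_rows, n_rows, wanted))    # 1000000
# Smaller chunks: only the chunk covering the subset is read.
print(rows_materialized(n_rows, 10_000, wanted))    # 10000
```

With the file-sized chunk, slicing 100 rows still pulls in all 1,000,000; with 10,000-row chunks, the same slice touches a single small chunk, which is the same mechanism behind the 20G-to-7G drop described above.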
When I try to load a small subset of the 1000 Genomes data, my memory blows up during the import. In particular, it seems `dask` uses all available threads to spawn Python sessions, which in turn import the whole dataset. I am not sure how to solve this issue. Do I need to spawn a local cluster before I start the import?
I would be thankful for any help!
See below for a reproducible example. The used data is available at ftp://climb.genomics.cn/pub/10.5524/100001_101000/100116/1kg_phase1_chr2.tar.gz.
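The intended behavior (selecting a subset without materializing the whole file) can be illustrated with a stdlib-only sketch. The record size and file layout below are made up for the example and do not reflect the real `.bed` format; pandas-plink achieves the same effect through dask's lazy chunked arrays:

```python
import os
import tempfile

BYTES_PER_VARIANT = 64  # illustrative record size, not the real .bed layout

def read_variants(path, indices):
    """Seek to and read only the records for the requested variants."""
    records = []
    with open(path, "rb") as f:
        for i in indices:
            f.seek(i * BYTES_PER_VARIANT)
            records.append(f.read(BYTES_PER_VARIANT))
    return records

# Write a dummy 10,000-variant file, then read a 3-variant subset.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for i in range(10_000):
        tmp.write(i.to_bytes(2, "little") * (BYTES_PER_VARIANT // 2))
    path = tmp.name

subset = read_variants(path, [0, 5000, 9999])
os.remove(path)
print(len(subset))  # 3 records read, not the whole 640 kB file
```

Only the three requested records are pulled into memory; the rest of the file is never read, which is the behavior one would expect when slicing a small region out of a full chromosome.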