limix / pandas-plink

PLINK reader for Python.
MIT License

Memory explosion #5

Closed: rmporsch closed this issue 6 years ago

rmporsch commented 6 years ago

When I try to load a small subset of the 1000 Genomes data, my memory usage blows up during the import. In particular, dask seems to use all available threads to spawn Python sessions, which in turn import the whole dataset.

I am not sure how to solve this issue. Do I need to spawn a local cluster before I start the import?

I would be thankful for any help!

See below for a reproducible example. The data used is available at ftp://climb.genomics.cn/pub/10.5524/100001_101000/100116/1kg_phase1_chr2.tar.gz.

import numpy as np
from pandas_plink import read_plink

# Load variant annotations (bim), sample annotations (fam), and the
# genotype matrix (bed) as a lazy dask array.
(bim, fam, bed) = read_plink('data/genotypes/1kg_phase1_chr2')
# Pick 1000 random variant indices and materialize only those rows.
rand = np.random.choice(bim.i.values, 1000)
X = bed[rand, :].compute()
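
A minimal sketch of one way to rule out the thread-per-chunk behavior, assuming a dask version that supports dask.config.set (older 0.x releases used dask.set_options instead): force the synchronous scheduler so chunks are read one at a time rather than in parallel across every core.

import dask

# Run the graph on the synchronous (single-threaded) scheduler so
# chunk reads happen sequentially instead of on all available threads.
with dask.config.set(scheduler='synchronous'):
    X = bed[rand, :].compute()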
horta commented 6 years ago

Thanks for the interest and the usage example!

The reason is the chunk size: in your use case, dask is reading every chunk, which in practice means the whole file is read. I decreased the chunk size to accommodate your use case; peak memory usage went from 20 GB to 7 GB with dask versions 0.12 through 0.16. pandas-plink version 1.2.17, which contains the fix, should be on PyPI in an hour or so. Please let me know how it goes.
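
To illustrate the chunk-size point with a toy stand-in (hypothetical shapes and chunk sizes; da.zeros stands in for the lazy bed matrix that read_plink returns):

import numpy as np
import dask.array as da

# Toy stand-in for the bed matrix: 100,000 variants x 1,000 samples.
# With 1,024-row chunks, selecting 1,000 random rows touches at most
# 1,000 chunks; with one giant chunk, the same selection would force
# the entire matrix into memory before slicing.
bed_like = da.zeros((100_000, 1_000), chunks=(1024, 1_000), dtype='float32')
rows = np.random.choice(bed_like.shape[0], 1000, replace=False)
subset = bed_like[rows, :].compute()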