limix / pandas-plink

PLINK reader for Python.
MIT License

Memory explosion #5

Closed: rmporsch closed this issue 6 years ago

rmporsch commented 6 years ago

When I try to load a small subset of the 1000 Genomes data, my memory usage blows up during the import. In particular, dask seems to use all available threads to spawn Python sessions, which in turn import the whole dataset.

I am not sure how to solve this issue. Do I need to spawn a local cluster before I start the import?

I would be thankful for any help!

See below for a reproducible example. The data used is available at ftp://climb.genomics.cn/pub/10.5524/100001_101000/100116/1kg_phase1_chr2.tar.gz.

import numpy as np
from pandas_plink import read_plink

# Load variant annotations (bim), sample annotations (fam), and the
# genotype matrix (bed) as a lazy dask array.
(bim, fam, bed) = read_plink('data/genotypes/1kg_phase1_chr2')
# Pick 1000 random variant indices and materialize only those rows.
rand = np.random.choice(bim.i.values, 1000)
X = bed[rand, :].compute()
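
A minimal sketch of one way to rule out the thread-per-chunk behavior, assuming a dask version that supports dask.config.set (older 0.x releases used dask.set_options instead): force the synchronous scheduler so chunks are read one at a time rather than in parallel across every core.

import dask

# Run the graph on the synchronous (single-threaded) scheduler so
# chunk reads happen sequentially instead of on all available threads.
with dask.config.set(scheduler='synchronous'):
    X = bed[rand, :].compute()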
horta commented 6 years ago

Thanks for the interest and the usage example!

The reason is the chunk size: in your use case, dask is reading every chunk, which in practice means the whole file is read. I decreased the chunk size to accommodate your use case; peak memory usage went from 20 GB to 7 GB with dask versions 0.12 through 0.16. pandas-plink version 1.2.17, which contains the fix, should be on PyPI in an hour or so. Please let me know how it goes.
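
To illustrate the chunk-size point with a toy stand-in (hypothetical shapes and chunk sizes; da.zeros stands in for the lazy bed matrix that read_plink returns):

import numpy as np
import dask.array as da

# Toy stand-in for the bed matrix: 100,000 variants x 1,000 samples.
# With 1,024-row chunks, selecting 1,000 random rows touches at most
# 1,000 chunks; with one giant chunk, the same selection would force
# the entire matrix into memory before slicing.
bed_like = da.zeros((100_000, 1_000), chunks=(1024, 1_000), dtype='float32')
rows = np.random.choice(bed_like.shape[0], 1000, replace=False)
subset = bed_like[rows, :].compute()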