related-sciences / gwas-analysis

GWAS data analysis experiments
Apache License 2.0

Read BGEN files #36

Closed tomwhite closed 4 years ago

tomwhite commented 4 years ago

This change adds support for reading a BGEN file as a GenotypeDosageDataset object.

It uses the PyBGEN library, which is pure Python, so it may need work to optimize for large BGEN files (we shall see). The advantage over bgen-reader is that PyBGEN uses BGEN index files, whereas bgen-reader uses its own 'metafile'. The main problem I saw with bgen-reader is that it opens the file anew for every variant it reads, while PyBGEN opens it once per batch of variants being read and uses the index to seek to each variant.
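To illustrate the access pattern described above (one file open per batch, with an index lookup to seek to each record), here is a minimal sketch using only the standard library. It does not use PyBGEN or read real BGEN data: the data file holds toy payloads, and the helper names `build_toy_files` and `read_batch` are hypothetical. The SQLite table and column names (`Variant`, `file_start_position`, `size_in_bytes`) are an assumption loosely modelled on how a BGEN index stores byte offsets, not a verified schema.

```python
import os
import sqlite3
import tempfile


def build_toy_files(directory):
    """Write a toy data file plus a SQLite index of byte offsets (not real BGEN)."""
    data_path = os.path.join(directory, "toy.dat")
    index_path = data_path + ".idx"
    records = [(f"rs{i}", f"payload-{i}".encode()) for i in range(5)]

    conn = sqlite3.connect(index_path)
    # Assumed, simplified schema: one row per variant with its offset and length.
    conn.execute(
        "CREATE TABLE Variant ("
        "rsid TEXT, file_start_position INTEGER, size_in_bytes INTEGER)"
    )
    with open(data_path, "wb") as f:
        for rsid, payload in records:
            conn.execute(
                "INSERT INTO Variant VALUES (?, ?, ?)",
                (rsid, f.tell(), len(payload)),
            )
            f.write(payload)
    conn.commit()
    conn.close()
    return data_path, index_path


def read_batch(data_path, index_path, rsids):
    """Open the data file once, then seek to each requested record via the index."""
    conn = sqlite3.connect(index_path)
    placeholders = ",".join("?" * len(rsids))
    rows = conn.execute(
        "SELECT rsid, file_start_position, size_in_bytes FROM Variant "
        f"WHERE rsid IN ({placeholders}) ORDER BY file_start_position",
        list(rsids),
    ).fetchall()
    conn.close()

    batch = {}
    with open(data_path, "rb") as f:  # a single open for the whole batch
        for rsid, start, size in rows:
            f.seek(start)  # jump straight to the record, no scanning
            batch[rsid] = f.read(size)
    return batch


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data_path, index_path = build_toy_files(tmp)
        print(read_batch(data_path, index_path, ["rs1", "rs3"]))
```

Sorting the index rows by offset keeps the seeks moving forward through the file, which tends to be friendlier to the OS page cache than jumping around at random.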

tomwhite commented 4 years ago

BTW I spent some time today using this code to load some larger files. I was able to convert a BGEN file with 100K variants and 1000 samples to Zarr. I also converted 1KG ChrX, which took about 5 minutes on my 4-core machine.

So I think this can be merged if you are happy with it now. (BTW I don't know if I have commit permissions yet.)

eric-czech commented 4 years ago

Added write permission for you @tomwhite