poseidon-framework / poseidon-hs

A toolset to work with modular genotype databases in the Poseidon format
https://poseidon-framework.github.io/#/trident
MIT License

Allow reading genotype data in compressed archives #237

Open nevrome opened 1 year ago

nevrome commented 1 year ago

Maybe this could be implemented with something like pipes-zlib. It would allow for even smaller file sizes, which in turn would simplify and speed up a lot of our operations.

Ideally poseidon-hs should recognize .[bed|bim|geno|snp].gz suffixes in file names and stream the respective files accordingly when reading a package.
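To make the suffix idea concrete, here is a minimal sketch of how such file names could be classified. The types `Compression` and `GenoFileKind` and the function `classifyGenoFile` are hypothetical names made up for illustration; none of them exist in poseidon-hs. It only depends on `base`.

```haskell
import Data.List (isSuffixOf, stripPrefix)

-- Hypothetical types for illustration only, not part of poseidon-hs.
data Compression = Plain | Gzip deriving (Eq, Show)
data GenoFileKind = Bed | Bim | Geno | Snp deriving (Eq, Show)

-- Remove a suffix from a string, if present.
stripSuffix :: String -> String -> Maybe String
stripSuffix suf s = reverse <$> stripPrefix (reverse suf) (reverse s)

-- Classify a path by its (optionally gzipped) genotype suffix.
-- Returns Nothing for anything that is not .bed/.bim/.geno/.snp[.gz].
classifyGenoFile :: FilePath -> Maybe (GenoFileKind, Compression)
classifyGenoFile path = fmap (\k -> (k, comp)) (kindOf base)
  where
    -- Peel off a trailing ".gz" first, then inspect what remains.
    (base, comp) = case stripSuffix ".gz" path of
        Just b  -> (b, Gzip)
        Nothing -> (path, Plain)
    kindOf p
        | ".bed"  `isSuffixOf` p = Just Bed
        | ".bim"  `isSuffixOf` p = Just Bim
        | ".geno" `isSuffixOf` p = Just Geno
        | ".snp"  `isSuffixOf` p = Just Snp
        | otherwise              = Nothing
```

A reader could then dispatch on the `Compression` value to decide whether to wrap the file stream in a decompression step.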

I suggest we play around with this here to see if it's possible and feasible. Later we could consider adding it to the standard.

stschiff commented 1 year ago

Yes. Note that the last time I tried it, pipes-zlib sadly suffered from this bug: https://github.com/k0001/pipes-zlib/issues/16, which was actually a bug in another library upstream. I ended up decompressing directly from a lazy ByteString (https://hackage.haskell.org/package/zlib-0.6.3.0/docs/Codec-Compression-Zlib.html) and then piping the result through a suitable Pipes.Parser. So it is definitely possible, but it also definitely requires some playing around.
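A rough sketch of that workaround, assuming the `zlib`, `bytestring` and `pipes` packages are available. It uses `Codec.Compression.GZip` rather than `Codec.Compression.Zlib`, because `.gz` files carry gzip framing. The function names `gunzipProducer` and `readGzChunks` are made up for this example, and the downstream Pipes.Parser stage is left out.

```haskell
import qualified Codec.Compression.GZip as GZip
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import Pipes (Producer, each)
import qualified Pipes.Prelude as P

-- Decompress a lazy gzip ByteString and expose the result as a
-- Producer of strict chunks. Decompression happens lazily, chunk by
-- chunk, as the Producer is consumed.
gunzipProducer :: Monad m => BL.ByteString -> Producer B.ByteString m ()
gunzipProducer lbs = each (BL.toChunks (GZip.decompress lbs))

-- Hypothetical helper: read a .gz file lazily and return the chunk
-- Producer. This relies on lazy IO, so the file handle stays open
-- until the stream has been fully consumed.
readGzChunks :: FilePath -> IO (Producer B.ByteString IO ())
readGzChunks path = gunzipProducer <$> BL.readFile path

-- Round-trip helper for experimenting in GHCi: compress some bytes,
-- stream them back through the Producer and reassemble them.
roundTrip :: BL.ByteString -> B.ByteString
roundTrip = B.concat . P.toList . gunzipProducer . GZip.compress
```

The resulting Producer could then be fed into whatever parser handles the uncompressed format. Corrupt input surfaces as an exception thrown from pure code by `GZip.decompress`, which is one of the things that would need some care in a real implementation.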