lmdu / pyfastx

a python package for fast random access to sequences from plain and gzipped FASTA/Q files
https://pyfastx.readthedocs.io
MIT License

Support xz-compressed files #46

Open joverlee521 opened 2 years ago

joverlee521 commented 2 years ago

Hello @lmdu,

I am planning to use pyfastx within Nextstrain's Augur to support a new data curation command, and support for xz-compressed files would be really helpful. Would you be open to extending pyfastx to read them?

Groups working with large files are using xz to save space because xz has a better compression ratio than gzip. For example, Nextstrain hosts a file of all GenBank SARS-CoV-2 genomes that is xz-compressed.

Random access to xz-compressed files is possible, provided the file was originally compressed in multiple short blocks. python-xz is an example of this in pure Python and xz-random-access is an example of this in C.
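
To illustrate why the block structure matters, here is a minimal sketch that uses only Python's standard `lzma` module. It is not how python-xz or xz-random-access work internally. Those tools read the block index stored inside a single xz stream, whereas this sketch writes each block as an independent xz stream and keeps its own in-memory index. The names `compress_in_blocks` and `read_range` are hypothetical and exist only for this example.

```python
import bisect
import lzma


def compress_in_blocks(data: bytes, block_size: int):
    """Compress `data` as a series of independent xz streams.

    Returns the concatenated compressed bytes and an index of
    (uncompressed_start, compressed_start, compressed_length) tuples.
    Both the layout and the index format are assumptions for this sketch.
    """
    blob = bytearray()
    index = []
    for start in range(0, len(data), block_size):
        chunk = lzma.compress(data[start:start + block_size])
        index.append((start, len(blob), len(chunk)))
        blob += chunk
    return bytes(blob), index


def read_range(blob: bytes, index, offset: int, length: int) -> bytes:
    """Return `length` uncompressed bytes starting at `offset`,
    decompressing only the blocks that overlap the requested range."""
    if not index or length <= 0:
        return b""
    starts = [entry[0] for entry in index]
    # Find the last block whose uncompressed start is <= offset.
    i = bisect.bisect_right(starts, offset) - 1
    out = bytearray()
    while i < len(index) and len(out) < length:
        u_start, c_start, c_len = index[i]
        block = lzma.decompress(blob[c_start:c_start + c_len])
        # Only the first touched block needs a leading skip.
        skip = max(0, offset - u_start)
        out += block[skip:skip + length - len(out)]
        i += 1
    return bytes(out)
```

With short blocks, a request for one sequence only touches the one or two blocks that contain it. A file compressed as a single large block would have to be decompressed from the start to reach the same bytes.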

Thank you!

lmdu commented 1 year ago

Thank you! In the future, I will consider adding support for parsing xz-compressed FASTA/Q files.

corneliusroemer commented 1 year ago

Great to hear @lmdu! We are slowly moving from xz to zstd because it compresses and decompresses faster with no loss in compression ratio compared to xz.

Just like xz random access, zstd random access seems to be possible as well. I've found these resources: