lh3 / minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences
https://lh3.github.io/minimap2

feat: zstd support #972

Closed j23414 closed 2 years ago

j23414 commented 2 years ago

I see that minimap2 accepts gzip-compressed files:

./minimap2 -ax map-pb ref.fa pacbio.fq.gz > aln.sam  

Wondered if there were any plans to support zstd compressed files?

./minimap2 -ax map-pb ref.fa pacbio.fq.xz > aln.sam  

Sorry if this is already a feature and I missed it somehow. For a different project, zstd compression benchmarks seemed very fast.

zstd can be 4-10x faster than xz at decompressing sequences, with equal compression ratios.

from https://github.com/nextstrain/ncov-ingest/issues/341

Just checking, no pressure, I didn't find an answer with a quick search through issues.

corneliusroemer commented 2 years ago

zstd is extremely good at compressing sequence alignment files, such as those for SARS-CoV-2 or monkeypox, which contain thousands of similar sequences with lengths >20kB.

Its compression ratios are as good as xz's, but it is much faster at both compressing and decompressing.

@j23414 I think you may have mistyped in your second example. I see you have a pacbio.fq.xz file there but may have meant pacbio.fq.zst?

lh3 commented 2 years ago

Please use unix pipes:

xz -dc reads.fq.xz | minimap2 -ax map-pb ref.fa - > aln.sam

j23414 commented 2 years ago

ah got it, thanks:

zstd -d -c reads.fq.zst | minimap2 -ax map-pb ref.fa - > aln.sam
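
The pipe approach above works for any compressor that can write to stdout. Below is a minimal sketch that picks the decompressor from the file extension. The helper name `stream_reads` is hypothetical (it is not part of minimap2), and the sketch assumes `gzip`, `xz`, and `zstd` are installed:

```shell
#!/bin/sh
# Hypothetical helper, not part of minimap2: write the contents of a reads
# file to stdout, choosing the decompressor from the file extension.
stream_reads() {
    case "$1" in
        *.gz)  gzip -d -c "$1" ;;   # gzip-compressed
        *.xz)  xz -d -c "$1" ;;     # xz-compressed
        *.zst) zstd -d -c "$1" ;;   # zstd-compressed
        *)     cat "$1" ;;          # assume uncompressed
    esac
}

# Usage (assumes minimap2 is on PATH; "-" tells it to read from stdin):
# stream_reads reads.fq.zst | minimap2 -ax map-pb ref.fa - > aln.sam
```

Since the choice is based only on the extension, a misnamed file will be passed to the wrong tool. Checking the file's magic bytes would be more robust but is left out here for brevity.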