nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Request: Support for Gzip #1023

Open GabeAl opened 1 week ago

GabeAl commented 1 week ago

Issue Report

Please describe the issue:

Gzip is, by far, the most commonly used format to compress sequencing data, including fastq files generated by your platform. gzip files are/were provided to us from 4/5 of the vendors I've worked with for bacterial sequencing (the last was uncompressed), and all academic cores I've worked with.

The fact that this software only supports the (relatively) esoteric bgzip, which provides only marginal space advantages (the ratio goes from 30% to 20% iirc, which is not worth the time and compatibility cost since storage is cheap), makes this decision odd to me. This is particularly surprising given that system libraries for gzip are available on all modern platforms and that zlib and its derivatives are easy to integrate.

I hope the developers will rethink their stance and appreciate that the landscape of compressed data overwhelmingly favors .gz in the majority of contexts.

Steps to reproduce the issue:

Use any gzipped inputs.

svc-jstone commented 1 week ago

Hi @GabeAl,

It's not clear which specific tool you are referring to.

In dorado correct, for example, we cannot use the common gzip compression because it does not allow the indexing (as you point out in the title) needed for random access to sequences, which is crucial for correction. This is a key feature that bgzip provides.

Best regards, JS.
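
To illustrate the random-access point: a plain gzip stream has to be decompressed from the start, whereas a file written as a series of independently compressed gzip members can be entered at any member boundary, and ordinary gzip readers still accept it as one multi-member stream. BGZF files are likewise valid multi-member gzip files. The sketch below shows only that general idea in Python. It is not the actual BGZF layout or dorado's code, and `write_blocks` and `read_block` are hypothetical helper names.

```python
import gzip
import io
import zlib


def write_blocks(records, per_block=2):
    """Compress records as independent gzip members; return (data, offsets)."""
    buf = io.BytesIO()
    offsets = []
    for i in range(0, len(records), per_block):
        offsets.append(buf.tell())  # byte offset where this member starts
        chunk = "".join(records[i:i + per_block]).encode("ascii")
        buf.write(gzip.compress(chunk))
    return buf.getvalue(), offsets


def read_block(data, offset):
    """Decompress only the gzip member that starts at `offset`."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    return decomp.decompress(data[offset:]).decode("ascii")


records = [f"@read{i}\nACGT\n+\nIIII\n" for i in range(6)]
data, offsets = write_blocks(records)

# Any gzip reader still sees one ordinary (multi-member) gzip stream.
whole = gzip.decompress(data).decode("ascii")

# With the offsets table, the third block is read without touching the first two.
third = read_block(data, offsets[2])
```

The offsets table plays the role of an index here. A single-member gzip file offers no such entry points, which is why it cannot be indexed in this way.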

sklages commented 1 week ago

Additionally, bgzip is pretty standard in the field of sequencing data, e.g. samtools/htslib use bgzip as well. See http://www.htslib.org/doc/bgzip.html

Illumina also uses bgzip for fastq compression with bcl-convert/bcl2fastq AFAIK on Linux. Element Biosciences (Aviti) interestingly does not :-)

The level of compression can be set by (b)gzip with "--compress-level". It is not per se a question of the program used. No idea which level is set by dorado (probably a "lighter" one).
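
As a small illustration that the compressed size depends on the chosen level rather than on the tool's name, the sketch below compresses the same data at three levels with Python's standard `gzip` module. The FASTQ-like input is synthetic, so the sizes it prints say nothing about real reads or about the level dorado uses.

```python
import gzip

# Synthetic FASTQ-like text; real reads would compress differently.
fastq = "".join(
    f"@read{i}\n{'ACGTTGCA' * 10}\n+\n{'I' * 80}\n" for i in range(1000)
).encode("ascii")

sizes = {}
for level in (1, 6, 9):
    packed = gzip.compress(fastq, compresslevel=level)
    assert gzip.decompress(packed) == fastq  # same content at every level
    sizes[level] = len(packed)

for level, size in sizes.items():
    print(f"level {level}: {size} bytes ({size / len(fastq):.1%} of original)")
```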

GabeAl commented 1 week ago

Thanks for your comment. Yes, I am aware of this. However, the microbiome field (including the CROs, partners, and clients we deal with) has limited use for htslib. Most files are fastq.gz, fasta.gz, etc. Very rarely do we even encounter bam files. Even for base calling, we output fastq.

Support for gzip is the most democratic and field-spanning option. It is also straightforward to implement from ONT's side, but not straightforward for us to change a whole field's preferences, nor to recompress petabytes of existing datasets to satisfy one tool chain (ONT) while losing compatibility with almost all others in our field.

sklages commented 1 week ago

Which software are you actually referring to?

If you need random/parallel access to compressed data files (in many cases simply to save time) you cannot use gzip. We, as an academic sequencing core, provide sequence data files in the format produced by the manufacturer's software, with the default compression type and level. Illumina's bcl2fastq/bcl-convert uses bgzip compression, Aviti seems to use gzip (for some reason) and PacBio and ONT write uBAM (which is the best format for sequencing data storage IMHO).

> for us to change a whole field's preferences, nor to recompress petabytes of existing datasets to satisfy one tool chain (ONT) while losing compatibility with almost all others in our field.

I don't get this. Can you describe in more detail? Why should you be forced to "recompress petabytes of data to use one toolchain from ONT"? Which one?

GabeAl commented 1 week ago

Sure, but I think we're getting off topic a bit. I understand you are fine with bzip2 format, as are most "large organism" focused sequencing labs. I am simply requesting another format, gzip. I have many large bacterial community sequencing projects in .fastq.gz. I wouldn't personally want to trade more dependencies and opaque storage formats over easily parsable and widely supported fastq.gz. Most microbiome-centric tools don't recognize .bz2, from functional callers to annotators to microbiome-focused aligners, whole-genome taxonomy and completeness checkers, etc. I want to use the same format everywhere, especially since it is both the simplest and the most ubiquitous (fastq and fasta).

Yes, @svc-jstone, I am referring to dorado correct, but the same idea applies throughout the pipeline from fastq to medaka polished genome(s). There are workarounds for indexing gzipped files, ranging from simple to more complex. Simple ones include just extracting the file to a temporary location, indexing, and cleaning up later (this is what I end up having to do manually). Or reading the gzipped fastq into memory (we have plenty of RAM), which is even faster than repeatedly incurring random disk access + multiple rounds of bzip2 decompression per access. More complicated would be to relax the need for indexing in the first place: a 2-pass approach may work well, and much of the file will be in cache so we don't incur the I/O penalty twice (Pass 1: determine locations and mappings for pileup matrix, pass 2: apply correction and regurgitate reads, in order).

But if it's really not possible for certain stages without significant effort or tradeoffs (i.e. indexing is critical and the "30x human genome" folks can't withstand the temp space/RAM hit for auto-extraction of gzipped fastqs, for example), I can live with that.
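
The simplest workaround described above (extract to a temporary location, index, clean up later) can be sketched roughly as follows. This is only an illustration under stated assumptions: it expects strict four-line FASTQ records, it builds a plain in-memory name-to-offset table rather than any index format dorado actually reads, and `extracted_with_index` and `fetch` are hypothetical names.

```python
import gzip
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def extracted_with_index(gz_path):
    """Yield (plain_fastq_path, {read_name: byte_offset}); clean up afterwards."""
    tmp_dir = tempfile.mkdtemp(prefix="fastq_extract_")
    plain_path = os.path.join(tmp_dir, "reads.fastq")
    try:
        # Step 1: extract to a temporary location.
        with gzip.open(gz_path, "rb") as src, open(plain_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        # Step 2: record the start offset of every 4-line record.
        index = {}
        with open(plain_path, "rb") as fh:
            while True:
                offset = fh.tell()
                header = fh.readline()
                if not header:
                    break
                name = header[1:].split()[0].decode("ascii")
                index[name] = offset
                for _ in range(3):  # skip sequence, '+', quality
                    fh.readline()
        yield plain_path, index
    finally:
        # Step 3: clean up the temporary copy.
        shutil.rmtree(tmp_dir, ignore_errors=True)


def fetch(plain_path, index, name):
    """Random access: seek straight to one record and return its sequence."""
    with open(plain_path, "rb") as fh:
        fh.seek(index[name])
        fh.readline()  # header line
        return fh.readline().strip().decode("ascii")
```

The cost is the one raised in the comment: temporary disk space equal to the uncompressed file, which may be acceptable for bacterial datasets and less so for very large ones.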

sklages commented 1 week ago

Bzip2 != bgzf. Bzip2 uses a different algorithm and is not compatible with gzip/bgzf.

AFAIK most sequence-related tools use bgzf/gzip and not bzip2. Dorado correct also uses bgzf.

GabeAl commented 1 week ago

Tools we use support fasta, fastq, and their gzip compressed versions. They do not support anything else. Apologies for the typo in my previous message, but thought I'd make that clear.

I'm a developer of microbiome tools and have published extensively in the field. I also developed my own compressed format for a tool of mine, called TCF. I know many microbiome-based tools, including my own, do not widely support compressed formats other than gzip. I do not understand what you're getting at.

Yes, dorado correct also has the problem, among many others.

[edit] Oops, I pasted a different error here by mistake: there is a race condition with the indexing when running multiple instances of dorado in parallel from the same working directory. The indices seem to overwrite each other and the job crashes and burns. Apparently they must be run serially?
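
If the jobs do have to run serially, one way to enforce that across independently launched processes is an advisory file lock. The sketch below is a generic POSIX-only workaround (it relies on `fcntl`), not a fix for the underlying race and not something dorado provides. `run_serialized` and the lock path are hypothetical.

```python
import fcntl
import subprocess


def run_serialized(cmd, lock_path, **kwargs):
    """Run cmd while holding an exclusive lock, so only one job runs at a time."""
    with open(lock_path, "w") as lock_file:
        # Blocks until every other holder of the lock has finished.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            return subprocess.run(cmd, check=True, **kwargs)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Each worker would wrap whatever dorado command line it already uses in `run_serialized(...)` with the same lock path. The jobs then queue up instead of overwriting each other's index files, at the price of losing the parallelism.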