marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License
523 stars 130 forks source link

feature request: ubam support? #757

Open eboyden opened 9 months ago

eboyden commented 9 months ago

Pretty self-explanatory. We're trying to eliminate the need to process data in fastq format, so it would be terrific if cutadapt could accept ubam input and write ubam output. (We probably don't need the ability to convert fastq to ubam or vice-versa, so mandating ubam in = ubam out would probably be fine, but we wouldn't say no to additional flexibility either.) And if this were implemented, it would be nice to be able to place trimmed sequences or other metadata into sam tags, similar to the current ability to modify fastq read names and/or comments.) Just a suggestion, thanks!

rhpvorderman commented 9 months ago

The latest version of dnaio (the library that cutadapt uses for sequence IO) has uBAM read support. So cutadapt can already read uBAM files.

Most aligners only support FASTQ format though, so there has been no need yet to add uBAM write support.

After implementing the uBAM reader, I have not grown very fond of the BAM format for unaligned pure sequencing data for these reasons:

  1. Overhead. Bam starts with a 36-byte field with all sorts of metadata stored in integers. Some of this data (sequence length) is useful, most of this metadata is related to alignment and therefore unused.
  2. The BAM tag format is extremely clunky to parse. You need to support all the different typecodes. Unlike parsing fields in a FASTQ header. this is a tremendous amount of work.
  3. While it stores the sequence in 4-bit format (for 16 IUPAC codes) and supposedly this leads to smaller data, in practice it does not. FASTQ data and BAM data are both compressed using DEFLATE (which is what is used in gzip) so the differences in filesizes are minimal. BAM is smaller in memory, but not smaller on disk. If you convert the IUPAC to a sequence anyway, the in-memory difference does not matter anymore.

So for pure sequencing data, uBAM adds a lot of parsing difficulties without offering any advantages. While TAGS may seem like a nice format for adding metadata, only very few Tag names that apply to pure sequencing data are anchored in the specification. So most sequencers will have custom named tags anyway. Well in that case a FASTQ header is much, much easier to parse.

Pretty self-explanatory. We're trying to eliminate the need to process data in fastq format ...

Well could you explain yourself a bit more on why you are doing this? What downstream programs that work pre-alignment do require uBAM or would function better with the functionality it offers compared to FASTQ?

marcelm commented 9 months ago

The latest version of dnaio (the library that cutadapt uses for sequence IO) has uBAM read support. So cutadapt can already read uBAM files.

To clarify: Only single-end reads can be read. This is supported from dnaio 1.1.0 onwards. Cutadapt itself does not need to be updated. Since uBAM can only be read, the output will currently be FASTQ with any BAM-specific metadata lost.

I am very much in favor of BAM-only workflows and would like to see Cutadapt support it. The minor inconveniences at parsing time (only relevant for us developers) don’t negate that uBAM is the clearly superior format (for users). Example: Cutadapt could add a @PG header to the output documenting the command that was run for improved reproducibility.

rhpvorderman commented 9 months ago

Ah yes, I did not consider the BAM header and the provenance argument. I am convinced.

Maybe we can create an issue at dnaio and discuss the technical implementation? The alternative is to defer to pysam. As someone who has tried to make an alternative to pysam I can say that full support of SAM/BAM/CRAM is a major undertaking. However, limited BAM support in context of unaligned data only seems feasible.

eboyden commented 9 months ago

Thanks to both of you for your quick and thoughtful responses. I didn't know that cutadapt can read ubam at all, so that's good to know, although it's unfortunate that currently the tag metadata are lost and only SE reads are supported. While I better appreciate the cons of ubams from a developer perspective now, as I'm sure you're aware fastqs have their own parsing issues for users - they use 4 lines to encode one "unit" of information, and storing metadata by concatenating them in the read comment is clunky to say the least (assuming you have a convenient means of putting them there in the first place - cutadapt is a rare software that has excellent functionality in that respect); and then retrieving those data typically requires an aligner that can move fastq comments into sam tags, of which there are also only a few (bwa, mm2, bowtie2, SNAP; incidentally the latter two can read ubams). And indeed, the provenance argument. Many other software tools are increasingly embracing and even preferring ubam for unaligned reads (e.g. fgbio). So I'd still like to see ubam write suuport someday, but I understand that it's easy to make suggestions, less so to implement them, so I appreciate your even taking it under consideration.

rhpvorderman commented 1 month ago

I am currently working with nanopore uBAM data. That contains tags with all sorts information on the Nanopore run as well as methylation data.

I do not want to lose this in the cutadapt step. So looks like I will have to work on BAM writing in dnaio very soon.