vdb_shotgun: add "ordered" flag to bbduk command

vdblab / vdblab-shotgun

Shotgun metagenomic sequencing processing pipeline

MIT License


vdb_shotgun: add "ordered" flag to bbduk command #15

Status: Closed (funnell closed this issue 1 year ago)

funnell commented 2 years ago

from https://www.biostars.org/p/225338/

If you want to clumpify data for compression, do it as early as possible (e.g. on the raw reads). Then run all downstream processing steps ensuring that read order is maintained (e.g. use the “ordered” flag if you use BBDuk for adapter-trimming) so that the clump order is maintained; thus, all intermediate files will benefit from the increased compression and increased speed.
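For concreteness, the workflow in that quote could be sketched roughly as below. This is only an illustration, not what the pipeline currently runs: the file names and the `clumpify_cmd` / `bbduk_trim_cmd` helpers are hypothetical, the reads are assumed to be single-end, and the trimming parameters are the ones commonly shown for adapter trimming in the BBDuk guide. The only part relevant to this issue is the `ordered=t` argument.

```python
import shlex


def clumpify_cmd(raw_fastq, clumped_fastq):
    """Build a clumpify.sh call that regroups reads so gzip compresses them better."""
    return ["clumpify.sh", f"in={raw_fastq}", f"out={clumped_fastq}"]


def bbduk_trim_cmd(in_fastq, out_fastq, adapters, threads=8):
    """Build a bbduk.sh adapter-trimming call that keeps the input read order."""
    return [
        "bbduk.sh",
        f"in={in_fastq}",
        f"out={out_fastq}",
        f"ref={adapters}",
        # Typical adapter-trimming settings; assumed here, not taken from the pipeline.
        "ktrim=r", "k=23", "mink=11", "hdist=1",
        f"threads={threads}",
        # The flag this issue is about: emit reads in the same order they were read.
        "ordered=t",
    ]


# Hypothetical file names, for illustration only.
step1 = clumpify_cmd("raw.fastq.gz", "clumped.fastq.gz")
step2 = bbduk_trim_cmd("clumped.fastq.gz", "trimmed.fastq.gz", "adapters.fa")
print(shlex.join(step1))
print(shlex.join(step2))
```

The commands are only built and printed here; passing the lists to `subprocess.run(..., check=True)` would execute them, assuming BBTools is on the `PATH`.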

nickp60 commented 2 years ago

Thanks for bringing this up; I'm hesitant to do this for a few reasons:

I don't see any examples in their docs of using both clumpify and reordering; they seem to use one or the other.
I'm confused by their best practices, which mention always running bbduk for adapter removal first but don't mention clumpify at all.
I don't know how seqkit split2 deals with ordering, but since they have a separate tool for shuffling, I am assuming that the splits are unshuffled and that each split will therefore contain groups of reads with the same prefixes. If we split files where the reads aren't randomly distributed, we are going to have a worse time merging the results from each split.
bbduk is fast enough already.

funnell commented 2 years ago

this is the "ordered" flag, not "reorder". this option would just make sure that bbduk outputs the reads in the same order they went in, maintaining the compression advantages of the clumpify-ordered fastq

When multiple threads are used, reads will not come out in the same order they went in, unless the “ordered” flag is used.
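As a rough analogy for why threading scrambles the order (this is not BBDuk's actual implementation, just a generic thread-pool sketch with a made-up `process` function): if results are written as soon as each worker finishes, the output order depends on timing, while collecting results in submission order keeps the input order.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def process(read_name):
    """Stand-in for per-read work such as trimming; the sleep mimics uneven run time."""
    time.sleep(random.uniform(0, 0.003))
    return read_name


read_names = [f"read{i}" for i in range(200)]

with ThreadPoolExecutor(max_workers=8) as pool:
    # Emit results as workers finish: order depends on timing, not on input order.
    futures = [pool.submit(process, name) for name in read_names]
    unordered = [f.result() for f in as_completed(futures)]

    # Emit results in submission order: same reads, input order preserved.
    ordered = list(pool.map(process, read_names))

print("ordered output matches input order:", ordered == read_names)
print("unordered output matches input order:", unordered == read_names)
```

Both lists contain the same reads; only the ordered one is guaranteed to match the input order, which is the behaviour the "ordered" flag is described as providing.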

I'm not sure I understand your seqkit split2 point; my understanding is that clumpify reorders the reads anyway, so they wouldn't be randomly distributed at that point?

nickp60 commented 2 years ago

I'm still confused by this. I don't see any examples in that biostars page or across github where people use ordered or reorder when using clumpify for deduplication. It seems like there are two applications of this tool: compression or deduplication. I may be interpreting it wrong though. Do you have any usage examples where people have used both?

funnell commented 2 years ago

my understanding is that the compression benefits come from clumpify reordering reads in a certain way. the "ordered" flag for BBDuk just forces it to output reads in the same order, maintaining the compression benefits of the clumpify-reordered reads.
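A toy illustration of that point, under loud assumptions: the "genome" and reads below are synthetic, sorting by start position is only a stand-in for clumpify's k-mer-based grouping, and a full shuffle is the worst case, so it likely overstates what unordered multi-threaded output would do in practice. It only shows that the same reads compress better when overlapping reads sit next to each other.

```python
import random
import zlib

READ_LEN = 100
rng = random.Random(0)

# Synthetic data: a random "genome" and reads sampled from it at about 10x coverage.
genome = "".join(rng.choices("ACGT", k=100_000))
starts = [rng.randrange(len(genome) - READ_LEN) for _ in range(10_000)]

# Stand-in for clumpify: put overlapping reads next to each other.
clumped = [genome[s:s + READ_LEN] for s in sorted(starts)]

# Worst case for lost ordering: the same reads, fully shuffled.
scrambled = clumped[:]
rng.shuffle(scrambled)


def compressed_size(reads):
    """Size in bytes of the reads after DEFLATE compression (the algorithm gzip uses)."""
    return len(zlib.compress("\n".join(reads).encode("ascii"), 6))


print("clumped order:  ", compressed_size(clumped), "bytes")
print("scrambled order:", compressed_size(scrambled), "bytes")
```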