vdblab / vdblab-shotgun

Shotgun metagenomic sequencing processing pipeline
MIT License

vdb_shotgun: add "ordered" flag to bbduk command #15

Closed funnell closed 1 year ago

funnell commented 2 years ago

from https://www.biostars.org/p/225338/

If you want to clumpify data for compression, do it as early as possible (e.g. on the raw reads). Then run all downstream processing steps ensuring that read order is maintained (e.g. use the “ordered” flag if you use BBDuk for adapter-trimming) so that the clump order is maintained; thus, all intermediate files will benefit from the increased compression and increased speed.
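As a rough illustration of the quoted advice, the two steps could be assembled like this. This is only a sketch: the file names are hypothetical, the trimming parameters (`ktrim=r k=23 mink=11 hdist=1 tpe tbo`) are generic BBDuk adapter-trimming settings rather than what this pipeline actually uses, and the helper functions `clumpify_cmd` and `bbduk_cmd` are made up for the example. The only point is where `ordered=t` would go.

```python
import shlex


def clumpify_cmd(r1, r2, out1, out2, dedupe=False):
    """Build a clumpify.sh command for paired reads (hypothetical helper)."""
    cmd = ["clumpify.sh", f"in={r1}", f"in2={r2}", f"out={out1}", f"out2={out2}"]
    if dedupe:
        # Optional: also remove duplicate reads while clumping.
        cmd.append("dedupe=t")
    return cmd


def bbduk_cmd(r1, r2, out1, out2, adapters, threads=8, ordered=True):
    """Build a bbduk.sh adapter-trimming command (hypothetical helper)."""
    cmd = [
        "bbduk.sh",
        f"in={r1}", f"in2={r2}",
        f"out={out1}", f"out2={out2}",
        f"ref={adapters}",
        # Assumed, generic adapter-trimming settings; not taken from this pipeline.
        "ktrim=r", "k=23", "mink=11", "hdist=1", "tpe", "tbo",
        f"threads={threads}",
    ]
    if ordered:
        # Keep output reads in input order so the clump order survives trimming.
        cmd.append("ordered=t")
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(shlex.join(clumpify_cmd("raw_R1.fq.gz", "raw_R2.fq.gz",
                                  "clumped_R1.fq.gz", "clumped_R2.fq.gz")))
    print(shlex.join(bbduk_cmd("clumped_R1.fq.gz", "clumped_R2.fq.gz",
                               "trimmed_R1.fq.gz", "trimmed_R2.fq.gz",
                               adapters="adapters.fa")))
```

The commands are only printed here, not executed, so the sketch can be read without BBTools installed.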

nickp60 commented 2 years ago

Thanks for bringing this up; I'm hesitant to do this for a few reasons:

funnell commented 2 years ago

this is the "ordered" flag, not "reorder". this option would just make sure that bbduk outputs the reads in the same order they went in, maintaining the compression advantages of the clumpify-ordered fastq

When multiple threads are used, reads will not come out in the same order they went in, unless the “ordered” flag is used.

I'm not sure I understand your seqkit split2 point. my understanding is that clumpify reorders the reads anyway, so they wouldn't be randomly distributed at that point?

nickp60 commented 2 years ago

I'm still confused by this. I don't see any examples in that biostars page or across github where people use ordered or reorder when using clumpify for deduplication. It seems like there are two applications of this tool: compression or deduplication. I may be interpreting it wrong though. Do you have any usage examples where people have used both?

funnell commented 2 years ago

my understanding is that the compression benefits come from clumpify reordering reads in a certain way. the "ordered" flag for BBDuk just forces it to output reads in the same order, maintaining the compression benefits of the clumpify-reordered reads.
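A toy way to see why read order matters for compression: the sketch below gzips the same set of simulated reads in two orders. It does not run clumpify. Sorting reads by their position of origin is used as a stand-in for clumpify's grouping of overlapping reads, and the random genome, read counts, and function names are all assumptions made for the illustration.

```python
import gzip
import random


def simulate_reads(genome_len=500_000, n_reads=10_000, read_len=100, seed=0):
    """Sample fixed-length reads from a random genome; returns (start, sequence) pairs."""
    rng = random.Random(seed)
    genome = "".join(rng.choices("ACGT", k=genome_len))
    starts = [rng.randrange(genome_len - read_len) for _ in range(n_reads)]
    return [(s, genome[s:s + read_len]) for s in starts]


def gzip_size(seqs):
    """Size in bytes of the gzip-compressed, newline-joined sequences."""
    return len(gzip.compress("\n".join(seqs).encode("ascii")))


def compare_orders(seed=0):
    """Return (clumped_size, arbitrary_size) for the same reads in two orders."""
    reads = simulate_reads(seed=seed)
    # Sampling order is effectively random, like reads whose order was not preserved.
    arbitrary = [seq for _, seq in reads]
    # Stand-in for clumpify: place overlapping reads next to each other.
    clumped = [seq for _, seq in sorted(reads)]
    return gzip_size(clumped), gzip_size(arbitrary)


if __name__ == "__main__":
    clumped, arbitrary = compare_orders()
    print(f"clumped order:   {clumped} bytes")
    print(f"arbitrary order: {arbitrary} bytes")
```

Same reads, same compressor; only the order differs. Any step that scrambles the order of a clumped file (for example multithreaded output without `ordered`) would move it back toward the arbitrary case.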