Closed by funnell 1 year ago
Thanks for bringing this up; I'm hesitant to do this for a few reasons:
I'm not sure how seqkit split2 deals with ordering, but since they have a separate tool for shuffling I am assuming that the splits are unshuffled, and the splits will then have groups of reads with the same prefixes. If we split files where the reads aren't randomly distributed, we are going to have a worse time merging the results from each split.
This is the "ordered" flag, not "reorder". This option would just make sure that bbduk outputs the reads in the same order as its input, maintaining the compression advantages of the clumpify-ordered fastq.
When multiple threads are used, reads will not come out in the same order the went in, unless the “ordered” flag is used.
I'm not sure I understand your seqkit split2 point. My understanding is that clumpify reorders the reads anyway, so they wouldn't be randomly distributed at that point?
I'm still confused by this. I don't see any examples in that biostars page or across github where people use ordered or reorder when using clumpify for deduplication. It seems like there are two applications of this tool: compression or deduplication. I may be interpreting it wrong though. Do you have any usage examples where people have used both?
My understanding is that the compression benefits come from clumpify reordering reads in a certain way. The "ordered" flag for BBDuk just forces it to output reads in the same order as the input, maintaining the compression benefits of the clumpify-reordered reads.
from https://www.biostars.org/p/225338/
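To make the compression point above concrete, here is a toy sketch of why read order matters to a gzip-style compressor. It does not use clumpify or BBDuk and is not their algorithm: the reads are synthetic, exact duplicates stand in for reads that share sequence, and `make_reads` and `compressed_size` are hypothetical helpers written only for this illustration.

```python
import random
import zlib


def make_reads(n_templates=2000, copies=5, length=100, seed=0):
    """Build synthetic reads: each random template appears `copies` times in a row."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_templates):
        template = "".join(rng.choice("ACGT") for _ in range(length))
        reads.extend([template] * copies)
    return reads


def compressed_size(reads):
    """Return the zlib (DEFLATE) compressed size in bytes of newline-joined reads."""
    return len(zlib.compress("\n".join(reads).encode("ascii")))


# "Grouped" order keeps similar reads adjacent (a stand-in for clumped output).
grouped = make_reads()

# Shuffling the same reads mimics output whose order was not preserved.
shuffled = grouped[:]
random.Random(1).shuffle(shuffled)

grouped_size = compressed_size(grouped)
shuffled_size = compressed_size(shuffled)
print("grouped:", grouped_size, "bytes; shuffled:", shuffled_size, "bytes")
```

Both inputs contain exactly the same reads; only the order differs. DEFLATE can only reference matches inside a limited look-back window, so when similar reads end up far apart after reordering, most of those matches are lost and the output is larger.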