lh3 / seqtk

Toolkit for processing sequences in FASTA/Q formats
MIT License

Request: Add seqtk shuffle command to randomise order of reads #134

Open peterjc opened 5 years ago

peterjc commented 5 years ago

I have been creating mock community samples using seqtk sample on some single isolate inputs, something like this:

rm -rf tempR1.fastq tempR2.fastq
for sample in A B C; do
    seqtk sample -s 123 input${sample}_R1.fastq.gz 10000 >> tempR1.fastq
    seqtk sample -s 123 input${sample}_R2.fastq.gz 10000 >> tempR2.fastq
done
gzip tempR1.fastq
gzip tempR2.fastq

In this example my combined FASTQ files will have the reads from sample A, then sample B, and finally sample C. This ordering may introduce biases in the downstream analysis.

What I would like to do is finish with something like this:

seqtk shuffle -s 123 tempR1.fastq | gzip > mixed_R1.fastq.gz
seqtk shuffle -s 123 tempR2.fastq | gzip > mixed_R2.fastq.gz

Here I am assuming -s would set the random number seed, as it does in seqtk sample, so that both R1 and R2 are randomised in the same way and the output remains correctly paired.
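For illustration only, the requested behaviour can be approximated with standard Unix tools. This is a rough sketch, not a seqtk feature: `shuffle_fastq` is a hypothetical helper name, and it assumes strict 4-line FASTQ records, no tab characters in the read headers, and R1/R2 files with the same number of records in the same order. Each record gets a pseudo-random key from awk seeded with the given value, so the same seed and the same awk implementation should give the same permutation for both files.

```shell
# Hypothetical helper (not part of seqtk): reproducibly shuffle FASTQ records.
# Assumes 4-line records and no tab characters inside any line.
# Steps: join each record onto one tab-separated line, prefix a seeded random
# key, stable-sort on that key, drop the key, restore the 4-line layout.
shuffle_fastq() {
    local seed="$1"
    paste - - - - |
        awk -v seed="$seed" 'BEGIN { srand(seed) } { printf "%.17f\t%s\n", rand(), $0 }' |
        LC_ALL=C sort -s -t "$(printf '\t')" -k1,1n |
        cut -f2- |
        tr '\t' '\n'
}

# Intended use, mirroring the proposed seqtk shuffle calls:
#   shuffle_fastq 123 < tempR1.fastq | gzip > mixed_R1.fastq.gz
#   shuffle_fastq 123 < tempR2.fastq | gzip > mixed_R2.fastq.gz
```

The whole file passes through sort, so this only suits inputs that sort can handle with its temporary files; the actual permutation may differ between awk implementations, but R1 and R2 stay in step as long as both are processed with the same one.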

tseemann commented 4 years ago

@peterjc Until this is implemented, you can use seqkit shuffle