lh3 / seqtk

Toolkit for processing sequences in FASTA/Q formats
MIT License

Problem with seqtk sample #199

Closed by SplitInf 1 year ago

SplitInf commented 1 year ago

Hi, I'm experiencing a problem similar to one that I now realize others are also facing (e.g. #193).

My input files have about 300-500M reads. When I downsample to 300M reads, some samples get stuck and produce no output, even with increased memory and the -2 option. As a test, I then downsampled to 1M reads; every sample ran, but the results are not what I expected:

Sample1_1M_genome.bwt2pairs.pairstat Total_pairs_processed 721151 100.0

Sample2_1M_genome.bwt2pairs.pairstat Total_pairs_processed 1000000 100.0

Sample3_1M_genome.bwt2pairs.pairstat Total_pairs_processed 1000000 100.0

Sample4_1M_genome.bwt2pairs.pairstat Total_pairs_processed 607391 100.0

Sample5_1M_genome.bwt2pairs.pairstat Total_pairs_processed 1000000 100.0

Sample6_1M_genome.bwt2pairs.pairstat Total_pairs_processed 745213 100.0

These are my commands:

seqtk sample -2 -s 2345 $RAWDIR/${SAMPLE}_R1.fastq \$frac > $RAWDIR/${SAMPLE}_1M_R1.fastq
seqtk sample -2 -s 2345 $RAWDIR/${SAMPLE}_R2.fastq \$frac > $RAWDIR/${SAMPLE}_1M_R2.fastq

[seqtk version: 1.1-r93-dirty]
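One way to narrow down where pair counts like the ones above diverge is to count the records in each subsampled file before running the aligner. The following is only a rough sketch: `count_fastq_reads` and `check_pair_counts` are hypothetical helper names, and it assumes uncompressed FASTQ files in which every record spans exactly four lines.

```shell
# Hypothetical helper: count records in an uncompressed FASTQ file,
# assuming every record spans exactly four lines.
count_fastq_reads() {
    awk 'END { printf "%d\n", NR / 4 }' "$1"
}

# Hypothetical helper: print both counts and succeed only if the
# two mate files contain the same number of reads.
check_pair_counts() {
    r1_count=$(count_fastq_reads "$1")
    r2_count=$(count_fastq_reads "$2")
    echo "R1=$r1_count R2=$r2_count"
    [ "$r1_count" -eq "$r2_count" ]
}
```

For example, `check_pair_counts "$RAWDIR/${SAMPLE}_1M_R1.fastq" "$RAWDIR/${SAMPLE}_1M_R2.fastq"` would report both counts and return a non-zero status when they differ.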

SplitInf commented 1 year ago

I have rechecked my code and found that some file links weren't set up properly, which caused the script to fail. For future reference, using the -2 option and allowing more time and RAM helped.
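Since the failure came down to file links, a small guard run before `seqtk sample` could catch this kind of problem early. This is a minimal sketch, not part of seqtk: `check_input` is a hypothetical helper name, and it only checks that a path is not a dangling symlink and that the file is non-empty.

```shell
# Hypothetical helper: fail if an input is a dangling symlink,
# is missing, or is empty.
check_input() {
    if [ -L "$1" ] && [ ! -e "$1" ]; then
        echo "broken symlink: $1" >&2
        return 1
    fi
    if [ ! -s "$1" ]; then
        echo "missing or empty file: $1" >&2
        return 1
    fi
}
```

A script could then call, for instance, `check_input "$RAWDIR/${SAMPLE}_R1.fastq" || exit 1` before each sampling step.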