seqtk sample not working as expected

antoine4ucsd commented 2 years ago

Hello I am trying to subsample fastq.gz file but not sure if it really works as expected above a given limit.

my source file contains 150k reads

awk '{s++}END{print s/4}' ./BA922J_barcode16_run5_merged.fastq.gz
150626

but when trying to subset:

seqtk sample -s100 BA922J_barcode16_run5_merged.fastq.gz 10000 > BA922J_10000.gz
seqtk sample -s100 BA922Jl_barcode16_run5_merged.fastq.gz 40000 > BA922J_40000.gz
seqtk sample -s100 BA922J_barcode16_run5_merged.fastq.gz 60000 > BA922J_60000.gz
seqtk sample -s100 BA922J_barcode16_run5_merged.fastq.gz 80000 > BA922J_80000.gz
seqtk sample -s100 BA922J_barcode16_run5_merged.fastq.gz 100000 > BA922J_100000.gz

then the file size is plateauing...

-rwxrwxrwx  1  staff    82M Jun  9 08:36 BA922J_10000.gz
-rwxrwxrwx  1  staff   314M Jun  9 08:36 BA922J_40000.gz
-rwxrwxrwx  1  staff   314M Jun  9 08:36 BA922J_60000.gz
-rwxrwxrwx  1  staff   314M Jun  9 08:36 BA922J_80000.gz
-rwxrwxrwx  1  staff   314M Jun  9 08:37 BA922J_100000.gz

also need to make sure this is not resampling the same reads. can you confirm (for example if I set the sample to 200k)

not sure what I am doing wrong... thank you!

shenwei356 commented 1 year ago

What's the size of the original file? And also, check the number of reads with seqkit stats.

seqkit stats -j 10 *.gz

PS: The command below does not output gzip format.

seqtk sample -s100 BA922J_barcode16_run5_merged.fastq.gz 10000 > BA922J_10000.gz

# this does
seqtk sample -s100 BA922J_barcode16_run5_merged.fastq.gz 10000 | pigz -c > BA922J_10000.gz

antoine4ucsd commented 1 year ago

thank you. good catch for the typo in the cmd line.

lh3 / seqtk

seqtk sample not working as expected #193