sampling compressed files

lh3 / seqtk

Toolkit for processing sequences in FASTA/Q formats

MIT License


sampling compressed files #155

Closed (dalilasss closed this issue 4 years ago)

dalilasss commented 4 years ago

Is it possible to use seqtk sample on a gzipped file (.fastq.gz) and get the output in the same .fastq.gz format? I tried it with the command [guest@u]$ seqtk sample -s seed=75 SRR7.fastq.gz 0.8 > SRR7.fastq.gz. The input file is 6.3 GB and the output is 16.8 GB, and zcat does not recognize the output's format, so I believe the output is plain, uncompressed FASTQ. The number of sequences in the output file is correct.

Could you please tell me if there is a way to change this? Thank you!

lh3 commented 4 years ago

pipe to gzip

EwersAquaGenomics commented 1 year ago

Hello! Just to clarify: I can use a fastq.gz file as input for seqtk sample, but the output is in plain, uncompressed .fastq format?

Thank you!

fconstancias commented 1 year ago

That's what @lh3 meant: seqtk sample -s seed=75 SRR7.fastq.gz 0.8 | gzip > SRR7.fastq.gz
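A minimal sketch of the pipe-to-gzip approach, written so that the output goes to a new file name instead of back onto the input (redirecting onto the input would truncate it before it is read). The file names are hypothetical, the seed is passed as a plain integer on the assumption that `-s` takes an integer, and the `gzip -dc` branch is only a stand-in so the sketch runs where seqtk is not installed:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical file names; substitute your own.
in=reads.fastq.gz
out=reads.sub.fastq.gz

# Create a tiny gzipped FASTQ so the sketch runs on its own.
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' | gzip > "$in"

# seqtk reads gzipped input but writes plain text, so the output is
# piped through gzip. The output name differs from the input name.
if command -v seqtk >/dev/null 2>&1; then
    seqtk sample -s 75 "$in" 0.8 | gzip > "$out"
else
    # Stand-in when seqtk is not installed: pass all reads through.
    gzip -dc "$in" | gzip > "$out"
fi

# Verify that the result is valid gzip data.
gzip -t "$out"
```

Afterwards `zcat reads.sub.fastq.gz | head` should print FASTQ records, and the input file is left untouched.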

danchurch commented 1 week ago

Btw, for me, if I do not give the gzipped output file a new name, the command clobbers the original file with an empty file. E.g., if I use:

seqtk seq -L 50 300_S300_L001_R1_001.fastq.gz | gzip > 300_S300_L001_R1_001.fastq.gz

the final fastq.gz file is empty. However, if I use:

seqtk seq -L 50 300_S300_L001_R1_001.fastq.gz | gzip > 300_S300_L001_R1_001.fastq_noSmalls.gz

the new file (300_S300_L001_R1_001.fastq_noSmalls.gz) contains the desired reads.

I'm using seqtk 1.3-r106.

shenwei356 commented 1 week ago

seqtk seq -L 50 300_S300_L001_R1_001.fastq.gz | gzip > 300_S300_L001_R1_001.fastq.gz

You overwrote the input file, so it became empty: the shell truncates the redirect target before seqtk starts reading it. This is a dangerous operation. The correct way is to 1) write the filtered data to a new file, then 2) rename the new file to the old name.

But I would choose to keep the original files, for safety.
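The two steps above can be sketched roughly as follows. This is only an illustration under assumptions: the file names are hypothetical, the length cutoff is lowered to 3 so the tiny demo read survives, and the `gzip -dc` branch is a stand-in for systems without seqtk. Keeping a copy of the original instead of renaming over it remains the safer choice.

```shell
#!/usr/bin/env bash
set -euo pipefail

in=reads.fastq.gz      # hypothetical input name
tmp="$in.tmp.$$"       # temporary file next to the input

# Create a tiny gzipped FASTQ so the sketch runs on its own.
printf '@r1\nACGT\n+\nIIII\n' | gzip > "$in"

# Step 1: write the filtered, recompressed reads to a new file.
if command -v seqtk >/dev/null 2>&1; then
    seqtk seq -L 3 "$in" | gzip > "$tmp"
else
    # Stand-in when seqtk is not installed: pass all reads through.
    gzip -dc "$in" | gzip > "$tmp"
fi

# Step 2: replace the original only after the new file checks out.
# With `set -e` and `pipefail`, any failure above stops the script
# before the rename, so the original stays intact.
gzip -t "$tmp"
mv -- "$tmp" "$in"
```

Because the temporary file sits in the same directory as the input, the final `mv` is a rename on one filesystem rather than a copy.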

danchurch commented 1 week ago

@shenwei356 Agree. I'm just noting this because @fconstancias's solution as of this writing does the same - it clobbers the original file. I thought other folks should know.