marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

gzip.BadGzipFile: Not a gzipped file (b'@m') #688

Closed c383d893 closed 1 year ago

c383d893 commented 1 year ago

Hi cutadapt,

I'm writing because I am running cutadapt with the script below. I do get output files, but my error log reports the error shown further below.

Thanks for any thoughts or suggestions, Camille

SCRIPT:

#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH --job-name=primerremoval
#SBATCH --output=primerremoval.out
#SBATCH --error=primerremoval.err

module load gcc/8.2.0 python/3.9.9
cutadapt --version

mkdir primerremoval

IF="cutadapt_samples.barcodes.noheader.csv"

while read LINE; do
    NUM=$(echo "$LINE" | awk -F, '{print $1}')

    echo "Sample $NUM"
    cutadapt -g ^GTACACACCGCCCGTCG...GCATATHANTAAGSGSAGGCG$ --discard-untrimmed -o out.fastq.gz demultiplexed/$NUM.out.fastq.gz

    mv out.fastq.gz primerremoval/$NUM.trimmed.out.fastq.gz
done < $IF

ERROR:

Traceback (most recent call last):
  File "/cluster/apps/nss/gcc-8.2.0/python/3.9.9/x86_64/bin/cutadapt", line 8, in <module>
    sys.exit(main_cli())
  File "/cluster/apps/nss/gcc-8.2.0/python/3.9.9/x86_64/lib64/python3.9/site-packages/cutadapt/main.py", line 965, in main$
    main(sys.argv[1:])
  File "/cluster/apps/nss/gcc-8.2.0/python/3.9.9/x86_64/lib64/python3.9/site-packages/cutadapt/main.py", line 1051, in main
    stats = r.run()
  File "/cluster/apps/nss/gcc-8.2.0/python/3.9.9/x86_64/lib64/python3.9/site-packages/cutadapt/pipeline.py", line 1004, in run
    (n, total1_bp, total2_bp) = self._pipeline.process_reads(
  File "/cluster/apps/nss/gcc-8.2.0/python/3.9.9/x86_64/lib64/python3.9/site-packages/cutadapt/pipeline.py", line 402, in proc$
    for read in self._reader:
  File "src/dnaio/_core.pyx", line 244, in fastq_iter
  File "/cluster/apps/nss/gcc-8.2.0/python/3.9.9/x86_64/lib64/python3.9/gzip.py", line 300, in read
    return self._buffer.read(size)
  File "/cluster/apps/nss/gcc-8.2.0/python/3.9.9/x86_64/lib64/python3.9/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/cluster/apps/nss/gcc-8.2.0/python/3.9.9/x86_64/lib64/python3.9/gzip.py", line 487, in read
    if not self._read_gzip_header():
  File "/cluster/apps/nss/gcc-8.2.0/python/3.9.9/x86_64/lib64/python3.9/gzip.py", line 435, in _read_gzip_header
    raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'@m')

marcelm commented 1 year ago

Hi, it appears that the file demultiplexed/$NUM.out.fastq.gz is not actually gzip-compressed. That is, it incorrectly has the file name extension .gz. Try renaming that file to just demultiplexed/$NUM.out.fastq and see whether that fixes it. You may want to find out why the file has the wrong extension in the first place. Perhaps the program that generated it does not support gzipped files.
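If it helps, one way to check from the shell whether a file is really gzip-compressed (a minimal sketch; the sample name here is a hypothetical placeholder, substitute one of your actual files):

# A real gzip file is reported as "gzip compressed data"; a plain FASTQ shows up as "ASCII text"
file demultiplexed/sample01.out.fastq.gz

# Alternatively, test the gzip stream directly; this exits with an error for a non-gzip file
gzip -t demultiplexed/sample01.out.fastq.gz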

BTW, you can simplify the script a little bit: Instead of letting Cutadapt generate the file out.fastq.gz and then renaming that to primerremoval/$NUM.trimmed.out.fastq.gz, you can just let Cutadapt generate the final output file directly:

cutadapt -g ^GTACACACCGCCCGTCG...GCATATHANTAAGSGSAGGCG$ --discard-untrimmed -o primerremoval/$NUM.trimmed.out.fastq.gz demultiplexed/$NUM.out.fastq.gz
c383d893 commented 1 year ago

Thanks a lot Marcel!

The files are generated using cutadapt in this script, so they should really be .gz files. Let me know if you have any suggestions based on this:

#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH --job-name=demultiplex
#SBATCH --output=demultiplex.out
#SBATCH --error=demultiplex.err

module load gcc/8.2.0 python/3.9.9
cutadapt --version
module load seqkit/0.16.1

mkdir demultiplexed

IF="cutadapt_samples.barcodes.noheader.csv"
FASTQ="m64141e_221217_065033.hifi_reads.fastq.gz"

while read LINE; do
    NUM=$(echo "$LINE" | awk -F, '{print $1}')
    BC2=$(echo "$LINE" | awk -F, '{print $2}')
    BC3=$(echo "$LINE" | awk -F, '{print $3}')
    BC4=$(echo "$LINE" | awk -F, '{print $4}')
    BC5=$(echo "$LINE" | awk -F, '{print $5}')

    echo "Sample $NUM"
    cutadapt -g ^$BC2...$BC5$ --discard-untrimmed -o out1.fastq.gz $FASTQ
    cutadapt -g ^$BC4...$BC3$ --discard-untrimmed -o out2.fastq.gz $FASTQ

    seqkit seq out2.fastq.gz -t -r -p -o out2_revcom.fastq.gz
    cat out1.fastq.gz out2_revcom.fastq.gz > demultiplexed/$NUM.out.fastq.gz

    rm out1.fastq.gz
    rm out2.fastq.gz
    rm out2_revcom.fastq.gz
done < $IF

c383d893 commented 1 year ago

P.S. I am demultiplexing PacBio dual-orientation files, so I need to take the reverse complement of the 'reverse'-oriented reads. That's why I'm using seqkit here.

Best, Camille

marcelm commented 1 year ago

P.S. I am demultiplexing PacBio dual-orientation files, so I need to take the reverse complement of the 'reverse'-oriented reads. That's why I'm using seqkit here.

Would Cutadapt’s --revcomp option perhaps be useful? Also, Cutadapt has built-in demultiplexing functionality; it seems that would also be useful here.
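A rough sketch of what that could look like, assuming a hypothetical barcodes.fasta whose record names are the sample names and whose sequences use the linked-adapter syntax (please check the cutadapt documentation for the exact rules on adapter FASTA files before relying on this):

# barcodes.fasta (hypothetical), one record per sample, e.g.:
#   >sample01
#   ^ACGTACGT...TTGGCCAA$
#
# -g file:... loads all per-sample adapters at once, --revcomp also tries the
# reverse complement of each read, and {name} in -o is replaced by the name of
# the matching adapter, i.e. the sample name.
cutadapt -g file:barcodes.fasta --revcomp --discard-untrimmed \
    -o "demultiplexed/{name}.out.fastq.gz" m64141e_221217_065033.hifi_reads.fastq.gz

If --revcomp behaves as documented, reads that match in the reverse orientation are written out reverse-complemented, which would make the separate seqkit step and the per-sample loop unnecessary; this is only a sketch, not a tested replacement for the script above.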

I cannot see an issue with the above script regarding gzip compression. You could find out which of the files are not properly gzip-compressed by changing the script so that it runs file out1.fastq.gz, file out2.fastq.gz and file out2_revcom.fastq.gz before it deletes those files, as sketched below. If that command outputs ASCII text, then you know that the file is not compressed.

cat out1.fastq.gz out2_revcom.fastq.gz > demultiplexed/$NUM.out.fastq.gz
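For example, the checks could sit in the loop just before the temporary files are removed (a minimal sketch using the same file names as in the script above):

# Print the detected type of each intermediate; anything reported as
# "ASCII text" instead of "gzip compressed data" is the culprit.
file out1.fastq.gz
file out2.fastq.gz
file out2_revcom.fastq.gz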
c383d893 commented 1 year ago

Thanks Marcel.

This did not solve the issue. I instead used dada2's primer removal function (removePrimers) in R.

Best, Camille

marcelm commented 1 year ago

Thanks for letting me know.