Closed c383d893 closed 1 year ago
Hi, it appears that the file demultiplexed/$NUM.out.fastq.gz
is not actually gzip-compressed. That is, it incorrectly has the file name extension .gz
. Try renaming that file to just demultiplexed/$NUM.out.fastq
and see whether that fixes it. You may want to find out why the file has the wrong extension in the first place. Perhaps the program that generated it does not support gzipped files.
BTW, you can simplify the script a little bit: Instead of letting Cutadapt generate the file out.fastq.gz
and then renaming that to primerremoval/$NUM.trimmed.out.fastq.gz
, you can just let Cutadapt generate the final output file directly:
cutadapt -g ^GTACACACCGCCCGTCG...GCATATHANTAAGSGSAGGCG$ --discard-untrimmed -o primerremoval/$NUM.trimmed.out.fastq.gz demultiplexed/$NUM.out.fastq.gz
Thanks a lot Marcel!
The files are generated using cutadapt in this script, so they should really be .gz files. Let me know if you have any suggestions based on this:
module load gcc/8.2.0 python/3.9.9 cutadapt --version
module load seqkit/0.16.1
mkdir demultiplexed
IF="cutadapt_samples.barcodes.noheader.csv" FASTQ="m64141e_221217_065033.hifi_reads.fastq.gz"
while read LINE; do NUM=$(echo "$LINE" | awk -F, '{print $1}') BC2=$(echo "$LINE" | awk -F, '{print $2}') BC3=$(echo "$LINE" | awk -F, '{print $3}') BC4=$(echo "$LINE" | awk -F, '{print $4}') BC5=$(echo "$LINE" | awk -F, '{print $5}')
echo “Sample $NUM” cutadapt -g ^$BC2...$BC5$ --discard-untrimmed -o out1.fastq.gz $FASTQ cutadapt -g ^$BC4...$BC3$ --discard-untrimmed -o out2.fastq.gz $FASTQ
seqkit seq out2.fastq.gz -t -r -p -o out2_revcom.fastq.gz cat out1.fastq.gz out2_revcom.fastq.gz > demultiplexed/$NUM.out.fastq.gz
rm out1.fastq.gz rm out2.fastq.gz rm out2_revcom.fastq.gz
done < $IF
P.S. I am demultiplexing pacbio dual orientation files, so need to get the reverse complement of the 'reverse' oriented reads. That's why I'm using seqkit here.
Best, Camille
P.S. I am demultiplexing pacbio dual orientation files, so need to get the reverse complement of the 'reverse' oriented reads. That's why I'm using seqkit here.
Would Cutadapt’s --revcomp
option perhaps be useful? Also, Cutadapt has built-in demultiplexing functionality, it seems this would also be useful here.
I cannot see an issue with the above script regarding gzip compression. You could find out which of the files are not properly gzip-compressed by changing it so that it runs file out1.fastq.gz
, file out2.fastq.gz
and file out2_revcom.fastq.gz
before it deletes those files. If that command outputs ASCII text
, then you know that the file is not compressed.
cat out1.fastq.gz out2_revcom.fastq.gz > demultiplexed/$NUM.out.fastq.gz
Thanks Marcel.
This did not solve the issue. I instead used dada2's primer removal function (removePrimers) in R.
Best, Camille
Thanks for letting me know.
Hi cutadapt,
I'm writing because I am using cutadapt (see below) and I get output files, but my error log reports the following error (see below).
Thanks for any thoughts or suggestions, Camille
SCRIPT:
!/bin/bash
SBATCH --time=24:00:00
SBATCH --job-name=primerremoval
SBATCH --output=primerremoval.out
SBATCH --error=primerremoval.err
module load gcc/8.2.0 python/3.9.9 cutadapt --version
mkdir primerremoval
IF="cutadapt_samples.barcodes.noheader.csv"
while read LINE; do NUM=$(echo "$LINE" | awk -F, '{print $1}')
echo “Sample $NUM” cutadapt -g ^GTACACACCGCCCGTCG...GCATATHANTAAGSGSAGGCG$ --discard-untrimmed -o out.fastq.gz demultiplexed/$NUM.out.fastq.gz
mv out.fastq.gz primerremoval/$NUM.trimmed.out.fastq.gz
done < $IF
ERROR:
Traceback (most recent call last): File "/cluster/apps/nss/gcc-8.2.0/python/3.9.9/x86_64/bin/cutadapt", line 8, in
sys.exit(main_cli())
File "/cluster/apps/nss/gcc-8.2.0/python/3.9.9/x86_64/lib64/python3.9/site-packages/cutadapt/main.py", line 965, in main$
main(sys.argv[1:])
File "/cluster/apps/nss/gcc-8.2.0/python/3.9.9/x86_64/lib64/python3.9/site-packages/cutadapt/main.py", line 1051, in main
stats = r.run()
File "/cluster/apps/nss/gcc-8.2.0/python/3.9.9/x86_64/lib64/python3.9/site-packages/cutadapt/pipeline.py", line 1004, in run
(n, total1_bp, total2_bp) = self._pipeline.process_reads(
File "/cluster/apps/nss/gcc-8.2.0/python/3.9.9/x86_64/lib64/python3.9/site-packages/cutadapt/pipeline.py", line 402, in proc$
for read in self._reader:
File "src/dnaio/_core.pyx", line 244, in fastq_iter
File "/cluster/apps/nss/gcc-8.2.0/python/3.9.9/x86_64/lib64/python3.9/gzip.py", line 300, in read
return self._buffer.read(size)
File "/cluster/apps/nss/gcc-8.2.0/python/3.9.9/x86_64/lib64/python3.9/_compression.py", line 68, in readinto
data = self.read(len(byte_view))
File "/cluster/apps/nss/gcc-8.2.0/python/3.9.9/x86_64/lib64/python3.9/gzip.py", line 487, in read
if not self._read_gzip_header():
File "/cluster/apps/nss/gcc-8.2.0/python/3.9.9/x86_64/lib64/python3.9/gzip.py", line 435, in _read_gzip_header
raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'@m')