marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License
523 stars 130 forks source link

Missing output in log/json files #817

Closed luigallucci closed 5 days ago

luigallucci commented 1 week ago

Hi @marcelm ,

i'm using the actual stable version of cutadapt on python 3.10.

My question is regarding the demultiplexing of mixed oriented reads.

/cutadapt \
    -j 30 \
    -e 0.1 \
    --revcomp \
    --no-indels \
    --minimum-length 200 \
    --discard-untrimmed \
    -g "^file:${temp_output_file};o=99" \
    -G "^file:${rev_primer}" \
    -o ${dir}/demu_{name}.R1.fastq.gz -p ${dir}/demu_{name}.R2.fastq.gz \
    ${input_directory_R1} ${input_directory_R2} \
    --json=${dir}/${run}.cutadapt.json 

My actual command looks like that. Basically, I'm able to get almost full trimming (96%) using --revcomp. Without that flag, around 45%. Anyway, I'm pretty sure that the reads are mixed oriented as the sequencing center has a long-time collaboration and they always worked in that way, producing the mixed oriented reads for metabarcoding (Illumina Next-seq, 2 x 300 bp).

First, have you any comment in general on the command? are they correct for that type of data? Second, my question regarding the use of --revcomp, is that adding this flag for me is not possible to get a normal stats output of cutadapt.

Here is everything I get:

This is cutadapt 4.8 with Python 3.10.12
Command line parameters: -j 30 -e 0.1 --revcomp --no-indels --minimum-length 200 --discard-untrimmed -g ^file:/tmp/tmp.xiGe94flAw;o=99 -G ^file:/bioinf/home/lgallucc/Barcodes_primer/primersR2.fasta -o /bioinf/home/lgallucc/M19>
Processing paired-end reads on 30 cores ...
Finished in 51.495 s (9.449 µs/read; 6.35 M reads/minute).

=== Summary ===

Total read pairs processed:          5,449,554

== Read fate breakdown ==
Pairs that were too short:              23,314 (0.4%)
Pairs discarded as untrimmed:          240,519 (4.4%)
Pairs written (passing filters):     5,185,721 (95.2%)

Total basepairs processed: 3,274,676,164 bp
  Read 1: 1,636,246,399 bp
  Read 2: 1,638,429,765 bp
Total written (filtered):  2,892,756,313 bp (88.3%)
  Read 1: 1,430,803,451 bp
  Read 2: 1,461,952,862 bp

While usually is supposed to give me also information on the different barcode, right? I'm looking forward to heard your opinion on that.

EDIT: I'm doing demultiplexing and at the same time also primers removal (so on the -G there is a single line corresponding to the reverse primer, while on the -g all the barcodes+forwardprimer)

marcelm commented 1 week ago

Hi, thanks for the report!

Regarding your command, it may be that -j 30 is a bit excessive; it could be that the overhead of communicating between threads is a bit high with that many threads. Maybe you should measure whether you actually get a speedup over, let’s say -j 16.

Also, -e 0.1 is the default maximum error rate, so you could leave out that part of the command.

Regarding stats with --revcomp, I can confirm that the report is incomplete when using --revcomp with paired-end data. Minimal example:

cutadapt --revcomp -o 1.fastq -p 2.fastq tests/data/revcomp.1.fastq tests/data/revcomp.2.fastq

Also, when I add --json stats.cutadapt.json, a file is produced, but it contains "reverse_complemented": null.

I’ll look into this.

luigallucci commented 6 days ago

Hi, thanks for the report!

Regarding your command, it may be that -j 30 is a bit excessive; it could be that the overhead of communicating between threads is a bit high with that many threads. Maybe you should measure whether you actually get a speedup over, let’s say -j 16.

Also, -e 0.1 is the default maximum error rate, so you could leave out that part of the command.

Thank you for the feedback!

Regarding stats with --revcomp, I can confirm that the report is incomplete when using --revcomp with paired-end data. Minimal example:

cutadapt --revcomp -o 1.fastq -p 2.fastq tests/data/revcomp.1.fastq tests/data/revcomp.2.fastq

Also, when I add --json stats.cutadapt.json, a file is produced, but it contains "reverse_complemented": null.

I’ll look into this.

Thank you again for the work and developing effort :)

marcelm commented 5 days ago

This is fixed now.