-G does not behave as expected, i.e. as -g for R2.

marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads

https://cutadapt.readthedocs.io

MIT License

502 stars 125 forks source link

-G does not behave as expected, i.e. as -g for R2. #692

Open ryandward opened 1 year ago

ryandward commented 1 year ago

It doesn't seem that switching -g to -G, and also switching the order of the reads results in the same results.

❯ cutadapt \
  --action=none \
  -G ^file:adapter_file.fasta \ # <-- looking at read 2
  -j 12 --no-indels --error-rate=0 \
  -o demultiplexed_{name}_R1.fastq.gz \
  -p demultiplexed_{name}_R2.fastq.gz \
  read1.fastq.gz \
  read2.fastq.gz

This results in everything going to files with unknown in the variable portion of the name.

However, when I switch around -g to -G, and switch places of read1 and read2, adapter trimming works as expected.

❯ cutadapt \
  --action=none \
  -g ^file:adapter_file.fasta \ # <-- changed to lowercase, i.e. looking at read1
   -j 12 --no-indels --error-rate=0 \
  -o demultiplexed_{name}_R1.fastq.gz \
  -p demultiplexed_{name}_R2.fastq.gz \
  read2.fastq.gz \ # <-- switched order
  read1.fastq.gz

❯ conda --version conda 22.11.1 ❯ cutadapt --version 4.3

marcelm commented 1 year ago

This may be easy to overlook, but it is documented in the section about demultiplexing:

Paired-end demultiplexing always uses the adapter matches of the first read to decide where a read should be written. If adapters for read 2 are given (-A/-G), they are detected and removed as normal, but these matches do not influence where the read pair is written.

To demultiplex using a barcode that is located on read 2, you can “cheat” and swap the roles of R1 and R2 for both the input and output files

cutadapt -e 1 -g ^file:barcodes.fasta -o trimmed-{name}.2.fastq.gz -p trimmed-{name}.1.fastq.gz input.2.fastq.gz input.1.fastq.gz

If you do this in a script or pipeline, it may be a good idea to add a comment to clarify that this reversal of R1 and R2 is intended.

I wonder how to improve this. I think the first step is to print a warning when trying to demultiplex and no adapters were provided for R1.

ryandward commented 1 year ago

Thanks for the clarification. I finally did wind up cheating by swapping the reads, which is totally okay by the way. I think your proposal for a warning makes a lot of sense.

My script will then have to swap the reads back from [2,1] to [1,2] in a downstream process, but it's not at all broken.

Cheers.

marcelm commented 1 year ago

I’ll also see whether I can make it so that you can specify {name2} as a template variable. That would then explicitly declare that you want to use the matches on R2 for demultiplexing. Then no file swapping is necessary.