marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License
514 stars 129 forks source link

cutadapt pair-adapters almost all sequences end up in the unknown fastq files #741

Closed cpavloud closed 10 months ago

cpavloud commented 10 months ago

Hello

I am running cutadapt/4.1 (with Python 3.7.8) in an HPC cluster. The command I am running is

cutadapt --pair-adapters -a 18S=^GCTTGTCTCAAAGATTAAGCC...TCYAAGGAAGGCAGCAGG -A ^CCTGCTGCCTTCCTTRGA...GGCTTAATCTTTGAGACAAGC -m 1 -o "cutadapted-SampleA-{name}_R1_001.fastq" -p "cutadapted-SampleA-{name}_R2_001.fastq" SampleA_S466_L001_R1_001.fastq SampleA_S466_L001_R2_001.fastq

I have attached the output output.txt

The input files were ~80.000 KB each, the unknown fastq files are ~70.000 KB each and my actual files are ~10.000 KB Do you have any idea why this is happening or how to correct it?

The PCR product is ~400 bp so the total amplicon length (including adapters, primer tails etc) is ~568 bp. MiSeq V3 (300bp paired end) was used.

marcelm commented 10 months ago

Can you please clarify which part of the output is not as you expected?

cpavloud commented 10 months ago

I was expecting to have the majority of sequences in the cutadapted-SampleA-18S_R1_001.fastq and cutadapted-SampleA-18S_R2_001.fastq files. Yet, the vast majority of sequences are in the cutadapted-SampleA-unknown_R1_001.fastq and cutadapted-SampleA-unknown_R2_001.fastq files.

marcelm commented 10 months ago

I see, I apparently did not notice the issue title.

At the moment, you don’t know whether it is the R1 adapter that is not found or the R2 one. As a test, try to leave out --pair-adapters and see what changes.

Also, you could have a look at the FASTQ files (using less or so) to get an impression of how the data looks. You should be able to see whether most R1 reads start with GCTTGTCTCAAAGATTAAGCC as you expect (similar for R2).

cpavloud commented 10 months ago

R1 reads start with GCTTGTCTCAAAGATTAAGCC and R2 reads start with CCTGCTGCCTTCCTTRGA as expected. The issue is in the how the reads end... I deleted the --pair-adapters option and things were a little better, but still I lose a lot of sequences: cutadapted-SampleA-18S_R1_001.fastq and cutadapted-SampleA-18S_R2_001.fastq were ~30.000 KB cutadapted-SampleA-unknown_R1_001.fastq and cutadapted-SampleA-unknown_R2_001.fastq were ~50.000 KB

Do you think I can run the command like this? cutadapt --pair-adapters -a 18S=^GCTTGTCTCAAAGATTAAGCC -A ^CCTGCTGCCTTCCTTRGA -m 1 -o "cutadapted-SampleA-{name}_R1_001.fastq" -p "cutadapted-SampleA-{name}_R2_001.fastq" SampleA_S466_L001_R1_001.fastq SampleA_S466_L001_R2_001.fastq so, without taking the end of the reads into account?

marcelm commented 10 months ago

The issue is in the how the reads end...

Hm, this should actually not matter. In linked adapters such as -a ^ADAPTER1...ADAPTER2, the second adapter is optional. That is, Cutadapt only requires the ADAPTER1 sequence to be in the read. The ADAPTER2 is detected and removed only if it exists.

I deleted the --pair-adapters option and things were a little better, but still I lose a lot of sequences: cutadapted-SampleA-18S_R1_001.fastq and cutadapted-SampleA-18S_R2_001.fastq were ~30.000 KB cutadapted-SampleA-unknown_R1_001.fastq and cutadapted-SampleA-unknown_R2_001.fastq were ~50.000 KB

Can you please run these two commands:

cutadapt -g ^GCTTGTCTCAAAGATTAAGCC -o /dev/null SampleA_S466_L001_R1_001.fastq
cutadapt -g ^GCTTGTCTCAAAGATTAAGCC -o /dev/null SampleA_S466_L001_R2_001.fastq

What does the report say in both cases? (Just copy and paste it into the issue.)

Do you think I can run the command like this? cutadapt --pair-adapters -a 18S=^GCTTGTCTCAAAGATTAAGCC -A ^CCTGCTGCCTTCCTTRGA -m 1 -o "cutadapted-SampleA-{name}_R1_001.fastq" -p "cutadapted-SampleA-{name}_R2_001.fastq" SampleA_S466_L001_R1_001.fastq SampleA_S466_L001_R2_001.fastq so, without taking the end of the reads into account?

You would need to use -g and -G instead of -a and -A, but it could work.

Here is one other thought: You are using {name} in a way which I had not anticipated. It is meant to be used for demultiplexing when you have many adapters (which would actually be barcodes or index sequences). You are using it only to distinguish between "trimmed" and "not trimmed". This should be fine (and is actually a nice trick I haven’t seen before), but maybe Cutadapt has a bug when you do that. You can achieve the same thing in a different way:

cutadapt --pair-adapters -a 18S=^GCTTGTCTCAAAGATTAAGCC...TCYAAGGAAGGCAGCAGG -A ^CCTGCTGCCTTCCTTRGA...GGCTTAATCTTTGAGACAAGC -m 1 -o cutadapted-SampleA_R1_001.fastq -p cutadapted-SampleA_R2_001.fastq --untrimmed-output=cutadapted-SampleA_untrimmed_R1_001.fastq --untrimmed-paired-output=cutadapted-SampleA_untrimmed_R2_001.fastq SampleA_S466_L001_R1_001.fastq  SampleA_S466_L001_R2_001.fastq

If that command works, then there’s a bug that needs to be fixed.

cpavloud commented 10 months ago

cutadapt -g ^GCTTGTCTCAAAGATTAAGCC -o /dev/null SampleA_S466_L001_R1_001.fastq gives this output output1.txt

and cutadapt -g ^GCTTGTCTCAAAGATTAAGCC -o /dev/null SampleA_S466_L001_R2_001.fastq gives this output output2.txt


Running cutadapt --pair-adapters -g 18S=^GCTTGTCTCAAAGATTAAGCC -G ^CCTGCTGCCTTCCTTRGA -m 1 -o "cutadapted-SampleA-{name}_R1_001.fastq" -p "cutadapted-SampleA-{name}_R2_001.fastq" SampleA_S466_L001_R1_001.fastq SampleA_S466_L001_R2_001.fastq had the same result as my first try, so almost all reads end up in the unknown fastq files


cutadapt --pair-adapters -a 18S=^GCTTGTCTCAAAGATTAAGCC...TCYAAGGAAGGCAGCAGG -A ^CCTGCTGCCTTCCTTRGA...GGCTTAATCTTTGAGACAAGC -m 1 -o cutadapted-SampleA_R1_001.fastq -p cutadapted-SampleA_R2_001.fastq --untrimmed-output=cutadapted-SampleA_untrimmed_R1_001.fastq --untrimmed-paired-output=cutadapted-SampleA_untrimmed_R2_001.fastq SampleA_S466_L001_R1_001.fastq SampleA_S466_L001_R2_001.fastq also had the same result output3.txt


marcelm commented 10 months ago

and cutadapt -g ^GCTTGTCTCAAAGATTAAGCC -o /dev/null SampleA_S466_L001_R2_001.fastq gives this output

Sorry, I put in the wrong sequence. That should have been:

cutadapt -g ^CCTGCTGCCTTCCTTRGA -o /dev/null SampleA_S466_L001_R2_001.fastq

Regarding the output for the first command:

Total reads processed: 119,665 Reads with adapters: 42,867 (35.8%) Reads written (passing filters): 119,665 (100.0%)

This and your output3.txt tell me that Cutadapt works as it should, but that there is likely a problem with your data. You said above that "R1 reads start with GCTTGTCTCAAAGATTAAGCC [...]", but this tells me that only ~36% of the R1 reads start with it. Or is that percentage ok (it is better than the 11% you got initially)? In any case, the output of the fixed command I wrote above would be helpful.

As a kind of quality control, try also to see whether the the Illumina adapter occurs in R1:

cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -O 10 -o /dev/null SampleA_S466_L001_R1_001.fastq

It shouldn’t occur at all because the molecules are longer than the reads, so if it is found in a non-negligible number of reads, there could be a problem with the library itself. I’m thinking of primer dimers, but my experience with these types of problems is limited.

cpavloud commented 10 months ago

cutadapt -g ^CCTGCTGCCTTCCTTRGA -o /dev/null SampleA_S466_L001_R2_001.fastq gives output2.txt Again, ~30% of the reads start with the primer sequence. It is better than the 11% I got initially, but it still seems very low...

cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -O 10 -o /dev/null SampleA_S466_L001_R1_001.fastq gives output_adapter.txt If I read it correctly, the adapter occurs only in 0.2% of the reads, so it seems fine.

marcelm commented 10 months ago

Looking at the statistics for both R1 and R2, I think what is going on is that the primer sequence is in many cases preceded by some nucleotides. However, an anchored adapter (that is, one using the ^ symbol) is found only when it is right at the start of the read.

Here are the detailed statistics for R1 again:

Sequence: GCTTGTCTCAAAGATTAAGCC; Type: anchored 5'; Length: 21; Trimmed: 42867 times

No. of allowed errors: 2

Overview of removed sequences
length  count   expect  max.err error counts
20      71      0.0     2       0 68 3
21      14700   0.0     2       13110 1396 194
22      14877   0.0     2       0 13779 1098
23      13219   0.0     2       0 0 13219

Let me interpret the last row: This describes all adapter matches that removed 23 nucleotides from the read. The 0 0 13219 tells us that the adapter was found 0 times with no errors, 0 times with one error and 13219 times with 2 errors. Since the adapter only has a length of 21 nt, this means that the 13219 occurrences had actually two insertions right at the start. That is, the adapter did not actually start at the beginning. It was only found because two errors were allowed.

It seems to me that your adapters should not actually use the ^ symbol for anchoring. Is this consistent with your expectations?

You can try this command. It should find the primers anywhere within the read as long as they occur in full (o=21/o=18 is the "minimum overlap length"):

cutadapt --pair-adapters -g '18S=^GCTTGTCTCAAAGATTAAGCC;o=21' -G 'CCTGCTGCCTTCCTTRGA;o=18' -m 1 -o "cutadapted-SampleA-{name}_R1_001.fastq" -p "cutadapted-SampleA-{name}_R2_001.fastq" SampleA_S466_L001_R1_001.fastq SampleA_S466_L001_R2_001.fastq
cpavloud commented 10 months ago

Thank you so much for taking the time to look into this!

Running cutadapt --pair-adapters -g '18S=GCTTGTCTCAAAGATTAAGCC;o=21' -G 'CCTGCTGCCTTCCTTRGA;o=18' -m 1 -o "cutadapted-SampleA-{name}_R1_001.fastq" -p "cutadapted-SampleA-{name}_R2_001.fastq" SampleA_S466_L001_R1_001.fastq SampleA_S466_L001_R2_001.fastq worked! This is the output (for your reference): output_non_anchored.txt

Now my cutadapted-SampleA-18S_R1_001.fastq and cutadapted-SampleA-18S_R2_001.fastq are ~70.000 KB each!

marcelm commented 10 months ago

Ah, this looks a lot better! I’m glad I could help. :smile: