Closed cpavloud closed 10 months ago
Can you please clarify which part of the output is not as you expected?
I was expecting to have the majority of sequences in the cutadapted-SampleA-18S_R1_001.fastq
and cutadapted-SampleA-18S_R2_001.fastq
files.
Yet, the vast majority of sequences are in the cutadapted-SampleA-unknown_R1_001.fastq
and cutadapted-SampleA-unknown_R2_001.fastq
files.
I see, I apparently did not notice the issue title.
At the moment, you don’t know whether it is the R1 adapter that is not found or the R2 one. As a test, try to leave out --pair-adapters
and see what changes.
Also, you could have a look at the FASTQ files (using less
or so) to get an impression of how the data looks. You should be able to see whether most R1 reads start with GCTTGTCTCAAAGATTAAGCC
as you expect (similar for R2).
R1 reads start with GCTTGTCTCAAAGATTAAGCC
and R2 reads start with CCTGCTGCCTTCCTTRGA
as expected. The issue is in the how the reads end...
I deleted the --pair-adapters
option and things were a little better, but still I lose a lot of sequences:
cutadapted-SampleA-18S_R1_001.fastq
and cutadapted-SampleA-18S_R2_001.fastq
were ~30.000 KB
cutadapted-SampleA-unknown_R1_001.fastq
and cutadapted-SampleA-unknown_R2_001.fastq
were ~50.000 KB
Do you think I can run the command like this?
cutadapt --pair-adapters -a 18S=^GCTTGTCTCAAAGATTAAGCC -A ^CCTGCTGCCTTCCTTRGA -m 1 -o "cutadapted-SampleA-{name}_R1_001.fastq" -p "cutadapted-SampleA-{name}_R2_001.fastq" SampleA_S466_L001_R1_001.fastq SampleA_S466_L001_R2_001.fastq
so, without taking the end of the reads into account?
The issue is in the how the reads end...
Hm, this should actually not matter. In linked adapters such as -a ^ADAPTER1...ADAPTER2
, the second adapter is optional. That is, Cutadapt only requires the ADAPTER1
sequence to be in the read. The ADAPTER2
is detected and removed only if it exists.
I deleted the
--pair-adapters
option and things were a little better, but still I lose a lot of sequences:cutadapted-SampleA-18S_R1_001.fastq
andcutadapted-SampleA-18S_R2_001.fastq
were ~30.000 KBcutadapted-SampleA-unknown_R1_001.fastq
andcutadapted-SampleA-unknown_R2_001.fastq
were ~50.000 KB
Can you please run these two commands:
cutadapt -g ^GCTTGTCTCAAAGATTAAGCC -o /dev/null SampleA_S466_L001_R1_001.fastq
cutadapt -g ^GCTTGTCTCAAAGATTAAGCC -o /dev/null SampleA_S466_L001_R2_001.fastq
What does the report say in both cases? (Just copy and paste it into the issue.)
Do you think I can run the command like this?
cutadapt --pair-adapters -a 18S=^GCTTGTCTCAAAGATTAAGCC -A ^CCTGCTGCCTTCCTTRGA -m 1 -o "cutadapted-SampleA-{name}_R1_001.fastq" -p "cutadapted-SampleA-{name}_R2_001.fastq" SampleA_S466_L001_R1_001.fastq SampleA_S466_L001_R2_001.fastq
so, without taking the end of the reads into account?
You would need to use -g
and -G
instead of -a
and -A
, but it could work.
Here is one other thought: You are using {name}
in a way which I had not anticipated. It is meant to be used for demultiplexing when you have many adapters (which would actually be barcodes or index sequences). You are using it only to distinguish between "trimmed" and "not trimmed". This should be fine (and is actually a nice trick I haven’t seen before), but maybe Cutadapt has a bug when you do that. You can achieve the same thing in a different way:
cutadapt --pair-adapters -a 18S=^GCTTGTCTCAAAGATTAAGCC...TCYAAGGAAGGCAGCAGG -A ^CCTGCTGCCTTCCTTRGA...GGCTTAATCTTTGAGACAAGC -m 1 -o cutadapted-SampleA_R1_001.fastq -p cutadapted-SampleA_R2_001.fastq --untrimmed-output=cutadapted-SampleA_untrimmed_R1_001.fastq --untrimmed-paired-output=cutadapted-SampleA_untrimmed_R2_001.fastq SampleA_S466_L001_R1_001.fastq SampleA_S466_L001_R2_001.fastq
If that command works, then there’s a bug that needs to be fixed.
cutadapt -g ^GCTTGTCTCAAAGATTAAGCC -o /dev/null SampleA_S466_L001_R1_001.fastq
gives this output
output1.txt
and
cutadapt -g ^GCTTGTCTCAAAGATTAAGCC -o /dev/null SampleA_S466_L001_R2_001.fastq
gives this output
output2.txt
Running
cutadapt --pair-adapters -g 18S=^GCTTGTCTCAAAGATTAAGCC -G ^CCTGCTGCCTTCCTTRGA -m 1 -o "cutadapted-SampleA-{name}_R1_001.fastq" -p "cutadapted-SampleA-{name}_R2_001.fastq" SampleA_S466_L001_R1_001.fastq SampleA_S466_L001_R2_001.fastq
had the same result as my first try, so almost all reads end up in the unknown fastq files
cutadapt --pair-adapters -a 18S=^GCTTGTCTCAAAGATTAAGCC...TCYAAGGAAGGCAGCAGG -A ^CCTGCTGCCTTCCTTRGA...GGCTTAATCTTTGAGACAAGC -m 1 -o cutadapted-SampleA_R1_001.fastq -p cutadapted-SampleA_R2_001.fastq --untrimmed-output=cutadapted-SampleA_untrimmed_R1_001.fastq --untrimmed-paired-output=cutadapted-SampleA_untrimmed_R2_001.fastq SampleA_S466_L001_R1_001.fastq SampleA_S466_L001_R2_001.fastq
also had the same result
output3.txt
and
cutadapt -g ^GCTTGTCTCAAAGATTAAGCC -o /dev/null SampleA_S466_L001_R2_001.fastq
gives this output
Sorry, I put in the wrong sequence. That should have been:
cutadapt -g ^CCTGCTGCCTTCCTTRGA -o /dev/null SampleA_S466_L001_R2_001.fastq
Regarding the output for the first command:
Total reads processed: 119,665 Reads with adapters: 42,867 (35.8%) Reads written (passing filters): 119,665 (100.0%)
This and your output3.txt
tell me that Cutadapt works as it should, but that there is likely a problem with your data. You said above that "R1 reads start with GCTTGTCTCAAAGATTAAGCC [...]", but this tells me that only ~36% of the R1 reads start with it. Or is that percentage ok (it is better than the 11% you got initially)? In any case, the output of the fixed command I wrote above would be helpful.
As a kind of quality control, try also to see whether the the Illumina adapter occurs in R1:
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -O 10 -o /dev/null SampleA_S466_L001_R1_001.fastq
It shouldn’t occur at all because the molecules are longer than the reads, so if it is found in a non-negligible number of reads, there could be a problem with the library itself. I’m thinking of primer dimers, but my experience with these types of problems is limited.
cutadapt -g ^CCTGCTGCCTTCCTTRGA -o /dev/null SampleA_S466_L001_R2_001.fastq
gives output2.txt
Again, ~30% of the reads start with the primer sequence.
It is better than the 11% I got initially, but it still seems very low...
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -O 10 -o /dev/null SampleA_S466_L001_R1_001.fastq
gives output_adapter.txt
If I read it correctly, the adapter occurs only in 0.2% of the reads, so it seems fine.
Looking at the statistics for both R1 and R2, I think what is going on is that the primer sequence is in many cases preceded by some nucleotides. However, an anchored adapter (that is, one using the ^
symbol) is found only when it is right at the start of the read.
Here are the detailed statistics for R1 again:
Sequence: GCTTGTCTCAAAGATTAAGCC; Type: anchored 5'; Length: 21; Trimmed: 42867 times
No. of allowed errors: 2
Overview of removed sequences
length count expect max.err error counts
20 71 0.0 2 0 68 3
21 14700 0.0 2 13110 1396 194
22 14877 0.0 2 0 13779 1098
23 13219 0.0 2 0 0 13219
Let me interpret the last row: This describes all adapter matches that removed 23 nucleotides from the read. The 0 0 13219
tells us that the adapter was found 0 times with no errors, 0 times with one error and 13219 times with 2 errors. Since the adapter only has a length of 21 nt, this means that the 13219 occurrences had actually two insertions right at the start. That is, the adapter did not actually start at the beginning. It was only found because two errors were allowed.
It seems to me that your adapters should not actually use the ^
symbol for anchoring. Is this consistent with your expectations?
You can try this command. It should find the primers anywhere within the read as long as they occur in full (o=21
/o=18
is the "minimum overlap length"):
cutadapt --pair-adapters -g '18S=^GCTTGTCTCAAAGATTAAGCC;o=21' -G 'CCTGCTGCCTTCCTTRGA;o=18' -m 1 -o "cutadapted-SampleA-{name}_R1_001.fastq" -p "cutadapted-SampleA-{name}_R2_001.fastq" SampleA_S466_L001_R1_001.fastq SampleA_S466_L001_R2_001.fastq
Thank you so much for taking the time to look into this!
Running
cutadapt --pair-adapters -g '18S=GCTTGTCTCAAAGATTAAGCC;o=21' -G 'CCTGCTGCCTTCCTTRGA;o=18' -m 1 -o "cutadapted-SampleA-{name}_R1_001.fastq" -p "cutadapted-SampleA-{name}_R2_001.fastq" SampleA_S466_L001_R1_001.fastq SampleA_S466_L001_R2_001.fastq
worked!
This is the output (for your reference): output_non_anchored.txt
Now my cutadapted-SampleA-18S_R1_001.fastq
and cutadapted-SampleA-18S_R2_001.fastq
are ~70.000 KB each!
Ah, this looks a lot better! I’m glad I could help. :smile:
Hello
I am running cutadapt/4.1 (with Python 3.7.8) in an HPC cluster. The command I am running is
cutadapt --pair-adapters -a 18S=^GCTTGTCTCAAAGATTAAGCC...TCYAAGGAAGGCAGCAGG -A ^CCTGCTGCCTTCCTTRGA...GGCTTAATCTTTGAGACAAGC -m 1 -o "cutadapted-SampleA-{name}_R1_001.fastq" -p "cutadapted-SampleA-{name}_R2_001.fastq" SampleA_S466_L001_R1_001.fastq SampleA_S466_L001_R2_001.fastq
I have attached the output output.txt
The input files were ~80.000 KB each, the unknown fastq files are ~70.000 KB each and my actual files are ~10.000 KB Do you have any idea why this is happening or how to correct it?
The PCR product is ~400 bp so the total amplicon length (including adapters, primer tails etc) is ~568 bp. MiSeq V3 (300bp paired end) was used.