marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

I can't separate trimmed and untrimmed reads #661

Open lavdn opened 1 year ago

lavdn commented 1 year ago

Hello! I am writing to seek help with primer removal in Cutadapt. I have paired-end sequences from an Illumina run. Illumina's own software removes the indexes and adapters, which are not of interest to me, but keeps the amplification primers that I use to obtain the fragment of interest. My amplicons are 300 bp, that is, 150 bp in R1 and 150 bp in R2, files that I have to merge to get my fragment of interest.

My problem appears when removing the primers from the sequences: what I want is one file with the reads where any of the primers was found (with the primer removed) and another file with the reads that did not match any of the primers. For that I use the following command line:

cutadapt -g file:primers.fasta -o outputname.fastq input

I have tried it on the raw sequence files and it gave me an odd problem when applying filters afterwards, especially when filtering sequences by size: almost all the sequences were removed after removing the primers. However, if I apply this command to remove the primers after applying the size and quality filters, it does work.

My biggest problem is that the output file contains both types of sequences: reads where a primer was found and removed, and also reads where no match was found.

Is there a way to separate this into two files?

I hope you understand me. Any help is more than welcome!! Thank you!!

marcelm commented 1 year ago

I am not entirely sure about your setup, but please check whether this section in the documentation applies to your situation: https://cutadapt.readthedocs.io/en/stable/recipes.html#trimming-amplicon-primers-from-paired-end-reads

To separate trimmed and untrimmed output, you can use the option --untrimmed-output. I have just added a section to the documentation that should explain it: https://cutadapt.readthedocs.io/en/latest/recipes.html#separating-trimmed-and-untrimmed-reads
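For the single-end command from your post, this could look roughly like the following sketch (output file names are just placeholders):

```shell
# Reads where a primer was found go to trimmed.fastq (primer removed);
# reads with no primer match go unchanged to untrimmed.fastq.
cutadapt -g file:primers.fasta \
    --untrimmed-output untrimmed.fastq \
    -o trimmed.fastq \
    input.fastq
```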

Please let me know if this answers your question!

lavdn commented 1 year ago

All my amplicons are approximately 300 bp, and the Illumina chemistry sequences 150 bp per read, so in all cases each read will contain only one of the primers, because a single read is not long enough to cover the whole amplicon. That is, once I merge the reads I will get my 300 bp amplicon, but in each of the files (R1 and R2) I will have reads of at most 150 bp, which are shorter than my amplicon. So, as I understand the instructions you pointed me to, I would only have to use -g ^FWDPRIMER -G ^REVPRIMER --discard-untrimmed to remove the primers from each of the reads. Thank you very much for the section on separating trimmed and untrimmed reads. I'll try it right now and hope it works! Thanks so much for the help! Cheers
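For reference, the full command from that recipe would look roughly like this (FWDPRIMER and REVPRIMER stand in for the actual primer sequences; file names are examples):

```shell
# Anchored 5' primers on R1 and R2; read pairs where a primer
# is not found at the 5' end are discarded entirely.
cutadapt -g ^FWDPRIMER -G ^REVPRIMER \
    --discard-untrimmed \
    -o trimmed.R1.fastq.gz -p trimmed.R2.fastq.gz \
    input.R1.fastq.gz input.R2.fastq.gz
```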

marcelm commented 1 year ago

Great! Not sure if it’s clear: The instructions in the amplicon primer trimming section assume that you do not need to keep the untrimmed reads, and then using --discard-untrimmed is the correct option. If you want to keep them, you need to use --untrimmed-output and --untrimmed-paired-output instead.

I’ll close this issue now, but feel free to comment if anything is still unclear.

lavdn commented 1 year ago

Yes! I want to solve another question: when I use the -q option to filter out reads whose quality is not good enough, I think I am making a mistake, because I want to filter reads whose mean quality (not the quality of the ends of the read) is too low. So I think -q is not the appropriate option for this. Can you help me with that?

marcelm commented 1 year ago

The only option in Cutadapt that looks at all the quality values in a read is --max-expected-errors: https://cutadapt.readthedocs.io/en/stable/reference.html#filtering-of-processed-reads I don't use it myself, so you would need to find out which value works best for you.
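A minimal sketch of how that filter could be used on its own, with a threshold of 1 expected error chosen purely as an example value to tune:

```shell
# Discard read pairs in which a read's total expected number of errors
# (computed from all of its base quality values) exceeds 1.
cutadapt --max-expected-errors 1 \
    -o filtered.R1.fastq.gz -p filtered.R2.fastq.gz \
    input.R1.fastq.gz input.R2.fastq.gz
```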

lavdn commented 1 year ago

I am very sorry for the delay; I have had a lot of work to do and had to set this topic aside. Thank you very much for your help, I will see how I can solve the quality problem.

On the other hand, if you could help me again: I have a problem performing the filtering. If I use the fastq file that results from filtering the raw data by quality and size, and then trim the primers on it, it works. However, if I run all the filters at once on the raw data file, the tool removes far more reads than it should. In the first case I had 147692 reads, which became 146882 once I applied the filters separately, but when I apply them all on the same command line I only have 442 reads left. Can you tell me why this is happening?

The thing is that I am in the process of creating an analysis plugin for NGS sequences, and if I implement the filters as I explained above, all on the same command line, I get the problem of filtering out too many reads.

Many thanks.

marcelm commented 1 year ago

I would have to see the exact command that you used before I could say anything. Feel free to replace any primer sequences with dummy sequences in case you want those to remain secret.

lavdn commented 1 year ago

Okay!!

If I use the file resulting from filtering the raw data, I do this:

cutadapt -q 20,20 -l 100 -o output.R1.fastq input.R1.fastq.gz 

and after that, I use:

cutadapt \
  -g ^file:primers.fasta \
  -G ^file:primers.fasta \
  --untrimmed-output=untrimmed.1.fastq.gz \
  --untrimmed-paired-output=untrimmed.2.fastq.gz \
  -o trimmed.1.fastq.gz \
  -p trimmed.2.fastq.gz \
  input.1.fastq \
  input.2.fastq

And I have no problem. But if I use:

cutadapt \
  -l 100 \
  -q 20,20 \
  -g ^file:primers.fasta \
  -G ^file:primers.fasta \
  --untrimmed-output=untrimmed.1.fastq.gz \
  --untrimmed-paired-output=untrimmed.2.fastq.gz \
  -o trimmed.1.fastq.gz \
  -p trimmed.2.fastq.gz \
  input.1.fastq.gz \
  input.2.fastq.gz

That command line doesn't work successfully, because it discards more reads than it should.

marcelm commented 1 year ago

A couple of differences between these two versions come to mind.

The command

cutadapt -q 20,20 -l 100 -o output.R1.fastq input.R1.fastq.gz 

only changes the R1 reads. Do you run a similar command for R2 as well?

(In general, avoid treating paired-end reads as two single-end datasets. This bypasses Cutadapt’s logic that checks whether the reads are properly paired. If you use the wrong option, the data gets out of sync.)
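A paired version of that first filtering step might look like this sketch (file names are placeholders):

```shell
# Quality- and length-trim R1 and R2 in one run so the two files
# stay in sync and pairing is verified by Cutadapt.
cutadapt -q 20 -l 100 \
    -o filtered.R1.fastq.gz -p filtered.R2.fastq.gz \
    input.R1.fastq.gz input.R2.fastq.gz
```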

Also, I don’t think I would recommend quality-trimming the 5' end when removing anchored 5' adapters. When just a couple of bases are quality trimmed, the adapter won’t be found anymore since quality trimming is done before adapter removal. So I suggest just -q 20 instead of -q 20,20.

In the version where you run Cutadapt twice, length trimming happens first, but when you use a single command, length trimming is always applied after adapter trimming.
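Putting these points together, a single-pass version of your second command could look like this sketch, with -q 20 so the 5' ends stay intact for the anchored primers (note that -l 100 is still applied after primer removal here):

```shell
cutadapt \
    -q 20 \
    -l 100 \
    -g ^file:primers.fasta \
    -G ^file:primers.fasta \
    --untrimmed-output untrimmed.1.fastq.gz \
    --untrimmed-paired-output untrimmed.2.fastq.gz \
    -o trimmed.1.fastq.gz \
    -p trimmed.2.fastq.gz \
    input.1.fastq.gz input.2.fastq.gz
```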

lavdn commented 1 year ago

Hi again, I've been on vacation.

Thanks a lot! I will take all this new information into account right now. I would like to ask one final question: when Cutadapt gives me a warning that the R1 and R2 files have a pairing error, does it not run the commands? I'll have problems with the merge later, right? What could have happened and how can I fix it?

marcelm commented 1 year ago

when cutadapt gives me a warning that the R1 and R2 files have some pairing error, does it not run the commands?

Yes, it will abort.

I'll have problems with the merge later, right?

Which merge do you mean?

What could have happened and how can I fix it?

This depends entirely on what you did with the files. The files that come from the sequencer will be correctly paired; only tools used subsequently can cause problems.