rhpvorderman / sequali

Fast sequencing data quality metrics
GNU Affero General Public License v3.0
9 stars 0 forks source link

What's the meaning of adapter content? #138

Open dudududu12138 opened 3 months ago

dudududu12138 commented 3 months ago

Hi, I used your tool to check the performance of porechop. After I ran porechop, I checked the adapter content in my result file with sequali. But I don't understand the "adapter content" item in the html report. The figure below is my result. I have some questions. Hope you can help me. Thank you!

  1. What's the meaning of y axis ?
  2. Is this a cumulative percentage?
  3. How did you divide the positions into bins and calculate the values of each position bin?
  4. What are the numerator and denominator used to calculate proportions(y axis)? Is it the adapter length as a percentage of the reads length? 1713773465339
rhpvorderman commented 3 months ago

Y axis is the percentage of reads where the 12 bp probe for the adapter has been detected. In your case it seems to be in about ~25% of the cases. (I am not couunting polyA as it is common in the human genome)

  1. Yes, it is cumulative. Otherwise it is not noticable as it would be spread out.
  2. The bins are logarithmic. The bins should be at least 5 bp wide, hence the first part is compressed a little compared to a true logarithmic plot.
  3. It is the number of detections at this position divided by the total number of reads.

Some further advice: do not use porechop. I recommend using cutadapt. Nanopore adapters are very hard to detect because they occur at the ends of the reads where a lot of errors are. Aligning the adapter is almost impossible in these regions. The best solution is to simly cut of the ends. That should also get rid of adapters.

cutadapt -Z --cut 64 --cut -64  --max-average-error-rate 0.1 -o <output.fastq.gz> <input.fastq.gz>

Cutadapt has no issues with compressed input and also supports uBAM reading. --cut cuts off sequences at the begining, or at the end when a negative number is used. This gets rid of any adapters. --max-average-error-rate 0.1 only keeps Q10 reads and better (should be the majority of reads). The -Z flags improves compression performance quite a lot, so I recommend using that with cutadapt always. Unless you want a 5x runtime for a 20% smaller filesize.

dudududu12138 commented 3 months ago

Y axis is the percentage of reads where the 12 bp probe for the adapter has been detected. In your case it seems to be in about ~25% of the cases. (I am not couunting polyA as it is common in the human genome) 2. Yes, it is cumulative. Otherwise it is not noticable as it would be spread out. 3. The bins are logarithmic. The bins should be at least 5 bp wide, hence the first part is compressed a little compared to a true logarithmic plot. 4. It is the number of detections at this position divided by the total number of reads.

Some further advice: do not use porechop. I recommend using cutadapt. Nanopore adapters are very hard to detect because they occur at the ends of the reads where a lot of errors are. Aligning the adapter is almost impossible in these regions. The best solution is to simly cut of the ends. That should also get rid of adapters.

cutadapt -Z --cut 64 --cut -64  --max-average-error-rate 0.1 -o <output.fastq.gz> <input.fastq.gz>

Cutadapt has no issues with compressed input and also supports uBAM reading. --cut cuts off sequences at the begining, or at the end when a negative number is used. This gets rid of any adapters. --max-average-error-rate 0.1 only keeps Q10 reads and better (should be the majority of reads). The -Z flags improves compression performance quite a lot, so I recommend using that with cutadapt always. Unless you want a 5x runtime for a 20% smaller filesize.

Thanks for your reply! I have considered cutadapt, but due to my use of public data, some data did not provide adapter sequence. Do you have recommendation of adapter detector? And now I use another tool to cut adapters: pychopper. The github is https://github.com/epi2me-labs/pychopper . I test the performance of pychopper on the same data as before. The reads are cleaner but shorter. And this is the result: 1713785140026

dudududu12138 commented 3 months ago

I just test cutadapt with your recommend codes on the same data as before:

cutadapt -Z --cut 64 --cut -64  --max-average-error-rate 0.1 -o <output.fastq.gz> <input.fastq.gz>

This is the result: 1713785897496

It seems that cutadapt performed worse than porechop and pychopper. And it seems that cutadapt can not process internal adapters.

rhpvorderman commented 3 months ago

Well technically cutadapt can. But then you should take a look at its adapter features.

PyChopper looks really nice! Thanks for pointing that out to me. I didn't know of that tool. It seems very to the point for nanopore adapter trimming.

If I may ask, how did you hear of sequali? I am trying to get more users. One way is to write a paper about it, but this is not going very fast. (Also I find improving sequali more fun than writing about it).

dudududu12138 commented 3 months ago

Well technically cutadapt can. But then you should take a look at its adapter features.

PyChopper looks really nice! Thanks for pointing that out to me. I didn't know of that tool. It seems very to the point for nanopore adapter trimming.

If I may ask, how did you hear of sequali? I am trying to get more users. One way is to write a paper about it, but this is not going very fast. (Also I find improving sequali more fun than writing about it).

OK, then I will use pychopper to cut adapters. However, when the amount of data is large, the multi-threading feature of this tool seems to fail and no longer produces any results. I noticed your reply on the github issues of cutadapt. I think it's clear to display whether the adapter has been removed clearly.

rhpvorderman commented 3 months ago

OK, then I will use pychopper to cut adapters. However, when the amount of data is large, the multi-threading feature of this tool seems to fail and no longer produces any results.

I have noticed that many nanopore tools, with the exception of the basecaller (dorado) seem to be generally of very poor software quality. It was one of the reasons to start this project. Cutadapt is much better and it aligns the adapters properly, but, it needs to be provided with the correct adapter sequences. It does not do autodetect. This is inconvenient when working with public data. Sequali can give some hints as to what adapters are there, but unfortunately many nanopore adapters have shared sequences, so it is not possible to ascertain for certain.

I noticed your reply on the github issues of cutadapt. I think it's clear to display whether the adapter has been removed clearly.

Good to know. Thanks! The adapter detection feature is basically the same what FastQC does. FastQC is very useful for illumina adapters, but it does not detect nanopore ones. Technically the nanopore adapters could have been added to FastQC, but FastQC's string searching algorithm does not scale that well. So Sequali uses state-of-the-art multiple pattern matching to enable fast matching of a multitude of adapter probes.