marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

Feature request: ONT Chimeric read splitting #747

Open rhpvorderman opened 11 months ago

rhpvorderman commented 11 months ago

I am currently researching what cutadapt can do for ONT (Oxford Nanopore Technologies) data, and it seems that the most basic functionality can be achieved. Unfortunately, after the adapters have been adequately cut, sequali still finds adapter sequences.

These are most likely due to chimeric reads, in which two or more reads are joined by adapter sequences. Such reads should be split. With the newest chemistry the proportion of chimeric reads is estimated at 10% (previously around 2%). Chimeric reads are not always split by the sequencing provider, and historic data may also contain the roughly 2% of chimeric reads because splitting was not available back then.

Since cutadapt already has a decent alignment algorithm that can detect sequences anywhere in the read, it should be possible to write a routine that detects chimeric reads.

The hard part, I guess, will be the actual splitting, where one read becomes two or more reads that are fed back into the pipeline. I can imagine that this wasn't a consideration when cutadapt was designed.
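To make the detection idea concrete, here is a minimal pure-Python sketch. It does not use cutadapt's own aligner; it is a textbook semi-global edit-distance search, and the names `find_adapter`, `looks_chimeric` and the `min_flank` threshold are hypothetical. The adapter sequence is the one quoted later in this thread. A read counts as a chimera candidate when a full-length adapter hit has substantial sequence on both sides.

```python
ADAPTER = "TACTTCGTTCAGTTACGTATTGCT"  # sequence quoted later in this thread


def find_adapter(read, adapter=ADAPTER, max_errors=5):
    """Return (end, errors) for the best full-length adapter hit, or None.

    Semi-global edit distance: the whole adapter must be aligned, but the
    alignment may start at any position in the read at no cost.
    """
    m = len(adapter)
    # prev[i] = cost of aligning adapter[:i] to a read substring ending here
    prev = list(range(m + 1))
    best = None
    for j, base in enumerate(read, start=1):
        cur = [0]  # free start anywhere in the read
        for i in range(1, m + 1):
            cost = 0 if adapter[i - 1] == base else 1
            cur.append(min(prev[i - 1] + cost,  # match or mismatch
                           prev[i] + 1,         # extra base in the read
                           cur[i - 1] + 1))     # adapter base missing
        if cur[m] <= max_errors and (best is None or cur[m] < best[1]):
            best = (j, cur[m])
        prev = cur
    return best


def looks_chimeric(read, adapter=ADAPTER, max_errors=5, min_flank=50):
    """True if an adapter hit has at least min_flank bases on both sides."""
    hit = find_adapter(read, adapter, max_errors)
    if hit is None:
        return False
    end, _ = hit
    approx_start = max(0, end - len(adapter))  # ignores indels in the hit
    return approx_start >= min_flank and len(read) - end >= min_flank
```

This runs in O(read length × adapter length) per read, so it only illustrates the logic; a real implementation would rely on cutadapt's optimized alignment code.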

rhpvorderman commented 11 months ago

I did some thinking and research. The best way to approach this is as follows:

  1. Publish the user guide with the current cutadapt code. Chimeric reads are detected by using adapter detection and using --discard to throw them away.
  2. Make a dedicated read splitter. Rather than splitting the read, the longest segment is presented as canonical.
  3. Look at how read splitting can be incorporated into the cutadapt single-end pipeline.

Step 3 is quite challenging, but by following the steps in order, cutadapt will already be useful for nanopore data with chimeric reads at step 1, without requiring extra code.
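Step 2 could look roughly like the sketch below. It assumes the adapter hits are already known as half-open `(start, end)` intervals on the read (how they are found is left open), and the function names `split_at_hits` and `longest_segment` are made up for illustration; nothing like this exists in cutadapt as far as this thread shows.

```python
def split_at_hits(read, hits):
    """Cut the read at adapter hits and return the non-adapter segments.

    hits: iterable of (start, end) half-open intervals covering adapters.
    """
    segments = []
    pos = 0
    for start, end in sorted(hits):
        if start > pos:
            segments.append(read[pos:start])
        pos = max(pos, end)  # also handles overlapping hits
    if pos < len(read):
        segments.append(read[pos:])
    return segments


def longest_segment(read, hits):
    """Return the longest non-adapter segment ("" if nothing is left)."""
    segments = split_at_hits(read, hits)
    return max(segments, key=len) if segments else ""
```

Keeping only the longest segment keeps the one-read-in, one-read-out shape of the pipeline, which is why it is easier than true splitting (step 3), at the cost of discarding the shorter segments.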

rhpvorderman commented 1 month ago

@marcelm, could you help me a bit with this one? I am currently investigating how best to do this.

I did find the sequence that dorado uses: https://github.com/nanoporetech/dorado/blob/acec121e438099741b690d49c7bff4bf25e1851c/dorado/splitter/ReadSplitter.h#L66

It uses a string-matching library to get the position, so in theory cutadapt can leverage its existing alignment algorithm as well.

Since the chimeric read content for R10 chemistry is supposedly around 10%, the easiest approach for now is just to discard these reads rather than deal with splitting.

The sequence is TACTTCGTTCAGTTACGTATTGCT which is 24 bp long. My current approach will be to run cutadapt with the following settings:

That should only match sequences that are fully contained within the read due to the overlap setting. Is that correct?

marcelm commented 1 month ago

> @marcelm, could you help me a bit with this one? I am currently investigating how best to do this.

Sure!

> The sequence is TACTTCGTTCAGTTACGTATTGCT which is 24 bp long. My current approach will be to run cutadapt with the following settings:

Looks good, just quoting those options below for which I have comments.

> • --overlap 24 to only allow complete matches

You could use --overlap 99 if you do not want to have to count the bases. (It’s automatically reduced to the length of the adapter.)

> • -e 0.21 to allow 5 mismatches.

You can use -e 5.
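The arithmetic behind `-e 0.21` can be checked with a few lines of Python. This sketch assumes that a fractional error rate is converted to an error count by multiplying with the match length and rounding down, which is how the cutadapt documentation describes it to my understanding; `allowed_errors` is a hypothetical helper, not a cutadapt function.

```python
import math

ADAPTER = "TACTTCGTTCAGTTACGTATTGCT"


def allowed_errors(error_rate, match_length):
    """Errors allowed for a fractional rate, assuming floor(rate * length)."""
    return math.floor(error_rate * match_length)


print(len(ADAPTER))                        # adapter length in bases
print(allowed_errors(0.21, len(ADAPTER)))  # rate used in the thread
print(allowed_errors(0.20, len(ADAPTER)))  # a slightly lower rate
```

With a 24 bp adapter, 0.21 × 24 = 5.04 rounds down to 5, while 0.20 × 24 = 4.8 would round down to 4. Giving the count directly, as in `-e 5`, avoids that calculation.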

> • --revcomp seems sane, as nanopore reads are single end.

Keep in mind this will "normalize" read orientation (the reads for which the adapter was found on the reverse complement will be output reverse-complemented); I’m not sure this is necessary or appropriate. Run your command once with --revcomp and check the report to see whether you get a significant portion of matches to the reverse complement.
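For a quick sanity check outside cutadapt, the orientation question can be approximated in plain Python. This is only a sketch: it uses exact substring matching as a stand-in for cutadapt's error-tolerant alignment, so it will undercount on noisy nanopore reads, and `orientation_counts` is a hypothetical helper.

```python
ADAPTER = "TACTTCGTTCAGTTACGTATTGCT"
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orientation_counts(reads, adapter=ADAPTER):
    """Count reads with an exact adapter hit per orientation."""
    rc = reverse_complement(adapter)
    counts = {"forward": 0, "reverse": 0, "none": 0}
    for read in reads:
        if adapter in read:
            counts["forward"] += 1
        elif rc in read:
            counts["reverse"] += 1
        else:
            counts["none"] += 1
    return counts
```

If the "reverse" count is negligible compared to "forward", searching the reverse complement adds little; the report from an actual cutadapt run with --revcomp remains the authoritative check.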

> That should only match sequences that are fully contained within the read due to the overlap setting. Is that correct?

Yes!

rhpvorderman commented 1 month ago

So I tried this, and it seems from the output that most likely only false positives are found (matches with 5 errors were massively overrepresented). I also found this: https://github.com/nanoporetech/dorado/blob/release-v0.8/documentation/SAM.md

It turns out there is a pi:Z: tag that contains the parent ID of a split read. So if pi tags are present, dorado has already done the read splitting. And that appears to be the case for my dataset.
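Checking for the tag does not need any special tooling. The sketch below assumes the reads are available as SAM text lines (for example unaligned SAM from the basecaller) and scans the optional fields, which start at the twelfth tab-separated column, for `pi:Z:`. The helper names are hypothetical.

```python
def parent_read_id(sam_line):
    """Return the pi:Z: parent read ID from a SAM record, or None."""
    fields = sam_line.rstrip("\n").split("\t")
    for field in fields[11:]:  # optional tags follow the 11 mandatory columns
        if field.startswith("pi:Z:"):
            return field[len("pi:Z:"):]
    return None


def count_split_reads(lines):
    """Return (reads_with_pi_tag, total_reads) for an iterable of SAM lines."""
    split = total = 0
    for line in lines:
        if line.startswith("@"):  # SAM header line
            continue
        total += 1
        if parent_read_id(line) is not None:
            split += 1
    return split, total
```

A non-zero first count indicates that dorado already split at least some reads in the dataset.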

I am glad this is in the metadata. Makes my job a whole lot easier.