marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

Feature request: ONT Chimeric read splitting #747

Open rhpvorderman opened 6 months ago

rhpvorderman commented 6 months ago

I am currently researching what cutadapt can do for ONT data, and it seems that the most basic functionality can be achieved. Unfortunately, after the adapters have been adequately cut, sequali still finds adapter sequences.

These are most likely due to chimeric reads, where reads are joined by adapter sequences. These reads should be split. With the newest chemistry, the proportion of chimeric reads is estimated at 10% (previously around 2%). Chimeric reads are not always split by the sequence provider, and historic data may also contain the roughly 2% of chimeric reads because splitting was not available back then.

Since cutadapt already has a decent alignment algorithm that can detect sequences anywhere in the read, it should be possible to write a routine that detects chimeric reads.
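To make the detection idea concrete, here is a minimal sketch of approximate adapter search anywhere in a read. It is not cutadapt's aligner: it is a plain-Python edit-distance scan (Sellers-style approximate substring matching), and the names `best_adapter_hit`, `looks_chimeric`, `max_errors` and `min_flank` are hypothetical, chosen only for illustration. The idea is that a read is suspicious when an adapter hit has a substantial amount of sequence on both sides.

```python
from typing import Optional, Tuple


def best_adapter_hit(read: str, adapter: str, max_errors: int) -> Optional[Tuple[int, int]]:
    """Return (end, errors) of the lowest-error adapter occurrence, or None.

    `end` is the exclusive end position of the match in the read.
    """
    m = len(adapter)
    # prev[i]: edit distance between adapter[:i] and the best read
    # substring ending at the previous read position.
    prev = list(range(m + 1))
    best = None
    for end, base in enumerate(read, start=1):
        cur = [0] * (m + 1)  # cur[0] == 0: a match may start anywhere
        for i in range(1, m + 1):
            mismatch = 0 if adapter[i - 1] == base else 1
            cur[i] = min(prev[i - 1] + mismatch, prev[i] + 1, cur[i - 1] + 1)
        if cur[m] <= max_errors and (best is None or cur[m] < best[1]):
            best = (end, cur[m])
        prev = cur
    return best


def looks_chimeric(read: str, adapter: str, max_errors: int = 2, min_flank: int = 20) -> bool:
    """Flag reads with an adapter hit that has sequence on both sides."""
    hit = best_adapter_hit(read, adapter, max_errors)
    if hit is None:
        return False
    end, _ = hit
    # Approximation: assumes the match is as long as the adapter (ignores indels).
    start = max(0, end - len(adapter))
    return start >= min_flank and len(read) - end >= min_flank
```

An adapter hit near either end of the read is ordinary adapter contamination, so only internal hits are flagged. The `min_flank` default of 20 is an arbitrary assumption and would need tuning on real data.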

The hard part, I guess, will be the actual splitting, where one read becomes two or more reads that are fed back into the pipeline. I can imagine that this wasn't a consideration when cutadapt was designed.
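As a rough illustration of what "one read becomes two or more reads" could mean at the record level, the sketch below cuts a read at already-detected adapter positions and emits one record per remaining segment. Everything here is an assumption for illustration: the `Record` type, the `split_record` function, the `_1`, `_2` name suffixes and the `min_length` parameter are hypothetical and not part of cutadapt.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Record(NamedTuple):
    """Hypothetical stand-in for a FASTQ record."""
    name: str
    sequence: str
    qualities: str


def split_record(
    record: Record,
    adapter_spans: Sequence[Tuple[int, int]],
    min_length: int = 1,
) -> List[Record]:
    """Remove the given (start, end) adapter spans and return the pieces.

    Spans are half-open read coordinates. Pieces shorter than
    `min_length` are dropped. A read without spans is returned unchanged.
    """
    if not adapter_spans:
        return [record]
    kept = []
    pos = 0
    for start, end in sorted(adapter_spans):
        if start - pos >= min_length:
            kept.append((pos, start))
        pos = max(pos, end)  # also handles overlapping spans
    if len(record.sequence) - pos >= min_length:
        kept.append((pos, len(record.sequence)))
    # Suffix the name so that the pieces stay traceable to the parent read.
    return [
        Record(f"{record.name}_{n}", record.sequence[a:b], record.qualities[a:b])
        for n, (a, b) in enumerate(kept, start=1)
    ]
```

The function returns a list rather than a single record, which is exactly the part that a one-record-in, one-record-out pipeline step would have to accommodate.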

rhpvorderman commented 6 months ago

I did some thinking and research. The best way to approach this is as follows:

  1. Publish the user guide with the current cutadapt code. Chimeric reads are detected using adapter detection and thrown away using --discard.
  2. Make a dedicated read splitter. Rather than splitting the read, the longest segment is presented as canonical.
  3. Look at how read splitting can be incorporated into the cutadapt single-end pipeline.

Step 3 is quite challenging, but by following these steps in order, cutadapt will already be useful for nanopore data with chimeric reads at step 1, without requiring extra code.
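The "longest segment as canonical" policy from step 2 can be sketched as follows. This is only an illustration under assumptions: `longest_segment` is a hypothetical helper, and the adapter spans are assumed to be given as half-open (start, end) coordinates inside the read, however they were detected.

```python
from typing import Sequence, Tuple


def longest_segment(length: int, adapter_spans: Sequence[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (start, end) of the longest adapter-free stretch of a read.

    `length` is the read length. On ties, the leftmost stretch wins.
    """
    best = (0, 0)
    pos = 0
    # A zero-width sentinel span at the read end closes the final stretch.
    for start, end in sorted(adapter_spans) + [(length, length)]:
        if start - pos > best[1] - best[0]:
            best = (pos, start)
        pos = max(pos, end)
    return best


# Usage: keep one record per read, trimmed to its longest stretch.
sequence = "AAAA" + "TTGG" + "CCCCCC"
start, end = longest_segment(len(sequence), [(4, 8)])
canonical = sequence[start:end]
```

Because exactly one record comes out for each record that goes in, this policy fits a one-to-one read modification step, which is presumably why it is easier than full splitting.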