Open rhpvorderman opened 11 months ago
I did some thinking and research. The best way to approach this is as follows:
3 is quite challenging, but by following the steps, cutadapt will already be useful for nanopore with chimeric reads at step 1, without requiring extra code.
@marcelm, could you help me a bit with this one? I am in currently investigating how to do this best.
I did find the sequence that dorado uses: https://github.com/nanoporetech/dorado/blob/acec121e438099741b690d49c7bff4bf25e1851c/dorado/splitter/ReadSplitter.h#L66
It uses a s string-matching library to get the position, so in theory cutadapt can leverage its existing alignment algorithm as well.
Since the chimeric read content for R10 chemistry is supposedly around 10%, the easiest approach for now is just to discard these reads rather than deal with splitting.
The sequence is TACTTCGTTCAGTTACGTATTGCT
which is 24 bp long. My current approach will be to run cutadapt with the following settings:
--cut 60
--cut -60
to cut of ends that always contain adapters
--minimum-length 500
to not bother with very short sequences.
--maximum-average-error-rate 0.1
to not bother with low quality sequences
-a TACTTCGTTCAGTTACGTATTGCT
--overlap 24
to only allow complete matches
-e 0.21
to allow 5 mismatches.
--action none
I want to keep them to see what kind of results I get
--revcomp
seems sane, as nanopore reads are single end.
--untrimmed-output filtered.fastq.gz
--output discarded.fastq.gz
That should only match sequences that are fully contained within the read due to the overlap setting. Is that correct?
@marcelm, could you help me a bit with this one? I am in currently investigating how to do this best.
Sure!
The sequence is
TACTTCGTTCAGTTACGTATTGCT
which is 24 bp long. My current approach will be to run cutadapt with the following settings:
Looks good, just quoting those options below for which I have comments.
--overlap 24
to only allow complete matches
You could use --overlap 99
if you do not want to have to count the bases. (It’s automatically reduced to the length of the adapter.)
-e 0.21
to allow 5 mismatches.
You can use -e 5
.
--revcomp
seems sane, as nanopore reads are single end.
Keep in mind this will "normalize" read orientation (the reads for which the adapter was found on the reverse complement will be output reverse-complemented); I’m not sure this is necessary or appropriate. Run your command once with --revcomp
and check the report to see whether you get a significant portion of matches to the reverse complement.
That should only match sequences that are fully contained within the read due to the overlap setting. Is that correct?
Yes!
So I tried this and it seems from the output that is most likely that only false positives are found. (Matches with 5 errors where massively overrepresented) . Also I found this: https://github.com/nanoporetech/dorado/blob/release-v0.8/documentation/SAM.md
Turns out there is a pi:Z:
tag, that contains the parent ID for a split read. So if pi
tags are present, dorado has already done the read splitting. And that appears to be the case for my dataset.
I am glad this is in the metadata. Makes my job a whole lot easier.
Currently I am researching ONT possibilities with cutadapt, and it seems that the most basic functionality can be achieved. Unfortunately after the adapters have been adequately cut, sequali still finds adapter sequences.
These are most likely due to chimeric reads, where reads are joined by adapter sequences. These reads should be split. With the newest chemistry the amount of chimeric reads is estimated at 10% (previously around 2%). These chimeric reads are not always split by the sequence provider and historic data may also contain the 2% reads because splitting was not available back then.
Since cutadapt already has a decent alignment algorithm that can detect sequences anywhere in the read, it should be possible to write a routine that detects chimeric reads.
The hard part I guess will be the actual splitting, were one read becomes two or more reads and feed that back into the pipeline. I can imagine that consideration wasn't a thing when cutadapt was designed.