nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

dorado demux: barcode recognition, "false positives" and "--barcode-both-ends" #961

Closed sklages closed 3 weeks ago

sklages commented 1 month ago

In quite a few datasets (with relatively small libraries) we observe a dramatic increase in "unclassified" data when using --barcode-both-ends: from 10% or less to 50% or more.

I was wondering what criteria dorado demux uses to recognise a barcode unambiguously when classifying. Does the demultiplexer require an exact match, or are mismatches allowed?

What does it take for a barcode to become a "false positive" (a potential problem when using single-ended barcode recognition)? The barcode sequences are 24 nt long; does "false positive" mean that, for example, "NB01" is misread as "NB10" from the same barcode set?

Or is this not just a basecalling issue? Are there other causes of "false positives" during barcoding?

malton-ont commented 1 month ago

Hi @sklages,

There are a number of checks to determine the barcode match. All of them start by selecting the "best" (lowest edit distance) result from the top and bottom barcodes in both variants.

For a double-ended barcode, we also look at the best results for the top and bottom barcodes. If both are confident in their classification (edit distance <= max_barcode_penalty) and close in value to each other, but they disagree on the barcode selected, then we mark the read as unclassified.

The --barcode-both-ends flag adds a check that the barcode penalties for the two ends must each be no greater than max_barcode_penalty. Without this flag we accept the more confident barcode (except in the circumstance described above).
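The checks described in this comment can be sketched roughly as follows. This is a minimal Python illustration, not dorado's actual code (which is C++, in BarcodeClassifier.cpp). The threshold values, the `CLOSE_MARGIN` name, and the `EndResult` and `classify` helpers are all assumptions made for the example. The sketch also assumes that, without the flag, the more confident end must itself pass max_barcode_penalty, and it adds no agreement requirement beyond what the comment states.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed values for illustration only; the real thresholds come from
# dorado's barcode kit configuration and may differ.
MAX_BARCODE_PENALTY = 9
CLOSE_MARGIN = 3  # hypothetical meaning of "close in value to each other"


@dataclass
class EndResult:
    barcode: str  # best-matching barcode name found at this end of the read
    penalty: int  # edit distance of that match (lower is more confident)


def classify(top: EndResult, bottom: EndResult,
             both_ends: bool = False) -> Optional[str]:
    """Return the assigned barcode name, or None for 'unclassified'."""
    top_ok = top.penalty <= MAX_BARCODE_PENALTY
    bottom_ok = bottom.penalty <= MAX_BARCODE_PENALTY

    # Both ends confident and similarly scored, but naming different
    # barcodes: the read is ambiguous, so it is left unclassified.
    if (top_ok and bottom_ok
            and abs(top.penalty - bottom.penalty) <= CLOSE_MARGIN
            and top.barcode != bottom.barcode):
        return None

    # --barcode-both-ends: each end must pass the threshold on its own.
    if both_ends and not (top_ok and bottom_ok):
        return None

    # Otherwise accept the more confident end (assumed to still need to
    # pass the threshold itself).
    best = top if top.penalty <= bottom.penalty else bottom
    if best.penalty > MAX_BARCODE_PENALTY:
        return None
    return best.barcode
```

Under these assumptions, a read with a clean barcode at one end and a poor match at the other, for example `classify(EndResult("NB01", 2), EndResult("NB10", 20))`, is assigned to NB01 by default but becomes unclassified with `both_ends=True`. That is one way the unclassified fraction can rise sharply when the flag is enabled.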

False positives occur when the "wrong" barcode matches more closely than the "correct" one. This may be due to sequencing errors but it can also occur if, for example, dorado has failed to split a read correctly (though we also mark as unclassified any reads where we confidently detect the flank regions somewhere in the middle of the sequence).
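To make the "wrong barcode matches more closely" case concrete, here is a small Python sketch that scores an observed sequence against a set of barcodes by edit distance. The two 10 nt barcode sequences and the helper names are invented for the example and are not real ONT barcodes. It uses plain Levenshtein distance over the whole string, which is a simplification and not necessarily the alignment method dorado applies.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: substitutions, insertions and deletions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


# Hypothetical barcodes that differ at only two positions.
BARCODES = {
    "BC_A": "ACGTACGTAC",
    "BC_B": "ACGTTCGTTC",
}


def best_match(observed: str) -> tuple[str, int]:
    """Return (barcode name, edit distance) of the closest barcode."""
    return min(((name, edit_distance(observed, seq))
                for name, seq in BARCODES.items()),
               key=lambda item: item[1])
```

With these made-up sequences, two substitution errors in a read that truly carries BC_A are enough to make it identical to BC_B, so the lowest edit distance picks the wrong barcode. Real 24 nt barcode sets are designed to sit much further apart, which makes this rarer but not impossible.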

All of this is found in https://github.com/nanoporetech/dorado/blob/release-v0.7/dorado/demux/BarcodeClassifier.cpp#L884 if you want to dive a bit deeper.

sklages commented 1 month ago

@malton-ont - Thanks for the detailed explanation.