dorado demux: barcode recognition, "false positives" and "--barcode-both-ends"

sklages commented 1 month ago

In quite a few datasets (with relatively small libraries) we observe a dramatic increase in "unclassified" data when using --barcode-both-ends, from 10% or less to 50% or more.

I was wondering what criteria dorado demux uses to assign a read to a single barcode. Does the demultiplexer require an exact match, or are mismatches allowed?

What does it take for a barcode to become a "false positive" (a potential problem when using single-ended barcode recognition)? The barcode sequences are 24nt; does "false positive" mean that e.g. "NB01" mutates to "NB10" of the same barcode set?

Or is this not just a basecalling issue? Are there other causes of "false positives" during barcoding?

malton-ont commented 1 month ago

Hi @sklages,

There are a number of checks to determine the barcode match. All start with selecting the "best" (lowest edit distance) result from the top and bottom barcodes in both variants.

The most basic check is whether the edit distance of the barcode sequence (+padding) is no greater than max_barcode_penalty, and whether the flank score is no less than min_flank_score.
If these pass, and the best result is sufficiently better than the second best result (by at least min_barcode_penalty_dist), we accept the classification.
Otherwise we check that the best result is at least min_separation_only_dist better than the second best result, and that the barcodes were found close enough to the ends of the read.
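
As a rough illustration of the three checks above, here is a minimal Python sketch. It is not dorado's code: the threshold values are placeholders rather than dorado's defaults, the Candidate fields and the near_read_end flag are hypothetical names, and reading "Otherwise" as a fallback that applies whenever the first path fails is my assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Thresholds:
    # Placeholder values for illustration only; NOT dorado's defaults.
    max_barcode_penalty: int = 9
    min_barcode_penalty_dist: int = 3
    min_separation_only_dist: int = 6
    min_flank_score: float = 0.5


@dataclass
class Candidate:
    barcode: str
    penalty: int          # edit distance of the barcode (+padding); lower is better
    flank_score: float    # higher is better
    near_read_end: bool   # hypothetical flag: barcode found close to a read end


def classify(candidates: List[Candidate], t: Optional[Thresholds] = None) -> str:
    t = t or Thresholds()
    ranked = sorted(candidates, key=lambda c: c.penalty)
    best = ranked[0]
    # Gap between the best and second-best penalties (infinite if only one candidate).
    gap = ranked[1].penalty - best.penalty if len(ranked) > 1 else float("inf")

    # Basic check: penalty and flank score are within limits.
    basic_ok = (best.penalty <= t.max_barcode_penalty
                and best.flank_score >= t.min_flank_score)
    if basic_ok and gap >= t.min_barcode_penalty_dist:
        return best.barcode

    # Fallback (assumed reading of "Otherwise"): a large separation plus
    # a barcode located near the end of the read.
    if gap >= t.min_separation_only_dist and best.near_read_end:
        return best.barcode

    return "unclassified"


clear_win = [Candidate("NB01", 2, 0.9, True), Candidate("NB10", 8, 0.9, True)]
too_close = [Candidate("NB01", 7, 0.9, True), Candidate("NB10", 8, 0.9, True)]
print(classify(clear_win), classify(too_close))
```

The second call returns "unclassified" even though both penalties are under the limit, because the best and second-best results are only one edit apart.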

For a double-ended barcode, we also look at the best results for the top and bottom barcodes. If both are confident in their classification (edit distance <= max_barcode_penalty) and are close in value to each other, but they disagree on the barcode selected, then we mark the read as unclassified.
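
The top/bottom disagreement rule can be sketched like this. Again this is only an illustration: the comment does not name the parameter that defines "close in value", so close_dist is a hypothetical name, and both default values are placeholders.

```python
from typing import NamedTuple


class EndResult(NamedTuple):
    barcode: str
    penalty: int  # edit distance for the best barcode found at this end


def ends_conflict(top: EndResult, bottom: EndResult,
                  max_barcode_penalty: int = 9,  # placeholder value
                  close_dist: int = 3) -> bool:  # hypothetical parameter
    """Return True if the read should be marked unclassified because both
    ends are confident, similarly scored, and name different barcodes."""
    both_confident = (top.penalty <= max_barcode_penalty
                      and bottom.penalty <= max_barcode_penalty)
    similar = abs(top.penalty - bottom.penalty) <= close_dist
    return both_confident and similar and top.barcode != bottom.barcode


print(ends_conflict(EndResult("NB01", 2), EndResult("NB10", 3)))
print(ends_conflict(EndResult("NB01", 1), EndResult("NB10", 8)))
```

In the second call the top barcode is clearly the more confident of the two, so under this reading the read is not rejected by this rule.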

The --barcode-both-ends flag adds a check that the barcode penalty at each of the two ends must individually be no greater than max_barcode_penalty. Without this flag we accept the more confident barcode (except in the above circumstance).
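
To show why this flag can push so many reads into "unclassified", here is a small sketch that applies only the penalty check to a few made-up reads. The penalties and the limit of 9 are invented for illustration, and all the other checks are ignored.

```python
def accepted(top_penalty: int, bottom_penalty: int, both_ends: bool,
             max_barcode_penalty: int = 9) -> bool:  # placeholder limit
    if both_ends:
        # With --barcode-both-ends: each end must pass on its own.
        return (top_penalty <= max_barcode_penalty
                and bottom_penalty <= max_barcode_penalty)
    # Without the flag: the more confident end decides.
    return min(top_penalty, bottom_penalty) <= max_barcode_penalty


# Hypothetical (top, bottom) penalties for four reads.
reads = [(2, 3), (2, 15), (14, 1), (12, 13)]
kept_default = sum(accepted(t, b, False) for t, b in reads)
kept_both = sum(accepted(t, b, True) for t, b in reads)
print(kept_default, kept_both)
```

Any read with one truncated or poorly basecalled end passes by default but fails once both ends are required, which fits a large jump in the unclassified fraction.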

False positives occur when the "wrong" barcode matches more closely than the "correct" one. This may be due to sequencing errors but it can also occur if, for example, dorado has failed to split a read correctly (though we also mark as unclassified any reads where we confidently detect the flank regions somewhere in the middle of the sequence).

All of this is found in https://github.com/nanoporetech/dorado/blob/release-v0.7/dorado/demux/BarcodeClassifier.cpp#L884 if you want to dive a bit deeper.
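
As a toy illustration of the false-positive case, the sketch below picks the barcode with the lowest edit distance to an observed sequence. The 8 nt sequences are invented and much shorter than the real 24 nt barcodes, and plain Levenshtein distance is used here as a stand-in for whatever alignment scoring dorado actually applies.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # match or substitution
        prev = curr
    return prev[-1]


# Hypothetical barcode set (toy sequences, not real ONT barcodes).
barcodes = {"BC_A": "ACGTACGT", "BC_B": "ACGTTCGA"}

# A read that truly carries BC_A, but with two substitution errors.
observed = "ACGTTCGA"

distances = {name: edit_distance(seq, observed) for name, seq in barcodes.items()}
best = min(distances, key=distances.get)
print(distances, best)
```

Here two errors are enough to make the wrong barcode the closest match, which is the sense of "false positive" described above.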

sklages commented 1 month ago

@malton-ont - Thanks for the detailed explanation.

dorado demux: barcode recognition, "false positives" and "--barcode-both-ends" #961