nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Scoring options do not work when demultiplexing using custom barcodes #543

Closed xieyy46 closed 1 month ago

xieyy46 commented 10 months ago

Hi dorado team, Thank you for your excellent work! I have tried to demultiplex my reads using custom barcodes, by specifying the custom_barcodes.toml and custom_barcodes.fa. However, I found that no matter how I modify min_soft_flank_threshold and min_hard_flank_threshold (I tried 0.05, 0.1, 0.2, 0.3, 0.5), the number of unclassified reads remains unchanged.

The command line:

$ dorado demux dorado/calls.bam --barcode-arrangement dorado/custom_barcodes.toml --barcode-sequences dorado/custom_barcodes.fa --output-dir dorado/demux -t 25

tijyojwad commented 10 months ago

Hi @xieyy46 - can you post the toml file you're using?

tijyojwad commented 10 months ago

are you getting any classifications? and are the reads short? e.g. <150 bp?

xieyy46 commented 10 months ago

Hi, the toml file I used is shown below, image

and I did get normal barcode classification results (and the reads are not short),

the only problem is that no matter how I modify min_soft_flank_threshold and min_hard_flank_threshold, the number of unclassified reads remains unchanged (about 12% of all reads).

tijyojwad commented 10 months ago

Can you try with v0.5.1? There was one parameter that was not taken from the config; perhaps that's what is affecting your results.

xieyy46 commented 10 months ago

I tried dorado v0.5.1, but again, the min_soft_flank_threshold and min_hard_flank_threshold settings in the toml file had no effect.

tijyojwad commented 10 months ago

Can you check how long the unclassified reads are?

xieyy46 commented 10 months ago

I checked the read lengths, but did not find anything unusual about the lengths of the unclassified reads. image

Could you test the issue with your own data?

tijyojwad commented 10 months ago

Gotcha, yeah. You can debug in more detail by looking at the alignment for a specific read. e.g. if you pick a read id from the unclassified bam, you can run

$ echo <read-id> > reads.txt
$ dorado demux dorado/calls.bam --barcode-arrangement dorado/custom_barcodes.toml --barcode-sequences dorado/custom_barcodes.fa --output-dir dorado/demux -t 25 --read-ids reads.txt -vv

This will run dorado in trace mode and output detailed alignments and scoring, etc. Would be interesting to see what's happening

tijyojwad commented 9 months ago

Hi @xieyy46 any updates on this?

xieyy46 commented 9 months ago

Hi, I apologize for the delay. I'll check this later.

xieyy46 commented 9 months ago

Hi, I just selected one read to run dorado demux in trace mode.

I set the scoring options in the toml file as shown below:

[scoring]
min_soft_barcode_threshold = 0.2
min_hard_barcode_threshold = 0.2
min_soft_flank_threshold = 0.4
min_hard_flank_threshold = 0.4
min_barcode_score_dist = 0.05

The actual flank score is 0.95238096, and the scores for each barcode are: 0.583333 BC01, 0.5 BC03, 0.416667 BC02, 0.375 BC04. Both the flank score and the barcode scores are above their thresholds, and the score difference between the best and second-best barcode (about 0.083) is above 0.05. So this read should be classified as BC01. However, dorado did not assign this read to any barcode.

So are the scoring options set in the toml file not being read by dorado?
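
The manual check described above can be written out as a short sketch. This is not dorado's code: the function `expected_call` is hypothetical, and it assumes a read passes when the flank score and best barcode score meet both the soft and hard thresholds and the top two scores are at least `min_barcode_score_dist` apart.

```python
# Values copied from the trace output and the [scoring] section above.
scoring = {
    "min_soft_barcode_threshold": 0.2,
    "min_hard_barcode_threshold": 0.2,
    "min_soft_flank_threshold": 0.4,
    "min_hard_flank_threshold": 0.4,
    "min_barcode_score_dist": 0.05,
}
flank_score = 0.95238096
barcode_scores = {"BC01": 0.583333, "BC03": 0.5, "BC02": 0.416667, "BC04": 0.375}


def expected_call(flank_score, barcode_scores, scoring):
    """Hypothetical helper: the classification expected from the config alone."""
    ranked = sorted(barcode_scores.items(), key=lambda kv: kv[1], reverse=True)
    (best_name, best_score), (_, second_score) = ranked[0], ranked[1]

    # Assumption: a score must clear both the soft and the hard threshold.
    flank_ok = flank_score >= max(
        scoring["min_soft_flank_threshold"], scoring["min_hard_flank_threshold"]
    )
    barcode_ok = best_score >= max(
        scoring["min_soft_barcode_threshold"], scoring["min_hard_barcode_threshold"]
    )
    # Gap between best and second-best barcode (about 0.083 for this read).
    dist_ok = (best_score - second_score) >= scoring["min_barcode_score_dist"]

    return best_name if (flank_ok and barcode_ok and dist_ok) else "unclassified"


print(expected_call(flank_score, barcode_scores, scoring))
```

Under these assumptions the read comes out as BC01, which is why the unclassified result looks inconsistent with the config.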

xieyy46 commented 9 months ago

By the way, I noticed another issue: the candidate barcode sequence extracted before mask1_rear is shifted one base towards the rear sequence. Could this be an error in the dorado demux code?

An example is shown below: image

the candidate barcode sequence should be "AAAAAAGTTGTCGGTGTCTTTGTG",

however, dorado extracted "AAAAAGTTGTCGGTGTCTTTGTGC",

one base shift?
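
A quick way to confirm the shift is to compare the two 24-base sequences quoted above. The sketch below is only a string check on those two sequences, not dorado's extraction code; `shift_between` is a hypothetical helper name.

```python
expected = "AAAAAAGTTGTCGGTGTCTTTGTG"   # sequence the barcode window should contain
extracted = "AAAAAGTTGTCGGTGTCTTTGTGC"  # sequence reported in the dorado trace


def shift_between(expected, extracted, max_shift=3):
    """Return the smallest k > 0 for which the extracted window starts
    k bases after the expected one, or None if no such k is found."""
    for k in range(1, max_shift + 1):
        # If the window moved k bases towards the rear, the tail of the
        # expected sequence matches the head of the extracted sequence.
        if expected[k:] == extracted[: len(extracted) - k]:
            return k
    return None


print(shift_between(expected, extracted))
```

For these two sequences the overlap matches at a shift of one base, with the trailing C coming from the sequence after the barcode window.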

tijyojwad commented 9 months ago

Ah thanks for posting the trace details, now I can see what's going on. There are 2 issues here (both would need to be addressed in another build of dorado) -

  1. There's still one hard-coded filter in the code that checks that the top 2 scores are at least 0.1 apart. That was kept under the assumption that anything that close in score is likely not safe to classify anyway. Since your top 2 scores are less than 0.1 apart, the read is being ignored. We can make that overridable by using min(config value, 0.1).
  2. There does seem to be an off-by-one error when extracting the barcode sequence, at least when there are no front flanks. I will get a fix in for this.
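
The effect described in item 1 can be illustrated with a minimal sketch. This is not dorado's implementation: the constant and function names are hypothetical, and only the 0.1 separation, the `min(config value, 0.1)` idea, and the scores come from this thread.

```python
HARD_CODED_MIN_DIST = 0.1  # fixed separation between the top 2 scores (item 1)


def passes_separation(best, second, min_dist):
    """True when the best score leads the second-best by at least min_dist."""
    return (best - second) >= min_dist


def proposed_min_dist(config_value):
    """Proposed override from item 1: min(config value, 0.1)."""
    return min(config_value, HARD_CODED_MIN_DIST)


# Top two barcode scores from the trace (BC01 and BC03), about 0.083 apart.
best, second = 0.583333, 0.5

# Current behaviour as described: the fixed 0.1 filter rejects the read.
current = passes_separation(best, second, HARD_CODED_MIN_DIST)

# With min_barcode_score_dist = 0.05 from the toml, the proposed rule accepts it.
proposed = passes_separation(best, second, proposed_min_dist(0.05))

print(current, proposed)
```

With the values from the trace, the fixed filter rejects the read while the proposed rule accepts it, which would explain why changing the toml thresholds did not change the number of unclassified reads.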

tijyojwad commented 5 months ago

Hi @xieyy46 - since dorado v0.6.0 we've improved the barcoding setup. The config parameters are now also different (details here) and hopefully more relatable. Can you try again with the newer build?