nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Wrong values in sm & sd SAM tags #874

Closed vadanamu closed 1 week ago

vadanamu commented 3 weeks ago

Issue Report

Please describe the issue:

After updating to Dorado 0.7.0, the sm and sd tags in the output BAM files contain 'wrong' values. As far as I understand, the sm tag is the mean of the signal and the sd tag is the dispersion of the signal.

Previous to Dorado 0.7.0 (In my case, 0.4.3), the sm and sd tags contained values such as:

no.  sm                  sd
0    76.47213745117188   17.07990837097168
1    77.17393493652344   16.9936466217041
2    76.5774154663086    16.562335968017578
3    77.08621215820312   15.872238159179688
4    73.68836212158203   14.405781745910645
5    77.355224609375     15.354665756225586
6    77.49559020996094   16.30354881286621
7    75.54226684570312   17.252431869506836
8    78.10965728759766   16.47607421875
9    76.51892852783203   17.07990837097168
10   78.1213607788086    15.268403053283691

These values could be used to normalise the signal.
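As a minimal sketch of that normalisation, assuming the tags are applied as a shift and a scale, (x - sm) / sd, in line with the 'pA to ~0-mean/1-sd' description in SAM.md: the function name and the sample values below are hypothetical and are not part of Dorado.

```python
def normalise_signal(signal_pa, sm, sd):
    """Shift a pA signal by sm and scale it by sd: (x - sm) / sd.

    Assumption: sm and sd are the values read from the read's sm:f and sd:f tags.
    """
    if sd == 0:
        raise ValueError("sd must be non-zero")
    return [(x - sm) / sd for x in signal_pa]


# Hypothetical pA samples, using the sm/sd pair from row 0 of the 0.4.3 output.
sm, sd = 76.47213745117188, 17.07990837097168
example = normalise_signal([sm, sm + sd, sm - sd], sm, sd)
print(example)
```

With sensible sm and sd values, a sample equal to sm maps to about 0 and a sample one sd above it maps to about 1.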

But in Dorado 0.7.0, the sm and sd tags contain values such as:

no.  sm                   sd
0    -796.1599731445312   0.008466074243187904
1    -820.1599731445312   0.008466074243187904
2    -825.1599731445312   0.008466074243187904
3    -820.1599731445312   0.008466074243187904
4    -808.1599731445312   0.008466074243187904
5    -793.1599731445312   0.008466074243187904
6    -808.1599731445312   0.008466074243187904
7    -833.1599731445312   0.008466074243187904
8    -809.1599731445312   0.008466074243187904
9    -806.1599731445312   0.008466074243187904
10   -826.1599731445312   0.008466074243187904

The sm values seem to have been multiplied by roughly -10, and the sd tags all contain the same value (0.008466...). I think that this is a bug introduced in 0.7.0. I'd be grateful for any fix, update, or workaround.

Steps to reproduce the issue:

Run RNA004 basecalling on the same POD5 data using Dorado 0.7.0 and a previous version. When you compare the BAM/SAM files, you will see the differences in the sm and sd tags.
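One way to make that comparison concrete is sketched below. It assumes both outputs are available as SAM text (for example, BAM converted with samtools view) and that the tags appear as sm:f:<value> and sd:f:<value>. The helper names are hypothetical, and only the Python standard library is used.

```python
def read_scaling_tags(sam_lines):
    """Map read ID -> (sm, sd) from SAM text lines; a missing tag becomes None."""
    tags = {}
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        values = {}
        # Optional tags start after the 11 mandatory SAM fields.
        for field in fields[11:]:
            parts = field.split(":", 2)
            if len(parts) == 3 and parts[0] in ("sm", "sd") and parts[1] == "f":
                values[parts[0]] = float(parts[2])
        tags[fields[0]] = (values.get("sm"), values.get("sd"))
    return tags


def compare_scaling(old_lines, new_lines):
    """Return reads present in both inputs whose (sm, sd) pairs differ."""
    old = read_scaling_tags(old_lines)
    new = read_scaling_tags(new_lines)
    return {rid: (old[rid], new[rid])
            for rid in old if rid in new and old[rid] != new[rid]}
```

Passing the lines of the two SAM files (for example, open file objects) to compare_scaling returns, per read ID, the old and new (sm, sd) pairs wherever they disagree.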

Run environment:

Logs

vadanamu commented 3 weeks ago

After going through the repo, I see that the signal normalisation method has changed for the new basecaller models.

Maybe the dorado/documentation/SAM.md file needs to be updated. It still says 'pA to ~0-mean/1-sd', which is not correct for the new basecaller models. I think that the values in the updated sm and sd tags have to be applied to ADC values, not pA values.

vadanamu commented 3 weeks ago

Additionally, in my opinion, some documentation, notes, or warning messages regarding this change might be helpful for users who use the sm and sd tags emitted by Dorado.

iiSeymour commented 1 week ago

This was a bug and has been fixed in v0.7.2 - thanks @vadanamu.