nanoporetech / bonito

A PyTorch Basecaller for Oxford Nanopore Reads
https://nanoporetech.com/

Dropped ctc_min_accuracy/coverage parameter #214

Open jdemeul opened 2 years ago

jdemeul commented 2 years ago

I was wondering why the ctc_min_accuracy and ctc_min_coverage CLI parameters were dropped when generating training data and now seem to be fixed at 0.99 and 0.9, respectively.

I'm analysing highly modified DNA (5mC and 6mA) which results in lower accuracy of the out-of-the-box basecalling models. To increase accuracy I was refining the v3.1-3.3 models, and lowering the ctc_min_accuracy during training set generation to avoid biasing against highly modified (and hence more poorly basecalled) reads. After retraining, the model has basically learned to ignore the modifications and does considerably better (example in the figure below).

The issue is that with a fixed ctc_min_accuracy threshold at 0.99, only ~0.5% of my training data is written out and this data set is likely to be biased towards non-representative unmodified reads.

[Figure: GuppySUP-v33-v33refined-v33X2refined-denovo_alignmentaccuracy]

andreaswallberg commented 8 months ago

Dear @jdemeul did you manage to find a solution?

We are seeing coverage issues in the training step, resulting in few accepted reads being used to build the model and, later on, short basecalled reads with our R10 data.

jdemeul commented 8 months ago

Hi @andreaswallberg, unfortunately I don't think anything has changed on this (even after the recent debacle of the released Dorado models performing poorly on bacterial DNA due to overtraining). I ended up just modifying it myself in the source code here.

andreaswallberg commented 8 months ago

Hi @jdemeul , thanks for the feedback! Much appreciated.

Could you please indicate which values you changed those parameters to?

Could you link to (or send a PM or email, I am sure you can find me) the discussions regarding that bacterial basecalling issue? We are exploring the possibility that the 0.4.3 models misbehave for us, so it would be good to get some extra hints here.

jdemeul commented 8 months ago

I'd need to have a look at what values I used exactly, but I'd base them on the alignment accuracies you're getting with your own data anyhow (this was all R9.4.1). I'd recommend generating the plot above and picking your ctc_min_accuracy threshold from it, so that the bulk of your data is accepted with the standard model you're using. Based on the plot above, I probably used 0.85, followed by a second refinement round with higher thresholds and the first refined model.
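A minimal sketch of the threshold selection described above, assuming you already have a per-read alignment accuracy (as a fraction between 0 and 1) for reads basecalled with your standard model. The function names and the example accuracies are made up for illustration and are not part of bonito.

```python
import math


def fraction_kept(accuracies, threshold):
    """Fraction of reads whose alignment accuracy is at least `threshold`."""
    if not accuracies:
        return 0.0
    return sum(a >= threshold for a in accuracies) / len(accuracies)


def threshold_for_fraction(accuracies, keep_fraction):
    """Highest accuracy threshold that still keeps at least `keep_fraction` of reads."""
    if not accuracies:
        raise ValueError("no accuracies given")
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(accuracies, reverse=True)
    n_keep = math.ceil(keep_fraction * len(ranked))
    # The n_keep-th best read sets the threshold; all reads at or above it pass.
    return ranked[n_keep - 1]


# Hypothetical per-read accuracies, for illustration only.
accuracies = [0.995, 0.97, 0.93, 0.91, 0.90, 0.88, 0.87, 0.86, 0.82, 0.75]

# How much data survives a strict 0.99 cut-off?
print(fraction_kept(accuracies, 0.99))

# Which threshold would keep at least half of the reads?
print(threshold_for_fraction(accuracies, 0.5))
```

The returned value is only a starting point for ctc_min_accuracy. It is still worth checking it against the accuracy plot, so that the threshold sits below the bulk of the distribution rather than inside it.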

For the bacterial basecalling research model, see the post on the Nanopore community website here and the actual model res_dna_r10.4.1_e8.2_400bps_sup@2023-09-22_bacterial-methylation on the Rerio GitHub.