Open jdemeul opened 2 years ago
Dear @jdemeul, did you manage to find a solution?
We are seeing coverage issues in the training step, resulting in few accepted reads being used to build the model and, later on, short basecalled reads with our R10 data.
Hi @andreaswallberg, unfortunately I don't think anything has changed on this (even after the recent debacle of the released Dorado models performing poorly on bacterial DNA due to overtraining). I ended up just modifying the thresholds myself in the source code here.
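For context, the change amounts to relaxing two hardcoded acceptance thresholds. This is only a minimal sketch of the kind of filter involved; the constant names and the `accept_read` helper are hypothetical, and the real values live in the training-data generation code of the basecaller:

```python
# Hypothetical sketch of the fixed filters discussed in this thread.
# In recent versions these are constants rather than CLI flags.
CTC_MIN_ACCURACY = 0.99  # previously configurable via --ctc-min-accuracy
CTC_MIN_COVERAGE = 0.9   # previously configurable via --ctc-min-coverage

def accept_read(accuracy: float, coverage: float,
                min_accuracy: float = CTC_MIN_ACCURACY,
                min_coverage: float = CTC_MIN_COVERAGE) -> bool:
    """Return True if a read passes both training-data filters."""
    return accuracy >= min_accuracy and coverage >= min_coverage
```

Lowering `min_accuracy` in the source (e.g. to 0.85 for heavily modified reads) is the workaround described above.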
Hi @jdemeul , thanks for the feedback! Much appreciated.
Could you please indicate the values you changed those parameters to?
Could you link to (or send by PM or email; I am sure you can find me) the discussions regarding that bacterial basecalling issue? We are exploring the possibility that the 0.4.3 models misbehave for us, so extra hints here would be very welcome.
I'd need to have a look at what values I used exactly, but I'd base them on the alignment accuracies you're getting with your own data (this was all R9.4.1 anyway).
I'd recommend generating the plot above and picking your `ctc_min_accuracy` threshold based on that, so that you accept the bulk of your data with the standard model you're using (based on the plot above, I probably used 0.85, followed by a second refinement with higher thresholds using the first refined model).
For the bacterial basecalling research model, see the post on the Nanopore Community website here and the actual model, res_dna_r10.4.1_e8.2_400bps_sup@2023-09-22_bacterial-methylation, in the Rerio GitHub repository.
I was wondering why the `ctc_min_accuracy` (and `ctc_min_coverage`) CLI parameters were dropped when generating training data; they now seem to be fixed at 0.99 and 0.9, respectively.

I'm analysing highly modified DNA (5mC and 6mA), which results in lower accuracy from the out-of-the-box basecalling models. To increase accuracy, I was refining the v3.1-3.3 models and lowering the `ctc_min_accuracy` during training-set generation to avoid biasing against highly modified (and hence more poorly basecalled) reads. After retraining, the model has basically learned to ignore the modifications and does considerably better (example in the figure below).

The issue is that with a fixed `ctc_min_accuracy` threshold of 0.99, only ~0.5% of my training data is written out, and this data set is likely biased towards non-representative unmodified reads.
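The scale of the problem is easy to quantify: compare the fraction of reads surviving the fixed 0.99 cutoff against a relaxed one. This small helper is an illustrative sketch, not code from the basecaller:

```python
def retained_fraction(accuracies, threshold):
    """Fraction of reads that would survive a given ctc_min_accuracy."""
    kept = sum(1 for a in accuracies if a >= threshold)
    return kept / len(accuracies)
```

On a modified-DNA run where most reads align at ~0.85-0.97 accuracy, `retained_fraction(accs, 0.99)` drops to a tiny fraction while `retained_fraction(accs, 0.85)` keeps nearly everything, which is exactly the bias described above.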