Closed by tmassingham-ont 7 years ago
Thank you for the clarification! I've added a new section to the conclusions to address this.
Some of my comments are speculative, so I'd be curious for your input: 1) Are ONT training sets all PCRed DNA with no methylation? 2) Would including native methylated DNA in the training set give an easy improvement to accuracy? Or is it more complicated than that?
Finally, I was just curious, what does the 'pirate' mean in the rgr_r94/rgrgr_r94 networks? I've been wondering since I saw this in Scrappie's history: Pirates vs bioinformaticians. I'm not sure who to root for in this competition. On one hand, I am a bioinformatician. But on the other hand, the pirate networks do seem to be doing well. :smile:
We train variously with PCR and native DNA, sometimes just the former. Including native DNA can help in principle, but what a model learns depends on the rate of methylation: e.g. one can imagine a case where methylation occurs at such a low rate in the training data that a model is optimised by simply ignoring this data. On the other hand, yes, a model can learn to recognise methylated/modified squiggle, even if it simply labels the squiggle with canonical bases.
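To put rough numbers on the low-rate case, here is a minimal sketch. It assumes training minimises an average per-base loss; the function name and all the values are purely illustrative, not taken from any real training run.

```python
def methylated_loss_share(rate, loss_canonical, loss_methylated):
    """Fraction of the expected per-base loss that comes from methylated bases.

    rate            -- fraction of training bases that are methylated (illustrative)
    loss_canonical  -- average loss on canonical bases (illustrative)
    loss_methylated -- average loss on methylated bases (illustrative)
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    # Expected loss is a rate-weighted mix of the two kinds of position.
    total = (1.0 - rate) * loss_canonical + rate * loss_methylated
    return rate * loss_methylated / total


# Hypothetical numbers: 1 base in 1000 is methylated, and the model is
# ten times worse on those bases than on canonical ones.
share = methylated_loss_share(rate=0.001, loss_canonical=0.1, loss_methylated=1.0)
print(f"{share:.1%} of the loss comes from methylated bases")
```

With these made-up values, methylated bases account for only about 1% of the total loss, so an optimiser gains very little by learning them. Raising the rate in the training set is what makes them worth modelling.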
Pirate is simply a reference to the fact that in trying to pronounce rgrgr one invariably sounds like a pirate.
Thanks for the quick response - I've updated the text again. And I like the pirate explanation, appropriate seeing as how yesterday was Talk Like a Pirate Day!
I wonder if the ideal solution is to have lots of different trained models included with Albacore. Off the top of my head: human_pcr, human_native, ecoli_pcr, ecoli_native, mixed. It could default to the mixed model but you could manually choose a model which best fits your data (something like `--model ecoli_native`) to get the best accuracy. Or (and this is wildly speculative, as I have no experience working with neural networks) could Albacore try each model and somehow automatically figure out which one 'fits' best with the data?
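To make the 'try each model' idea a bit more concrete, here is a minimal sketch of one way it could work: basecall a small sample of reads with every model and keep the one with the highest mean score. Everything here is hypothetical. `pick_best_model` and the callable interface are invented for illustration and are not part of Albacore or Scrappie, and it assumes each model can report a per-read score that is comparable across models, which may well not hold in practice.

```python
from statistics import mean


def pick_best_model(models, reads, sample_size=100):
    """Return (best_model_name, mean_score_per_model).

    models -- dict mapping a model name to a callable that takes one read
              and returns (sequence, score); a hypothetical interface
    reads  -- list of raw reads; only the first `sample_size` are used
    """
    sample = reads[:sample_size]
    if not models or not sample:
        raise ValueError("need at least one model and one read")

    # Basecall the same sample with every model and average the scores.
    scores = {
        name: mean(call(read)[1] for read in sample)
        for name, call in models.items()
    }
    # The model with the highest mean score is taken as the best 'fit'.
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Stand-in models that return fixed scores, purely for demonstration.
    fake_models = {
        "mixed": lambda read: ("ACGT", 9.0),
        "ecoli_native": lambda read: ("ACGT", 11.5),
    }
    best, scores = pick_best_model(fake_models, reads=[[0.1, 0.2], [0.3, 0.4]])
    print(best, scores)
```

Sampling keeps the extra cost small compared with basecalling the whole run several times, but the choice is only as good as the score used to compare the models.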
And regarding methylation, I suppose my preferred behaviour is what you described: e.g. labelling a 5mC as just C. Actually labelling bases as methylated is cool, but that feels like a separate issue, and I would probably only want those labels if I explicitly asked for them.
We've upgraded the two pirate networks (rgrgr_r94, rgrgr_r95) in Scrappie 1.1.1 to versions trained on a more balanced set of genomes.
Thanks - I'll give it a try!
I've just updated the repo with new results, including Scrappie v1.1.1.
Versions 1.1.0 and 1.1.1 also provide an interesting case to show the difference a training set can make, so I added a section comparing the two.
Thanks again!
Great work! Thank you very much.
As of the current releases, Albacore (2.0.2) and Scrappie (1.1.0) are at parity in their basecalling technology, Albacore using the rgrgr ("pirate") network. I suspect the reason for the apparent regression between the rgr_r94 and rgrgr_r94 networks in Scrappie is that the rgrgr_r94 model included was trained from a human-only data set, rather than a mixed set of genomes. As you said, Scrappie is a technology demonstrator and Albacore should be favoured for most uses.
While Nanonet is the only caller that includes the ability to retrain its own networks, Scrappie and Albacore can use networks trained by our open source Sloika project.