rrwick / Basecalling-comparison

A comparison of different Oxford Nanopore basecallers
GNU General Public License v3.0

Regression between Scrappie rgr_r94 and rgrgr_r94 #1

Closed tmassingham-ont closed 7 years ago

tmassingham-ont commented 7 years ago

Great work! Thank you very much.

As of the current releases, Albacore (2.0.2) and Scrappie (1.1.0) are at parity in their basecalling technology, Albacore using the rgrgr ("pirate") network. I suspect the reason for the apparent regression between the rgr_r94 and rgrgr_r94 networks in Scrappie is that the rgrgr_r94 model included was trained from a human-only data set, rather than a mixed set of genomes. As you said, Scrappie is a technology demonstrator and Albacore should be favoured for most uses.

While Nanonet is the only caller that includes the ability to retrain its own networks, Scrappie and Albacore can use networks trained by our open source Sloika project.

rrwick commented 7 years ago

Thank you for the clarification! I've added a new section to the conclusions to address this.

Some of my comments are speculative, so I'd be curious for your input: 1) Are ONT training sets all PCRed DNA with no methylation? 2) Would including native methylated DNA in the training set give an easy improvement to accuracy? Or is it more complicated than that?

Finally, I was just curious, what does the 'pirate' mean in the rgr_r94/rgrgr_r94 networks? I've been wondering since I saw this in Scrappie's history: Pirates vs bioinformaticians. I'm not sure who to root for in this competition. On one hand, I am a bioinformatician. But on the other hand, the pirate networks do seem to be doing well. :smile:

cjw85 commented 7 years ago

We train variously with PCR and native DNA, sometimes just the former. Including native DNA can in principle help, but what a model learns depends on the rate of methylation: e.g. one can imagine a case where methylation occurs at such a low rate in the training data that a model is optimised by simply ignoring this data. On the other hand, yes, a model can learn to recognise methylated/modified squiggle, even if it simply labels the squiggle with canonical bases.
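As a rough illustration of the low-rate case, here is a toy calculation in Python. It is only a sketch: the function name and every number used with it are made up for illustration and are not taken from any ONT training run. It treats the overall per-base error as a weighted average of the error on canonical signal and the error on modified signal, so when modified signal is rare, a model that handles it badly still loses very little overall.

```python
def overall_error(p_modified: float, err_canonical: float, err_modified: float) -> float:
    """Weighted-average error rate over canonical and modified signal.

    p_modified    -- fraction of training signal that comes from modified bases
    err_canonical -- error rate of the model on canonical signal
    err_modified  -- error rate of the model on modified signal
    All three values are hypothetical inputs for this toy example.
    """
    if not 0.0 <= p_modified <= 1.0:
        raise ValueError("p_modified must be between 0 and 1")
    return (1.0 - p_modified) * err_canonical + p_modified * err_modified


def cost_of_ignoring(p_modified: float, err_canonical: float, err_modified: float) -> float:
    """Extra overall error incurred compared with a model that is as good on
    modified signal as it is on canonical signal."""
    return overall_error(p_modified, err_canonical, err_modified) - err_canonical
```

With toy inputs, `cost_of_ignoring(0.001, 0.10, 0.50)` is far smaller than `cost_of_ignoring(0.20, 0.10, 0.50)`, which is the sense in which a rarely seen modification gives the optimiser little incentive to learn it.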

Pirate is simply a reference to the fact that in trying to pronounce rgrgr one invariably sounds like a pirate.

rrwick commented 7 years ago

Thanks for the quick response - I've updated the text again. And I like the pirate explanation, appropriate seeing as how yesterday was Talk Like a Pirate Day!

I wonder if the ideal solution is to have lots of different trained models included with Albacore. Off the top of my head: human_pcr, human_native, ecoli_pcr, ecoli_native, mixed. It could default to the mixed model but you could manually choose a model which best fits your data (something like --model ecoli_native) to get the best accuracy. Or (and this is wildly speculative, as I have no experience working with neural networks) could Albacore try each model and somehow automatically figure out which one 'fits' best with the data?
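To make the "try each model" idea concrete, here is a minimal Python sketch. Everything in it is hypothetical: Albacore has no such function, the `--model` option above is only a suggestion, and the per-read "fit" score is assumed to come from somewhere (for example, some confidence value a network reports for its own basecall). The sketch scores a small sample of reads with each candidate model, averages the scores, and falls back to the default model unless another one is better by a clear margin.

```python
from statistics import mean
from typing import Callable, Dict, Sequence

# Hypothetical: a scorer takes one read's raw signal and returns a fit score,
# where higher means the model explains the signal better.
Scorer = Callable[[Sequence[float]], float]


def pick_model(
    reads: Sequence[Sequence[float]],
    scorers: Dict[str, Scorer],
    default: str = "mixed",
    min_gain: float = 0.0,
) -> str:
    """Return the name of the model with the best mean score on a read sample.

    The default model is kept unless the best candidate beats it by more
    than min_gain, so noisy samples do not flip the choice.
    """
    if not reads or not scorers:
        return default

    # Mean fit score of each candidate model over the sampled reads.
    averages = {name: mean(score(read) for read in reads)
                for name, score in scorers.items()}
    best = max(averages, key=averages.get)

    if default in averages and averages[best] - averages[default] <= min_gain:
        return default
    return best
```

In practice the sample would be a few hundred reads taken before the full run, and how to define a score that is comparable across models is the hard part that this sketch leaves open.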

rrwick commented 7 years ago

And regarding methylation, I suppose my preferred behaviour is what you described: e.g. labelling a 5mC as just C. Actually labelling bases as methylated is cool, but that feels like a separate issue, and I would probably only want those labels if I explicitly asked for them.

tmassingham-ont commented 7 years ago

We've upgraded the two pirate networks (rgrgr_r94, rgrgr_r95) in Scrappie 1.1.1 to versions trained on a more balanced set of genomes.

rrwick commented 7 years ago

Thanks - I'll give it a try!

rrwick commented 7 years ago

I've just updated the repo with new results, including Scrappie v1.1.1.

Versions 1.1.0 and 1.1.1 also provide an interesting case showing the difference a training set can make, so I added a section comparing the two.

Thanks again!