Obtaining reliable ground-truth

maximilianmordig commented 10 months ago

Hello Marc, thank you for your great work comparing basecallers.

I have a question about whether the (ground-truth) training data is reliable. As you describe in your methods section, given a raw signal, you use an existing basecaller (Guppy) to get a rough basecalled sequence, align it to a known reference genome, and then run tombo resquiggle. Is the purpose of this to correct for errors introduced with the existing basecaller so that you can surpass its performance? How was this done in the early days when no existing basecaller was available? Did they measure each single molecule that went through the pore with some chemical procedure as well to get ground-truth nucleotides?

marcpaga commented 10 months ago

> Is the purpose of this to correct for errors introduced with the existing basecaller so that you can surpass its performance?

Yes, otherwise you would train a model to mimic an existing basecaller.
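To make the idea concrete, here is a toy sketch in plain Python (not the actual pipeline, which relies on a real aligner and tombo resquiggle). It aligns a noisy basecall against a known reference and returns the matching reference span, which is what would serve as the training label. The function name `reference_label`, the scoring values, and the example sequences are made up for illustration.

```python
def reference_label(basecall: str, reference: str) -> str:
    """Return the reference substring that best matches a noisy basecall.

    Toy semi-global alignment: the whole basecall must be aligned, while
    leading and trailing reference bases can be skipped for free.
    Scoring values below are arbitrary illustrative choices.
    """
    match, mismatch, gap = 1, -1, -1
    n, m = len(basecall), len(reference)
    # score[i][j]: best score for basecall[:i] aligned so that it ends at reference[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # start[i][j]: reference index where that best alignment begins
    start = [list(range(m + 1)) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
        start[i][0] = 0
        for j in range(1, m + 1):
            same = basecall[i - 1] == reference[j - 1]
            diag = score[i - 1][j - 1] + (match if same else mismatch)
            up = score[i - 1][j] + gap    # extra base in the basecall
            left = score[i][j - 1] + gap  # reference base missing from the basecall
            best = max(diag, up, left)
            score[i][j] = best
            if best == diag:
                start[i][j] = start[i - 1][j - 1]
            elif best == up:
                start[i][j] = start[i - 1][j]
            else:
                start[i][j] = start[i][j - 1]
    # Trailing reference bases are free: pick the best end position in the last row.
    end = max(range(m + 1), key=lambda j: score[n][j])
    return reference[start[n][end]:end]


# Made-up example: the basecall contains one substitution error.
reference = "TTGACCGTAGGCTAAC"
basecall = "ACCGTTGGCT"
label = reference_label(basecall, reference)
print(basecall, "->", label)
```

The label is taken from the reference, so the substitution error in the basecall does not end up in the training data. The basecaller is only used to find where the read belongs.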

> How was this done in the early days when no existing basecaller was available? Did they measure each single molecule that went through the pore with some chemical procedure as well to get ground-truth nucleotides?

I do not know the answer to this; ONT (Oxford Nanopore Technologies) would be the one to ask.

A naive approach would be to sequence molecules for which the ground truth is already known. For example, one could PCR-amplify a particular DNA sequence and sequence only that. You can also Sanger sequence such amplicons to confirm the ground truth. If you repeat this many times, for different sequences, at some point you obtain enough ground-truth data to train a basecaller.
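As a rough sketch of how such data could be organized (this is a guess at the bookkeeping, not a description of what ONT did), each run contains a single known amplicon, so every raw signal from that run gets the amplicon sequence as its label. `TrainingExample` and `build_training_set` are hypothetical names, and the sketch ignores strand orientation and partial reads.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class TrainingExample:
    signal: List[float]  # raw current samples for one read
    label: str           # known sequence of the amplicon loaded in that run


def build_training_set(
    runs: Dict[str, Sequence[Sequence[float]]]
) -> List[TrainingExample]:
    """Pair every raw signal with the known sequence of its run.

    `runs` maps an amplicon sequence to the raw signals recorded while
    only that amplicon was being sequenced (an assumed data layout).
    """
    examples = []
    for amplicon_sequence, signals in runs.items():
        for signal in signals:
            examples.append(TrainingExample(list(signal), amplicon_sequence))
    return examples


# Made-up numbers, two runs with different amplicons.
runs = {
    "ACGTACGT": [[80.1, 79.5, 102.3], [81.0, 78.8, 101.9]],
    "GGCCTTAA": [[95.2, 96.0, 70.4]],
}
training_set = build_training_set(runs)
```

With enough different amplicons, a set like this would cover many sequence contexts without needing an existing basecaller to produce the labels.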

maximilianmordig commented 10 months ago

Thanks.

marcpaga / basecalling_architectures

Obtaining reliable ground-truth #6