nanoporetech / taiyaki

Training models for basecalling Oxford Nanopore reads
https://nanoporetech.com/

Size of training set? #5

Open zimoun opened 5 years ago

zimoun commented 5 years ago

Dear,

I am just leaving the Nanopore Day in Bordeaux, where Stephen Rudd presented the release of this promising tool.

What should the size of the training set be? How many reads of the same reference?

I have a couple of experiments where I am able to link the reads 1-to-1 with the real expected sequence. Before writing some code to clean my data and format it so that taiyaki accepts it, I would like to know whether I can expect any improvement. :-)

Thank you in advance for any comments.

myrtlecat commented 5 years ago

Hello, and thanks for your interest in the project.

You are asking very good questions, but unfortunately it is difficult to give simple answers. It will depend very much on your sample prep and the aims of your project.

It's worth repeating that taiyaki is a tool for research. For "normal" DNA or RNA samples, most users should just use the models that ship with Guppy. We think 2 main groups of people will benefit from taiyaki:

  1. People with an interest in machine learning, who would like to research ways to improve over our model training methods.
  2. People using nanopore sequencing to analyse something "unusual" that the default models in Guppy don't handle well (e.g. chemically modified bases, DNA extracted from aliens, synthetic XNAs, etc...)

I can offer some general guidelines, but it would help me give more specific advice if you could answer a few questions about your project:

If you don't want to give too many details publicly then you can drop me an email and we can discuss it further in private (joe.harvey@nanoporetech.com).

Now follows some generic advice on training set size:

Training set size

There are two important measures of training set size:

  1. The total number of bases
  2. The total amount of unique reference sequence

The first measure is always at least as big as the second, e.g. if you have 1000x coverage of a 1 kbase amplicon then you have 1 Mbase of total sequence but only 1 kbase of unique reference sequence.

We think the second measure (unique reference sequence) is more important for training a general basecaller. If you have too little data there is a risk that the model will overfit to the reference sequence. We are not sure what the lower limit is, and there is a lot of literature on techniques to avoid overfitting, but you are probably OK if you have at least 1 Mbase. Another kind of overfitting can occur if the composition of your training set is biased in some way (e.g. extreme GC content): the model might then not generalise very well. For comparison, the models released in Guppy are trained with hundreds of Mbases of unique reference sequence from a variety of organisms.
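To make the distinction between the two measures concrete, here is a minimal, self-contained Python sketch (not part of taiyaki; the read names and sequences are made up). Deduplicating by exact sequence is a simplification that works for amplicons; for reads drawn from a genome, the unique reference length is better taken from the reference itself.

```python
# Toy illustration of the two measures of training set size.
# "refs" maps read id -> the reference sequence each read was aligned to.
refs = {f"read_{i}": "ACGT" * 250 for i in range(1000)}  # 1000 reads of the same 1 kbase amplicon

total_bases = sum(len(seq) for seq in refs.values())            # measure 1: total bases
unique_ref_bases = sum(len(seq) for seq in set(refs.values()))  # measure 2: unique reference sequence

print(f"total bases:            {total_bases}")       # 1000000 (1 Mbase)
print(f"unique reference bases: {unique_ref_bases}")  # 1000 (1 kbase)
```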

If you have only a small amount of training data to start with, then you might have to employ some other method to bootstrap your way to a larger training set. If that is the case then we can offer further guidance.

zimoun commented 5 years ago

Dear, I have sent the details by private email.

Say I fall into category 1: it is DNA, I can already basecall the data, and the accuracy is "good", i.e. the same as people report. \o/

Thank you for your explanations.

Is the list of organisms used to train Guppy available anywhere?

Say the Guppy architecture uses a window of k bases (CNN or other), e.g. fix k=5 so that the base at position n uses the 4 bases at positions n-2, n-1, n+1, n+2 (one way or another). Then there are 4**k possible k-mers, here 1024. But if the "effect" extends further than 2 bases on each side (a 5-mer), say 3 bases (a 7-mer), then there are 16384 possible k-mers. And so on.
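Just to sanity-check those numbers with plain Python (nothing Guppy-specific is assumed here): a window that sees w bases on each side of the central base corresponds to a (2w + 1)-mer, and there are 4**k possible DNA k-mers.

```python
# Count possible DNA k-mers as the context window grows.
for w in (2, 3, 4):
    k = 2 * w + 1
    print(f"+/-{w} bases of context -> {k}-mers -> {4 ** k} possible k-mers")
# +/-2 bases of context -> 5-mers -> 1024 possible k-mers
# +/-3 bases of context -> 7-mers -> 16384 possible k-mers
# +/-4 bases of context -> 9-mers -> 262144 possible k-mers
```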

Some k-mers are more represented in some organisms than in others. Since the company will not release the basecaller architecture (and I understand why), knowing the list of organisms used to train Guppy would help when trying to understand the origin of errors.