nanoporetech / taiyaki

Training models for basecalling Oxford Nanopore reads
https://nanoporetech.com/

Beginner Questions #82

Closed by techsavy12 4 years ago

techsavy12 commented 4 years ago

Hello, I’m a beginner in this field and I have some questions about taiyaki. I would really appreciate it if you could help me out.

  1. What is the main purpose of taiyaki? If it is just to basecall, can’t we just use guppy and be done? Why are we doing BAM of Mapped Basecalls, Extract per-read Reference, Create Per Read Scaling parameters, training a model, and then using guppy again? [Walkthrough]
    • Are we supposed to use one set of our data to train and then apply the trained model to other sets of data?
  2. In the modified base walkthrough, under creating the mapped read file, we use pretrained/r941_dna_minion.checkpoint. Are we supposed to create this file when we run our own data, or is it built in and we just pass it in? If we are supposed to create it, how do we do that?
  3. Do I have to pay to install guppy? If not, could you let me know how to get it? I keep being redirected to the Nanopore site, where it tells me I have to pay to download the software.

Once again, your help is very much appreciated.

aevansNP commented 4 years ago

Hello - I'll try to answer your questions one by one:

> What is the main purpose of taiyaki? If it is just to basecall, can’t we just use guppy and be done? Why are we doing BAM of Mapped Basecalls, Extract per-read Reference, Create Per Read Scaling parameters, training a model, and then using guppy again? [Walkthrough]

For many applications, you can just use Guppy and be done. Taiyaki is the tool we use to train the neural-network basecalling models that Guppy uses.

You only need Taiyaki if you want to train a basecalling model that differs from the ones already available. An example might be DNA that contains modified bases not present in our training data. In some cases, training a model for a specific target organism may be useful. However, our models are trained on a wide panel of prokaryotic and eukaryotic data, and training on a narrower range has possible pitfalls, such as overfitting.

Taiyaki is released as a research-level tool, and you will probably find that it takes a significant investment of time and effort to get better results than the pre-trained models give you.

> Are we supposed to use one set of our data to train and then apply the trained model to other sets of data?

It depends what you want to achieve. As always in machine learning (or any sort of model fitting), if you train a model and then test it on the training data, that will give you a misleadingly optimistic view of its performance. It's normal to 'hold out' some of the data for testing.
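As a minimal sketch of the hold-out idea (this is plain Python, not part of Taiyaki; the function name `split_read_ids` and the read IDs are hypothetical and only for illustration), you could split your read IDs once, train only on the first group, and evaluate only on the second:

```python
import random


def split_read_ids(read_ids, holdout_fraction=0.1, seed=0):
    """Shuffle read IDs reproducibly and return (train_ids, holdout_ids)."""
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be strictly between 0 and 1")
    # De-duplicate and sort first so the result depends only on the seed,
    # not on the order in which the IDs were supplied.
    ids = sorted(set(read_ids))
    random.Random(seed).shuffle(ids)
    n_holdout = max(1, round(len(ids) * holdout_fraction))
    return ids[n_holdout:], ids[:n_holdout]


# Hypothetical read IDs; in practice these would come from your own data.
read_ids = [f"read_{i:04d}" for i in range(100)]
train_ids, holdout_ids = split_read_ids(read_ids, holdout_fraction=0.2)
```

Fixing the seed keeps the split stable between runs, so reads used for testing never leak into a later training run.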

> In the modified base walkthrough, under creating the mapped read file, we use pretrained/r941_dna_minion.checkpoint. Are we supposed to create this file when we run our own data, or is it built in and we just pass it in? If we are supposed to create it, how do we do that?

The checkpoint file is in the tar archive, along with the other data you need for the walkthrough. See the link near the top of the modified-base walkthrough.

> Do I have to pay to install guppy? If not, could you let me know how to get it? I keep being redirected to the Nanopore site, where it tells me I have to pay to download the software. Once again, your help is very much appreciated.

Basecalling software can be downloaded here. You may have to sign up to the 'Nanopore Community' to access this page. If you have problems, there is support available here.

techsavy12 commented 4 years ago

Thank You!