nanoporetech / bonito

A PyTorch Basecaller for Oxford Nanopore Reads
https://nanoporetech.com/

RNA model for bonito #16

Open CDieterich opened 4 years ago

CDieterich commented 4 years ago

Dear developers,

Would you be able to provide an RNA model for bonito somewhere?

bonito basecaller rna_r9.4.1 /data/reads > basecalls.fasta

Thank you Christoph

iiSeymour commented 4 years ago

Hi @CDieterich

I have not trained an RNA model yet, I will update this issue if things change.

Regards

CDieterich commented 4 years ago

Excellent.

CDieterich commented 4 years ago

BTW, is there any manual for doing the training myself?

callumparr commented 4 years ago

I'd also be interested in any documentation on using bonito train. Is it a similar process to taiyaki? From what I understood from the Nanopore Community meeting when Clive gave a talk, the structure is simpler?

iiSeymour commented 4 years ago

I think it should be straightforward, if not, let me know.

First, make sure you have the training data downloaded.

$ bonito download --training

Then run bonito train and give it an output directory.

$ bonito train model-train-dir
[loading data]
[loading model]
[990000/990000]: 100%|#########################################| [1:23:46, loss=0.2546]
[epoch 1] directory=model-train-dir loss=0.2496 mean_acc=92.351% median_acc=93.035%
[990000/990000]: 100%|#########################################| [1:23:40, loss=0.2010]
[epoch 2] directory=model-train-dir loss=0.2201 mean_acc=93.310% median_acc=94.000%
[990000/990000]: 100%|#########################################| [1:23:41, loss=0.2255]
[epoch 3] directory=model-train-dir loss=0.2038 mean_acc=93.847% median_acc=94.527%
[990000/990000]: 100%|#########################################| [1:23:40, loss=0.2018]
[epoch 4] directory=model-train-dir loss=0.1964 mean_acc=94.090% median_acc=94.608%
[990000/990000]: 100%|#########################################| [1:23:32, loss=0.2001]
[epoch 5] directory=model-train-dir loss=0.1899 mean_acc=94.318% median_acc=95.025%
[990000/990000]: 100%|#########################################| [1:23:32, loss=0.1862]
[epoch 6] directory=model-train-dir loss=0.1871 mean_acc=94.383% median_acc=95.025%
[990000/990000]: 100%|#########################################| [1:23:31, loss=0.1678]
[epoch 7] directory=model-train-dir loss=0.1813 mean_acc=94.583% median_acc=95.098%
[990000/990000]: 100%|#########################################| [1:23:41, loss=0.1916]
[epoch 8] directory=model-train-dir loss=0.1793 mean_acc=94.634% median_acc=95.396%
[990000/990000]: 100%|#########################################| [1:23:34, loss=0.1865]
[epoch 9] directory=model-train-dir loss=0.1764 mean_acc=94.755% median_acc=95.500%
[990000/990000]: 100%|#########################################| [1:23:32, loss=0.1565]
[epoch 10] directory=model-train-dir loss=0.1763 mean_acc=94.737% median_acc=95.500%
[990000/990000]: 100%|#########################################| [1:23:32, loss=0.1580]
[epoch 11] directory=model-train-dir loss=0.1739 mean_acc=94.836% median_acc=95.522%
[125184/990000]:  13%|#######                                  | [10:35, loss=0.1572]

By default, training uses 1 million chunks with a 1% validation split. You can see the progress of each epoch over the 990,000 training examples, with the training loss updating after each batch. At the end of each epoch, the validation loss and accuracy are reported.
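Those numbers are consistent: 1% of 1 million chunks held out for validation leaves the 990,000 training examples shown in the log. A minimal sketch of such a hold-out split (generic NumPy, not bonito's actual data loader):

```python
import numpy as np

def split_chunks(n_chunks, valid_frac=0.01, seed=42):
    """Shuffle chunk indices and carve off a validation fraction.

    Illustrative only; bonito's real loader may split differently.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_chunks)
    n_valid = int(n_chunks * valid_frac)
    # First n_valid shuffled indices become the validation set.
    return idx[n_valid:], idx[:n_valid]

train_idx, valid_idx = split_chunks(1_000_000)
print(len(train_idx), len(valid_idx))  # 990000 10000
```

With the default 1% fraction this reproduces the 990,000-example epochs seen above.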

snower2010 commented 4 years ago

I am just wondering: can the bonito download --training command download all the training data needed to train a satisfactory model? Many thanks!

iiSeymour commented 4 years ago

@snower2010 bonito download --training will give you the full training set for dna_r9.4.1 that is used to train the model shipped with bonito. I'm currently only focusing on a single condition.

HTH,

Chris.

snower2010 commented 4 years ago

Got it! Many thanks! By the way, could you also tell me the hardware configuration and the time cost for this specific training? Thanks!

CDieterich commented 4 years ago

OK, I got back to this now.

Any developments on this aspect (RNA modifications) @iiSeymour?

I would be happy to do it myself, provided there is some documentation for training from scratch or from a pretrained model.

Thank you

biobenkj commented 3 years ago

Continuing this thread - would it be worth it if a few of us here put our heads together and attempted training a [direct] RNA model? I know there are boatloads of direct RNA and cDNA data from the human NA12878 runs (https://github.com/nanopore-wgs-consortium/NA12878/tree/master/nanopore-human-transcriptome).