nanoporetech / taiyaki

Training models for basecalling Oxford Nanopore reads
https://nanoporetech.com/

Two questions about the results #48

Closed weir12 closed 5 years ago

weir12 commented 5 years ago

Hi:

  1. I find that the basecalls.fa generated by basecall.py and the model_final checkpoint is completely unable to map to the reference sequence files (0% mapping ratio). However, the fastq file basecalled by guppy_basecaller with model.json (derived from the same model_final checkpoint) aligns almost completely to the reference (99.89% mapping ratio), just like the problem I found before: https://github.com/nanoporetech/taiyaki/issues/38#issue-477469064

  2. In basecalls.hdf5, which contains the information about the presence of modifications, I observed that the distribution of modified bases is extremely uneven. Some reads contain dozens of consecutive modified bases with high probability.
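For anyone debugging a similar pattern, here is a minimal sketch for flagging long runs of consecutive high-probability modified bases. The probabilities below are synthetic; in practice they would be read from basecalls.hdf5 (e.g. with h5py), whose exact layout is not shown in this thread, and the threshold and run-length cutoff are arbitrary choices:

```python
# Flag suspicious runs of consecutive high-probability modified bases.
from itertools import groupby

def mod_runs(probs, threshold=0.9, min_len=5):
    """Return (start, length) for runs where prob >= threshold over >= min_len bases."""
    runs, pos = [], 0
    for is_mod, group in groupby(p >= threshold for p in probs):
        length = sum(1 for _ in group)
        if is_mod and length >= min_len:
            runs.append((pos, length))
        pos += length
    return runs

# Synthetic per-base modification probabilities: one long high-probability run.
probs = [0.1] * 10 + [0.95] * 12 + [0.2] * 5
print(mod_runs(probs))  # [(10, 12)]
```

Many long runs like this across reads would suggest a systematic problem (e.g. a training or alignment artefact) rather than genuinely clustered modifications.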

Update: the sequence from basecall.py needs to be reversed if the sequencing method is direct RNA. The second problem was a parallel-computation error in my custom script. Both are now settled.

callumparr commented 5 years ago

Are you training on RNA modifications? Also, how did you generate the RNA, and what ratio of modified nucleotide did you use?

weir12 commented 5 years ago

Hypoxanthine, a kind of modification of adenosine in RNA.

callumparr commented 5 years ago

> Hypoxanthine, a kind of modification of adenosine in RNA.

Is every adenosine modified, or is it a mixture in each strand?

I have had some problems remapping reads because of the low basecall quality when using 100% replacement.

weir12 commented 5 years ago

A small amount of adenosine was modified in each read, so I'm worried about the size of the training set. What's more, the sequences (basecalls.fa) basecalled by the trained modified-base model cannot be mapped to the reference sequences.
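As a rough sanity check on how sparse the modification labels are in a ground-truth set, something like the sketch below can help. The single-letter 'I' label for hypoxanthine and the per-read reference strings are illustrative assumptions, not conventions taken from taiyaki or this thread:

```python
# Estimate what fraction of adenosine positions carry the modification label.
def mod_fraction(references, canonical='A', mod='I'):
    """Fraction of (canonical + modified) positions labelled as modified."""
    n_mod = sum(ref.count(mod) for ref in references)
    n_can = sum(ref.count(canonical) for ref in references)
    total = n_mod + n_can
    return n_mod / total if total else 0.0

# Toy per-read reference strings with 'I' marking hypoxanthine.
refs = ["ACGIACGT", "AICGTACG"]
print(mod_fraction(refs))  # 2 modified out of 6 A positions -> 0.333...
```

A very low fraction means the model sees few positive examples per chunk, which is one reason a small, sparsely modified training set is a concern.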

callumparr commented 5 years ago

Yeah, RNA seems a lot more challenging because of the ground-truth issue.

To generate the ground-truth fasta, how did you know which adenosines to switch to your modification? Did tombo tell you where bases differ from canonical? I am still struggling to work out tombo.

Did the hypoxanthine cause a lot of deviation in the signal dwell? I think this may cause a lot of problems during training by losing lots of chunks to analyze. The model I made was no good and just output lots of repetitive sequences. I'm hoping that relaxing the stringency on dwell signal may help to train on more chunks.

weir12 commented 5 years ago

Yes, Tombo can detect modifications in three ways, and the per_read_statistics it generates records potential modifications per base per read. But I'm just using Tombo's API to extract the signal I want, rather than using the detection methods Tombo provides.

During my training process, most of the chunks passed the filters (46 passed, 0 empty sequence).

Besides the trouble with the ground truth, the pre-trained basecaller model, which is used to determine the alignment between the raw signal and the reference sequence, also has a big impact on the results in my view. I'm training my own canonical-base model to replace the one taiyaki provides by default (r941_rna_minion.checkpoint).

myrtlecat commented 5 years ago

If the model is for RNA then you will have to reverse the calls from basecall.py to match the output of guppy (note: just reverse, not reverse complement). This is because of differences in chemistry between the RNA and DNA kits.
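A minimal sketch of that workaround, i.e. a plain string reversal with no complementing:

```python
def fix_rna_call(seq):
    """Direct-RNA strands are sequenced 3'->5'; reverse the call (do NOT complement)."""
    return seq[::-1]

print(fix_rna_call("AUGGC"))  # "CGGUA"
```

Applying this to each record in basecalls.fa before mapping should reproduce what guppy's RNA configuration files do automatically.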

The next release of taiyaki will include a flag to enable this option; guppy handles it through its configuration files.

weir12 commented 5 years ago

@myrtlecat Thanks! I will pay attention to this difference later.