The basecalled sequences are quite different from reference sequences

nanoporetech / taiyaki

Training models for basecalling Oxford Nanopore reads

https://nanoporetech.com/

The basecalled sequences are quite different from reference sequences #38

Closed weir12 closed 5 years ago

weir12 commented 5 years ago

Hi, I've finished training a modified-base model and found a strange result: basecalls.fa, which was generated by basecall.py with the final checkpoint model, is quite different from modbase_references.fasta, which I provided myself. In my opinion, training depends on the reference sequences, so when I use the trained model to basecall the training set, the generated sequences should be close to the reference sequences.

tmassingham-ont commented 5 years ago

We agree, and this would be consistent with results reported in #29. Would you share the model.log from your training directory and the command line you used to basecall please?

weir12 commented 5 years ago

Sure, here are the logs of the two training rounds: round1_model.log, round2_model.log

Basecall command:

basecall.py --device 0 --alphabet ACGU --modified_base_output basecalls.hdf5 ${training_reads} tranning8.5_2/model_final.checkpoint > basecalls.fa 2>basecall.log

I didn't follow the instructions to use get_refs_from_sam.py to extract per-read references. Instead, I used Tombo to re-squiggle the signal of the direct RNA reads against the transcriptome. Then I extracted the aligned sequence (stored in an HDF5 group written by Tombo) from each fast5 file to prepare the per-read references. Finally, I reversed the sequences to match the orientation of direct RNA sequencing. I don't know if this will cause any problems. Thanks.

weir12 commented 5 years ago

Hi, I compared the reference sequence of each read and found only subtle differences between the two approaches. Reversing (not complementing) the per-read references generated by get_refs_from_sam.py and replacing T with U gives the same result as my custom method, so I think the two approaches are essentially equivalent. However, I found that the true sequences obtained by these two methods are inconsistent with the sequences of the reads themselves (generated by Guppy). Perhaps this is due to the high error rate of Nanopore sequencing?
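For anyone reproducing the comparison, the transformation described above (reverse without complementing, then replace T with U) can be sketched in a few lines of Python. The function name `to_direct_rna_reference` is hypothetical and is not part of taiyaki or Tombo; this only illustrates the string operation, not how either tool does it internally.

```python
def to_direct_rna_reference(dna_ref: str) -> str:
    """Reverse (do NOT complement) a reference and replace T with U.

    Hypothetical helper for illustration only; not a taiyaki or Tombo function.
    """
    # Upper-case first so lower-case (soft-masked) bases are handled too.
    reversed_ref = dna_ref.upper()[::-1]
    # Switch from the DNA alphabet (ACGT) to the RNA alphabet (ACGU).
    return reversed_ref.replace("T", "U")


if __name__ == "__main__":
    print(to_direct_rna_reference("ACGT"))
```

Applying this to each per-read reference from get_refs_from_sam.py and comparing the output with the Tombo-derived references is one way to check that the two approaches agree.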

weir12 commented 5 years ago

Update: In fact, the fastq sequences basecalled by the modified-base model achieved high percent identity with the reference transcriptome. The previous confusion was caused by the lack of a comprehensive quantitative analysis.

[image: percent_identity]
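As a rough illustration of what a quantitative check could look like, here is a minimal Python sketch that scores one basecall against its per-read reference. It is not the analysis used for the result above. It assumes percent identity is defined as 1 minus the edit distance divided by the longer sequence length, which is only one of several common definitions, and it assumes both sequences are in the same orientation and alphabet. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, base_a in enumerate(a, start=1):
        curr = [i]
        for j, base_b in enumerate(b, start=1):
            cost = 0 if base_a == base_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def percent_identity(basecall: str, reference: str) -> float:
    """Assumed definition: 100 * (1 - edit_distance / length of the longer sequence)."""
    longest = max(len(basecall), len(reference))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - edit_distance(basecall, reference) / longest)


if __name__ == "__main__":
    # One substitution in four bases.
    print(percent_identity("ACGU", "ACGA"))
```

The dynamic programming here is quadratic in read length, so for a full read set an aligner is the more practical choice; the sketch is only meant to make the metric concrete.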