weir12 closed this issue 5 years ago
We agree, and this would be consistent with results reported in #29. Would you share the model.log
from your training directory and the command line you used to basecall please?
Sure, here are the logs of two training rounds.
round1_model.log
round2_model.log
basecall cmd: basecall.py --device 0 --alphabet ACGU --modified_base_output basecalls.hdf5 ${training_reads} tranning8.5_2/model_final.checkpoint > basecalls.fa 2> basecall.log
I didn't follow the instructions to use get_refs_from_sam.py to extract per-read references.
Instead, I used tombo to re-squiggle the signal of the direct RNA reads against the transcriptome. Then I extracted the aligned sequence (stored in an HDF5 group written by tombo) from the fast5 files to prepare the per-read references. Finally, I reversed the sequences to match the orientation of direct RNA sequencing. I don't know if this will cause any problems.
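For reference, the extraction step was roughly the sketch below (this assumes tombo's default corrected-group name RawGenomeCorrected_000 and single-read fast5 files; the group path may differ depending on how re-squiggle was run):

```python
# Minimal sketch of my extraction step (assumes the default tombo
# corrected-group path and single-read fast5 layout; adjust to taste).
import h5py

TOMBO_GROUP = "Analyses/RawGenomeCorrected_000/BaseCalled_template"

def tombo_reference(fast5_path):
    """Rebuild the reference sequence tombo assigned to one read."""
    with h5py.File(fast5_path, "r") as f5:
        events = f5[TOMBO_GROUP + "/Events"][()]
        # Each event carries one reference base in the 'base' column.
        seq = b"".join(events["base"]).decode()
    # Reverse (no complement) so the reference runs in the same
    # direction as the direct RNA signal.
    return seq[::-1]
```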
thanks
Hi:
By comparing the reference sequence of each read, I found only subtle differences between the two approaches.
Reversing (not complementing) the per-read references generated by get_refs_from_sam.py and replacing T with U gives exactly the result of my custom method, so I think the two approaches are equivalent in essence.
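This is roughly how I checked the equivalence (the file names are placeholders for my own outputs, not fixed taiyaki names):

```python
# Compare the two sets of per-read references: reverse (no complement)
# the get_refs_from_sam.py output and swap T for U, then check it
# matches my tombo-derived references.
def to_rna_orientation(seq):
    return seq[::-1].replace("T", "U")

def read_fasta(path):
    seqs, name = {}, None
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].split()[0]
                seqs[name] = []
            elif name is not None:
                seqs[name].append(line.upper())
    return {k: "".join(v) for k, v in seqs.items()}

sam_refs = read_fasta("modbase_references.fasta")  # from get_refs_from_sam.py
my_refs = read_fasta("tombo_references.fasta")     # from my tombo workflow
shared = sam_refs.keys() & my_refs.keys()
matches = sum(to_rna_orientation(sam_refs[r]) == my_refs[r] for r in shared)
print(f"{matches}/{len(shared)} reads identical after reverse + T->U")
```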
However, I found that the true sequences obtained by these two methods are inconsistent with the sequences of the reads themselves (as basecalled by Guppy). Perhaps this is due to the high error rate of Nanopore sequencing?
Update: In fact, the fastq sequences basecalled by the modified base model achieve a high percent identity with the reference transcriptome. The previous confusion was caused by the lack of a comprehensive quantitative analysis.
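By quantitative analysis I mean something along these lines, sketched with the minimap2 python bindings (the transcriptome path, preset and k-mer size are assumptions to adapt to your own data):

```python
# Rough identity check of basecalls against the reference transcriptome.
import mappy

aligner = mappy.Aligner("transcriptome.fa", preset="map-ont", k=14)

def mean_percent_identity(fasta_path):
    idents = []
    for name, seq, _ in mappy.fastx_read(fasta_path):
        hit = next(aligner.map(seq), None)  # take the first hit, if any
        if hit is not None:
            idents.append(hit.mlen / hit.blen)  # matching bases / aligned length
    return 100.0 * sum(idents) / len(idents) if idents else 0.0

print(mean_percent_identity("basecalls.fa"))
```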
Hi: I've finished training a modified base model, and I found a strange result:
basecalls.fa
which is generated by basecall.py with the final checkpoint model, is quite different from
modbase_references.fasta
which I provided myself. In my opinion, the training of the model depends on the reference sequences, so when I use the trained model to basecall the training set, the generated sequences should be close to the reference sequences.
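By "close" I mean something like the crude per-read check below (file names are my own outputs; an aligner-based identity would be more rigorous):

```python
# Crude per-read similarity between the model's basecalls and my own
# per-read references, using Biopython for parsing and difflib as a
# rough similarity measure.
from difflib import SequenceMatcher
from Bio import SeqIO

calls = {rec.id: str(rec.seq).upper() for rec in SeqIO.parse("basecalls.fa", "fasta")}
refs = {rec.id: str(rec.seq).upper() for rec in SeqIO.parse("modbase_references.fasta", "fasta")}

for read_id in sorted(calls.keys() & refs.keys()):
    ratio = SequenceMatcher(None, calls[read_id], refs[read_id]).ratio()
    print(read_id, f"{100 * ratio:.1f}%")
```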