Bad Results for Basecalling after Training own Model

Hi!

Describe the bug

After training my own model with about 5000 reads, I get the following results when I use the model for basecalling: none of the resulting FASTA files gives a useful result.

To Reproduce

I am using the data from https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1727-y#Bib1 and take about 5000 raw fast5 reads for training. To prepare them for training, I first run tombo preprocess annotate_raw_with_fastqs with the FASTQ file created by guppy, and then tombo resquiggle with the reference genome for labeling. Then I run chiron export and chiron train with the following command:

nohup chiron train -i data/Klebsiella_pneumoniae_INF032_fast5s_chiron/test -o model -m DNA --retrain

DNA is a copy of the DNA_default model, but I also tried training a completely new model, with the same results. Training finishes with the following message: Model model/DNA saved.

Then I run the basecalling:

nohup chiron call -i /mnt/data2/bmestu/goessv/data/Klebsiella_pneumoniae_INF032_fast5s_chiron -o /mnt/data2/bmestu/goessv/chiron_output_train -e fasta -m model/DNA --batch_size 1000

I have already tried it with several numbers of reads (100, 1000, 5000) and see no improvement. I checked that Tombo takes the correct bases for labeling, and when I check the raw folder in the results, it also contains the correct ones, matching what I see in the fast5 files.
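In case it helps to put numbers on "no useful result", here is a minimal sketch of how the FASTA output could be summarized per read (length, base fractions, longest homopolymer run). It is not part of Chiron: the helper names read_fasta and summarize are hypothetical, and it only assumes that the output files are plain FASTA with ">" header lines.

```python
import sys
from collections import Counter
from itertools import groupby


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def summarize(seq):
    """Return length, A/C/G/T fractions and longest homopolymer run of one read."""
    counts = Counter(seq)
    # groupby collapses consecutive identical bases into runs.
    longest = max((len(list(run)) for _, run in groupby(seq)), default=0)
    fractions = {b: counts[b] / len(seq) for b in "ACGT"} if seq else {}
    return {"length": len(seq), "fractions": fractions, "longest_run": longest}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python fasta_summary.py some_read.fasta
    with open(sys.argv[1]) as handle:
        for name, seq in read_fasta(handle):
            print(name, summarize(seq))
```

Very short reads, a single dominating base, or very long homopolymer runs would suggest that the model collapsed during training rather than that the calling step is misconfigured.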

Am I making an obvious mistake? Or have you ever seen something like this? If you need more information, do not hesitate to ask.

Thanks in advance!
Veronika

Environment (please complete the following information):

OS: Ubuntu 20.04.1 LTS
GPU: CPU only
Chiron Version: Chiron version 0.6.1.1
Tensorflow Version: 1.15.0 cpu-only
Python Version: 3.7.0
I am using venv

haotianteng / Chiron

Bad Results for Basecalling after Training own Model #116