Reads reported by taiyaki basecall.py are close to random while guppy with same model works as expected

lpryszcz commented 5 years ago

Hi, I've observed weird behaviour of basecall.py trained on RNA reads with modifications - the accuracy of base-calling is close to random (1). Interestingly, Guppy works pretty well with the same model (2), better than with the default model shipped with Guppy (3). I'm expecting reads with lengths around 2.2-2.7 kb. Can you please explain this? I guess it makes no sense to look at modifications in hdf5 file produced by basecall.py, since the sequence prediction isn't accurate, right?

basecall.py from taiyaki v4.1.0 using new model

basecall.py --device cuda:0 --alphabet ACGU --modified_base_output taiyaki.$s/$n.hdf5 $d $s.train2/model_final.checkpoint >  taiyaki.$s/$n.fa

Guppy 3.1.5 using the same model trained with taiyaki v4.1.0

m=$s.train2/model_final.checkpoint; dump_json.py $m > $m.json
guppy_basecaller --device cuda:0 -c rna_r9.4.1_70bps_hac.cfg -m $m.json --compress_fastq --fast5_out -ri $d -s guppy3.$s/$n

Guppy 3.1.5 using default RNA model

guppy_basecaller --device cuda:0 -c rna_r9.4.1_70bps_hac.cfg --num_callers 4 --compress_fastq --fast5_out -ri $d -s guppy3/$d

myrtlecat commented 5 years ago

Remember you reversed the references for training? You might have to reverse the basecalls as well. Guppy has a flag for this and I think the default RNA config file sets this for you (rna_r9.4.1_70bps_hac.cfg has a line reverse_sequence = 1).

Currently basecall.py doesn't have a similar option but we should add one 😳

lpryszcz commented 5 years ago

Ok, reversing helps, but still the accuracy is much lower than with guppy3. Also reads are much shorter than from guppy3, which I find a bit odd. I guess guppy3 uses some parameters that boost the quality of basecalling. Is there any way of improving the accuracy for basecall.py (ideally by using guppy3 config file?)? Are there any plans for enabling guppy3 to report modified bases?

basecall.py from taiyaki v4.1.0 using new model with reversed sequeces

myrtlecat commented 5 years ago

Mod-base support is planned for a future guppy release (expected this July, see this plenary talk from London Calling around the 23 minute mark for the announcement)

I'm puzzled. basecall.py and guppy should produce very similar results (given the same model). I'm not sure what could have gone wrong... Are you you able to share your model_final.checkpoint and a few reads to help us investigate what's going on?

lpryszcz commented 5 years ago

Mod-base support is planned for a future guppy release (expected this July, see this plenary talk from London Calling around the 23 minute mark for the announcement)

That awesome news!

I'm puzzled. basecall.py and guppy should produce very similar results (given the same model). I'm not sure what could have gone wrong... Are you you able to share your model_final.checkpoint and a few reads to help us investigate what's going on?

Yes, although I'd prefer to do it privately. Could you please write me: lpryszcz at crg.es. cheers!

tmassingham-ont commented 5 years ago

Hello. I suspect that the difference in base call accuracy is due to the different way that the signal is chunked up between Guppy and Taiyaki: Taiyaki is parameterised in terms of input chunk size, whereas Guppy takes the stride of the network into account. For DNA, where the stride is low (2 or 3), this makes little difference but the stride for RNA is 12.

The way to test this would be to add --chunk_size 12000 to your command line. If this works, I’ll push a patch to make Taiyaki’s behaviour more consistent with Guppy.

NB: The base caller in Taiyaki exists so we can do basecalling for models that Guppy doesn’t support. If your model is compatible, base calling with Guppy would always be the preferred option.

tmassingham-ont commented 5 years ago

Beautiful plots, BTW.

myrtlecat commented 5 years ago

According to the guppy devs, mod-bases should already be working with guppy 3.1. If you use the --fast5_out option, guppy will write modification probabilities back to the fast5s that you basecalled. There aren't any tools yet, so you will have to extract this information yourself using hdf5 or h5py. You are looking for a dataset called ModBaseProbs. Please open another issue if you need more help with guppy's mod-bases output.

nanoporetech / taiyaki

Reads reported by taiyaki basecall.py are close to random while guppy with same model works as expected #26