Best practices training from scratch using only modified inputs

nanoporetech / megalodon

Megalodon is a research command line tool to extract high accuracy modified base and sequence variant calls from raw nanopore reads by anchoring the information-rich basecalling neural network output to a reference genome/transcriptome.

Hi Megalodon Team

I am currently training a modified base model from scratch, as no model exists for the C modification in question. Our experimental setup consists of two sequencing runs: one where all Cs are modified (let's call it Run A) and another where none of the Cs are modified (Run B). The reference genome for both experiments is the same.

So far, I have used Run A as the starting point for model training: I generated the signal mappings, trained a model with Taiyaki, and then used that uncalibrated model to evaluate both Run A and Run B. As was perhaps expected, every reference cytosine in both Run A and Run B is classified as modified with a very high probability (>95%). I assume that, since I am training the model on positive samples only, none of the signal that would distinguish modified from unmodified C is being captured.

After calibrating the model using Run B as a control and rerunning Megalodon with the calibrated model, all reference Cs are now classified as canonical (again not surprising, since all Cs were called modified for both Run A and Run B prior to calibration).

Clearly this is not the right approach. Ideally, I could train on both Run A and Run B simultaneously by marking all Cs in reads from Run A as modified and all Cs in reads from Run B as canonical, but as far as I am aware, this is not possible with Megalodon. What is the correct approach here?

Thank you in advance.

The Taiyaki misc/merge_mappedsignalfiles.py script is the command for which you are looking (see here).

This will merge the alphabets and create a new training file with the combination of reads from both runs.
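
Conceptually, the merge takes the union of the two alphabets and re-encodes each read's labels against it, so reads from the fully modified run carry the modified symbol at every C while control reads keep canonical C. Below is a minimal Python sketch of that idea on toy in-memory reads. It does not use Taiyaki's mapped-signal HDF5 format or its API; the modified-base symbol `Z`, the data layout, and the helper names `merge_alphabets` and `reencode` are all assumptions made for illustration.

```python
def merge_alphabets(alphabets):
    """Return the union of several alphabets, preserving first-seen order."""
    merged = []
    for alphabet in alphabets:
        for symbol in alphabet:
            if symbol not in merged:
                merged.append(symbol)
    return "".join(merged)


def reencode(labels, alphabet, merged_alphabet):
    """Map integer labels from a read's own alphabet onto the merged alphabet."""
    return [merged_alphabet.index(alphabet[i]) for i in labels]


# Toy data (hypothetical): integer-encoded reference sequences per read.
# Run A: every C is labelled with the assumed modified symbol "Z".
run_a = {"alphabet": "AZGT", "reads": [[0, 1, 2, 1, 3]]}  # A Z G Z T
# Run B: every C is canonical.
run_b = {"alphabet": "ACGT", "reads": [[0, 1, 2, 1, 3]]}  # A C G C T

# Canonical alphabet first, so modified symbols are appended at the end.
merged_alphabet = merge_alphabets([run_b["alphabet"], run_a["alphabet"]])

merged_reads = []
for run in (run_a, run_b):
    for labels in run["reads"]:
        merged_reads.append(reencode(labels, run["alphabet"], merged_alphabet))

print(merged_alphabet)
print(merged_reads)
```

The resulting training set contains both positive (Run A) and negative (Run B) examples under one alphabet, which is what a model needs in order to learn the difference between the two.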

On another note, I would highly recommend using this training file to train the model with the newer Remora framework for modified base training. This framework is much more likely to work, especially in this case where all bases are modified in the training sample. The Megalodon/flip-flop modified base models trained from fully modified samples have not worked well in the past: the model is able to "learn" whether a read is completely modified or not, and then fails to generalize to reads with partial modification. The Remora framework takes in only small chunks of data (~10 bases of signal) to predict whether a base is modified or not. See more details in my talk at NCM last month. These Remora models can currently be run in Megalodon and Bonito and will be ported into Guppy in the near future.
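
To illustrate why short chunks help: if each training example is only a small window around one candidate base, the classifier cannot rely on read-wide context to tell whether the whole read is modified. The following Python sketch is a toy illustration of that windowing idea only, not Remora's actual data pipeline or API. The function `extract_chunks`, the one-value-per-base `signal`, and the example sequence are assumptions; real nanopore signal has many samples per base.

```python
def extract_chunks(sequence, signal, label, half_width=5, base="C"):
    """Return (window, label) pairs centred on each occurrence of `base`.

    `signal` is a toy list with one value per base of `sequence`.
    Windows that would run off either end of the read are skipped.
    """
    chunks = []
    for pos, nt in enumerate(sequence):
        if nt != base:
            continue
        start, end = pos - half_width, pos + half_width + 1
        if start < 0 or end > len(sequence):
            continue
        chunks.append((signal[start:end], label))
    return chunks


# Hypothetical reads over the same reference; label 1 = modified run, 0 = control.
sequence = "AACGTTCGAA"
signal_run_a = list(range(10))        # stand-in for Run A signal
signal_run_b = list(range(100, 110))  # stand-in for Run B signal

training_set = (
    extract_chunks(sequence, signal_run_a, label=1, half_width=2)
    + extract_chunks(sequence, signal_run_b, label=0, half_width=2)
)

for window, label in training_set:
    print(label, window)
```

Each example here covers five bases around one C, so positive and negative examples differ only in the local signal, which is the property described above.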

I would still caution that a fully modified sample may cause issues as a training sample, but in my opinion the Remora framework has a much higher chance of producing successful results. Best of luck in your research!

Best practices training from scratch using only modified inputs #234