Closed by techsavy12 4 years ago
Megalodon provides several functions to assist in validating modified base detection results. These can be found under the megalodon_extras validate group of commands, and more documentation can be found here.
Briefly, the main challenge (especially for RNA) is obtaining a ground truth, which is required to produce validation results (e.g. ROC curves and/or AUC values). Megalodon validation is broken down into two main types: per-read and aggregated. Per-read validation uses the Megalodon per-read modified base database, while aggregated validation generally uses the bedmethyl output from Megalodon, which is also the standard output format for aggregated methylation results across technologies (see the ENCODE consortium specifications).
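To make the ROC/AUC terminology concrete, here is a minimal sketch of how a ROC curve and its AUC can be computed once each read-level call has a score and a ground truth label. This is not Megalodon's implementation. The function names and the toy scores are hypothetical, and it assumes that a higher score means "more likely modified".

```python
def roc_curve(scores, labels):
    """Return (fpr, tpr) lists swept over descending score thresholds.

    scores: per-read scores, higher = more likely modified (assumption).
    labels: 1 for known-modified, 0 for known-unmodified.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one modified and one unmodified call")

    # Sort calls from highest to lowest score.
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume all calls tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
        for k in range(1, len(fpr))
    )


# Toy data (made up for illustration only).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_fpr, toy_tpr = roc_curve(toy_scores, toy_labels)
toy_auc = auc(toy_fpr, toy_tpr)
```

The (fpr, tpr) pairs can be passed to any plotting library to draw the curve. For real data, the scores would come from the per-read modified base output and the labels from whichever ground truth is available.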
In addition to the per-read/aggregated distinction, there are two main options for providing the ground truth. The first is to provide a control sample (via the --control-megalodon-results-dirs argument). With this setting it is assumed that all sites in the main Megalodon results directory are modified and all those in the control Megalodon results directories are unmodified. The allowed sites can be controlled via a number of other options (see megalodon_extras validate results -h and the docs for more details). The second is to pass a ground truth CSV file containing sites which are known to be modified and sites which are known to be unmodified within the sample of interest. For example, for a human genome sample one might use the set of sites with <2% and >98% methylation results from a bisulfite experiment as the ground truth using this setting. See the megalodon_extras modified_bases create_ground_truth command for more details.
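As an illustration of the <2% / >98% idea, the sketch below selects confident ground truth sites from per-site bisulfite methylation percentages. It only shows the thresholding logic. The tuple layout and the function name are hypothetical and do not reflect the file format Megalodon expects, so use create_ground_truth for the real thing.

```python
def select_ground_truth(sites, low_pct=2.0, high_pct=98.0):
    """Keep only sites that are confidently unmodified or modified.

    sites: iterable of (chrom, pos, strand, pct_methylated) tuples
           (hypothetical layout, chosen for this example).
    Returns a list of (chrom, pos, strand, is_modified) tuples.
    """
    ground_truth = []
    for chrom, pos, strand, pct in sites:
        if pct < low_pct:
            # Almost never methylated: treat as known-unmodified.
            ground_truth.append((chrom, pos, strand, False))
        elif pct > high_pct:
            # Almost always methylated: treat as known-modified.
            ground_truth.append((chrom, pos, strand, True))
        # Intermediate sites are ambiguous and are dropped.
    return ground_truth


# Made-up example sites.
example_sites = [
    ("chr1", 100, "+", 0.5),
    ("chr1", 200, "+", 50.0),
    ("chr1", 300, "-", 99.1),
    ("chr1", 400, "+", 2.0),
]
example_truth = select_ground_truth(example_sites)
```

Sites at exactly 2% or 98% are excluded here because the thresholds are strict, matching the "<2% and >98%" wording above.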
Thank you so much for your response. For starters, I tried running the command below to basecall using megalodon, and I encountered the following error: "ERROR: No valid modified base calibration specified". Could you assist me? Thanks in advance.
megalodon fast5files --do-not-use-guppy-server --taiyaki-model-filename model.checkpoint --outputs mod_basecalls mod_mappings --rna
In order to accurately call modified bases, Megalodon requires any new model to have the modified base scores empirically calibrated. See docs for this process here. To skip the use of calibration at basecalling time, use the --disable-mod-calibration option. Note that per-read modified base scores will be the same for ROC and AUC purposes, but aggregated results will likely not be as accurate without completing the calibration steps.
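One way to see why skipping calibration need not change ROC/AUC results: AUC depends only on how the scores rank modified against unmodified calls, so any strictly increasing remapping of the scores leaves it unchanged. The sketch below demonstrates this on made-up scores. Note that treating calibration as a strictly increasing remapping is an assumption of this example, and the `toy_calibration` function is a hypothetical stand-in, not Megalodon's calibration.

```python
import math


def pairwise_auc(mod_scores, unmod_scores):
    """AUC as the fraction of (modified, unmodified) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    correct = 0.0
    for m in mod_scores:
        for u in unmod_scores:
            if m > u:
                correct += 1.0
            elif m == u:
                correct += 0.5
    return correct / (len(mod_scores) * len(unmod_scores))


def toy_calibration(score):
    """Hypothetical strictly increasing remapping (a shifted logistic)."""
    return 1.0 / (1.0 + math.exp(-(2.0 * score - 1.0)))


# Made-up raw scores for known-modified and known-unmodified calls.
raw_mod = [2.1, 0.3, 1.5, -0.2]
raw_unmod = [-1.0, 0.4, -0.5, 0.1]

auc_raw = pairwise_auc(raw_mod, raw_unmod)
auc_remapped = pairwise_auc(
    [toy_calibration(s) for s in raw_mod],
    [toy_calibration(s) for s in raw_unmod],
)
```

Aggregated results are a different matter, since they depend on the actual score values rather than only on their ordering.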
I used the --disable-mod-calibration option, as I am aiming to obtain an ROC curve. However, I have encountered another error, which I believe is an index error. I have attached the log below (log.txt). Could you suggest possible ways to troubleshoot it? Thanks in advance!
This looks like a bug in the taiyaki backend. Honestly, I have not done much testing on the taiyaki backend since enabling the guppy backend. I will have a look at a fix for this issue.
As a workaround for now (and actually the recommended pipeline for training your own model), I would suggest using the dump_json.py command from taiyaki. The resulting json model file can then be used with the guppy backend for megalodon. When running megalodon, specify the RNA guppy config file as well as the new json model via the --guppy-params megalodon argument. Find some relevant docs here, here, and here.
The taiyaki backend has only really been maintained in order to test non-standard neural network layers which are not compatible with guppy. Standard model architectures should use the dump_json.py and guppy backend.
This was indeed a bug in the taiyaki backend specific to modified base models when outputting mod_basecalls. I have just pushed a fix to this issue. But I would still recommend using the guppy backend as it should be much faster and provide better support than the taiyaki backend.
Hello, I have trained an RNA model using taiyaki to identify modified bases. I have used the model for basecalling and have generated an hdf5 file which shows the probabilities of modification. However, I'm unsure how to generate ROC curves or AUC values to evaluate my model.