nanoporetech / megalodon

Megalodon is a research command line tool to extract high-accuracy modified base and sequence variant calls from raw nanopore reads by anchoring the information-rich basecalling neural network output to a reference genome/transcriptome.

Evaluate my Model #52

Closed techsavy12 closed 4 years ago

techsavy12 commented 4 years ago

Hello, I have trained an RNA model using taiyaki to identify modified bases. I have used the model for basecalling and have generated an HDF5 file which shows the probabilities of modification. However, I'm unsure how to generate ROC or AUC curves to evaluate my model.

marcus1487 commented 4 years ago

Megalodon provides several functions to assist in the validation of modified base detection results. These live under the megalodon_extras validate group of commands; more documentation can be found here.

Briefly, the main challenge here (especially for RNA) is obtaining a ground truth, which is required to produce validation results (e.g. ROC and/or AUC). Megalodon validation is mostly broken down into two types: per-read and aggregated. Per-read validation uses the Megalodon per-read modified base database, while aggregated results generally use the bedmethyl output from Megalodon, which is also the standard output format for aggregated methylation results across technologies (see the ENCODE consortium specifications).

In addition to the per-read/aggregated split, there are two main options for providing the ground truth. One is to provide a control sample (via the --control-megalodon-results-dirs argument). With this setting it is assumed that all sites in the main Megalodon results directory are modified and all those in the control Megalodon results directories are unmodified. The allowed sites can be controlled via a number of other options (see megalodon_extras validate results -h and the docs for more details). The second method is to pass a ground truth CSV file containing sites which are known to be modified and sites which are known to be unmodified within the sample of interest. For example, for a human genome sample one might use the sites with <2% and >98% methylation in a bisulfite experiment as the ground truth. See the megalodon_extras modified_bases create_ground_truth command for more details.
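To illustrate the per-read evaluation idea, here is a minimal sketch of how ROC points and AUC are computed from per-read scores paired with ground-truth labels. The `scores`/`labels` values are toy stand-ins for data extracted from Megalodon's per-read modified base database; `megalodon_extras validate results` produces this kind of summary for you, so this is only to make the mechanics concrete.

```python
# Sketch: compute ROC points and AUC from per-read modified-base scores.
# Assumes higher score = more likely modified; the data below are toy
# stand-ins for values from the Megalodon per-read database.

def roc_auc(scores, labels):
    """Return (fpr_list, tpr_list, auc) for binary labels (1 = modified)."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    fpr, tpr = [0.0], [0.0]
    for _score, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / neg)
        tpr.append(tp / pos)
    # Trapezoidal integration over the ROC curve gives the AUC.
    auc = sum((fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2
              for i in range(len(fpr) - 1))
    return fpr, tpr, auc

if __name__ == "__main__":
    # Toy data: modified reads tend to score higher than unmodified ones.
    scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
    labels = [1, 1, 0, 1, 0, 0]
    _, _, auc = roc_auc(scores, labels)
    print(f"AUC = {auc:.3f}")  # prints: AUC = 0.889
```

The same (fpr, tpr) points can be fed to any plotting library to draw the ROC curve.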

techsavy12 commented 4 years ago

Thank you so much for your response. I tried running the command below for starters to basecall using Megalodon, but encountered the following error: `ERROR: No valid modified base calibration specified.` Could you assist me? Thx in advance!

```
megalodon fast5files --do-not-use-guppy-server --taiyaki-model-filename model.checkpoint --outputs mod_basecalls mod_mappings --rna
```

marcus1487 commented 4 years ago

In order to accurately call modified bases Megalodon requires any new model to have the modified base scores empirically calibrated. See docs for this process here. To skip the use of calibration at basecalling time use the --disable-mod-calibration option. Note that per-read modified base scores will be the same for ROC and AUC purposes, but aggregated results will likely not be as accurate without completing the calibration steps.
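For reference, a sketch of the command from above with calibration skipped (the input directory and checkpoint names are the placeholders from the original command):

```shell
# Sketch: run Megalodon with the taiyaki backend, skipping modified base
# score calibration. Per-read scores (used for ROC/AUC) are unaffected;
# aggregated results will be less accurate until calibration is done.
megalodon fast5files \
    --do-not-use-guppy-server \
    --taiyaki-model-filename model.checkpoint \
    --outputs mod_basecalls mod_mappings \
    --disable-mod-calibration \
    --rna
```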

techsavy12 commented 4 years ago

log.txt I used the --disable-mod-calibration option, as I am aiming to obtain a ROC curve. However, I have encountered another error; I believe it is an index error. I have attached the log above. Could you suggest possible ways to troubleshoot it? Thx in advance!

marcus1487 commented 4 years ago

This looks like a bug in the taiyaki backend. Honestly, I have not done much testing on the taiyaki backend since enabling the guppy backend. I will have a look at a fix for this issue.

As a workaround for now (and in fact the recommended pipeline for training your own model) I would suggest using the dump_json.py command from taiyaki. The produced JSON model file can then be passed for use with the guppy backend for Megalodon. When running Megalodon, specify the RNA guppy config file as well as the new JSON model via the --guppy-params argument. Find some relevant docs here, here, and here.

The taiyaki backend has only really been maintained in order to test non-standard neural network layers which are not compatible with guppy. Standard model architectures should use dump_json.py and the guppy backend.
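The recommended pipeline above might look roughly like the sketch below. All paths are placeholders, the dump_json.py invocation depends on your taiyaki checkout (check the taiyaki README for its exact arguments), and the config name is an assumption: pick the RNA guppy config matching your flow cell/chemistry.

```shell
# Step 1 (sketch): export the trained taiyaki checkpoint as a JSON model
# that guppy can load. Script location/flags vary by taiyaki version.
python dump_json.py model.checkpoint > model.json

# Step 2 (sketch): run Megalodon with the guppy backend, pointing guppy
# at the exported model via --guppy-params. The config name here is an
# assumption; substitute the RNA config appropriate to your data.
megalodon fast5files \
    --guppy-config rna_r9.4.1_70bps_hac.cfg \
    --guppy-params "-m /path/to/model.json" \
    --outputs mod_basecalls mod_mappings \
    --rna
```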

marcus1487 commented 4 years ago

This was indeed a bug in the taiyaki backend, specific to modified base models when outputting mod_basecalls. I have just pushed a fix for this issue. That said, I would still recommend using the guppy backend, as it should be much faster and better supported than the taiyaki backend.