Closed Mathias-Boulanger closed 4 months ago
The error here is the intended behavior. Remora models (and thus datasets) are linked to a single canonical base. Multiple alternatives to the canonical base are described by one model, but alternatives to multiple canonical bases should be separated into separate models (and datasets). These models can be run simultaneously in dorado so there should be no penalty at inference time for the models being separated. Hopefully this helps clear up the intentions, but please do post if you have any further questions.
That is indeed clearer. I will train a separate model for each canonical base and then use the two models simultaneously in dorado.
However, I don't understand why I cannot export the PyTorch model to dorado format when the model is trained for the same canonical base but on different motifs (CG and GC, for example). I got this:
remora model export train_results_CpG_GpC/model_best.pt dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1
[23:10:57.468] Loading model
[23:10:57.726] Loaded a torchscript model
[23:10:57.727] Exporting model to dorado format
[23:10:57.921] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/sig_conv1.weight.tensor
[23:10:57.928] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/sig_conv1.bias.tensor
[23:10:57.936] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/sig_conv2.weight.tensor
[23:10:57.942] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/sig_conv2.bias.tensor
[23:10:57.949] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/sig_conv3.weight.tensor
[23:10:57.954] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/sig_conv3.bias.tensor
[23:10:57.961] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/seq_conv1.weight.tensor
[23:10:57.966] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/seq_conv1.bias.tensor
[23:10:57.973] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/seq_conv2.weight.tensor
[23:10:57.979] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/seq_conv2.bias.tensor
[23:10:57.998] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/merge_conv1.weight.tensor
[23:10:58.004] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/merge_conv1.bias.tensor
[23:10:58.012] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/lstm1.weight_ih_l0.tensor
[23:10:58.018] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/lstm1.weight_hh_l0.tensor
[23:10:58.024] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/lstm1.bias_ih_l0.tensor
[23:10:58.030] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/lstm1.bias_hh_l0.tensor
[23:10:58.036] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/lstm2.weight_ih_l0.tensor
[23:10:58.042] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/lstm2.weight_hh_l0.tensor
[23:10:58.048] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/lstm2.bias_ih_l0.tensor
[23:10:58.054] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/lstm2.bias_hh_l0.tensor
[23:10:58.060] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/fc.weight.tensor
[23:10:58.091] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/fc.bias.tensor
[23:10:58.103] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/refine_kmer_levels.tensor
Traceback (most recent call last):
File "/miniconda3/envs/remora_train/bin/remora", line 8, in <module>
sys.exit(run())
File "/miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/main.py", line 71, in run
cmd_func(args)
File "miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/parsers.py", line 939, in run_model_export
export_model_dorado(ckpt, model, args.output_path)
File "/miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/model_util.py", line 213, in export_model_dorado
raise RemoraError("Dorado only supports models with a single motif")
remora.RemoraError: Dorado only supports models with a single motif
Does that mean I should also train two separate models? And how will the methylation be encoded in the BAM: in two separate MM/ML tags, or merged (this is the same modified base 'm')?
Do you have any insights into best practices for training Remora models?
Thank you for your help!
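For reference on the MM/ML question above, here is a minimal sketch of how modified-base calls are laid out in the MM and ML tags according to the SAM tags specification. The function name `decode_mm` is hypothetical, and the sketch assumes forward-oriented reads, '+' strand entries, and one modification code per entry. It makes one point visible: an MM entry is keyed by canonical base, strand, and modification code, and carries no motif information. How Dorado would emit calls from two motif-specific models is not confirmed by this thread.

```python
def decode_mm(seq, mm, ml):
    """Decode MM/ML tag values into per-base modification calls.

    seq: read sequence as stored in the BAM record (forward orientation assumed)
    mm:  MM tag string, e.g. "C+m,0,1;"
    ml:  list of ML values (0-255), in the same order as the MM deltas
    Returns {(base, strand, code): [(read_position, ml_value), ...]}.
    """
    calls = {}
    ml_values = iter(ml)
    for entry in filter(None, mm.split(";")):
        header, *deltas = entry.split(",")
        header = header.rstrip("?.")  # drop the optional skip-mode flag
        base, strand, code = header[0], header[1], header[2:]
        # Positions of every occurrence of the canonical base in the read.
        base_positions = [i for i, b in enumerate(seq) if b == base]
        idx = -1
        decoded = []
        for delta in deltas:
            # Each delta counts canonical bases skipped before the next call.
            idx += int(delta) + 1
            decoded.append((base_positions[idx], next(ml_values)))
        calls[(base, strand, code)] = decoded
    return calls


# Cs sit at read positions 1, 4 and 7; the tag calls the first and third.
example = decode_mm("ACGTCGGC", "C+m,0,1;", [200, 30])
```

In this example both a CG-context call (position 1) and a GC-context call (position 7) end up under the same ("C", "+", "m") key.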
This is currently a limitation in Dorado, not in Remora models. You can train a single model on both the CG and GC contexts, but Dorado only supports a single motif for each Remora model. I think Dorado will also not support multiple models against the same canonical base. If you need these features for your research, these issues are best directed to the Dorado repository.
Given the current state of the software, your options include running remora infer with this model.
I hope this helps a bit or at least points in the right direction.
Thank you for your useful insights! I have transferred the issue to the Dorado repository.
I think I will try converting the model to an all-context one and then filter for my motifs of interest. To do so, do I just need to manually change the metadata of the model?
Using remora infer will work as well, but it will significantly increase the execution time of my pipeline. Still, it's worth a try :)
I'll keep you posted on what works best.
Thank you again
Converting to an all-context model would indeed be a workaround here. Note that performance outside of the trained contexts would likely be quite poor.
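The filtering step of this workaround could look roughly like the sketch below. It assumes the all-context calls are already available as (read position, value) pairs for cytosines, and it represents each motif as a (sequence, focus offset) pair; both the data layout and the function names `in_motif` and `filter_calls` are hypothetical and not part of Remora or Dorado.

```python
def in_motif(seq, pos, motif, focus):
    """True if the base at seq[pos] is the focus base of an occurrence of motif."""
    start = pos - focus
    return start >= 0 and seq[start:start + len(motif)] == motif


def filter_calls(seq, calls, motifs=(("CG", 0), ("GC", 1))):
    """Keep only the calls whose position falls in one of the given motifs.

    calls:  iterable of (read_position, value) pairs from an all-context model
    motifs: (motif sequence, zero-based focus offset) pairs
    """
    return [
        (pos, value)
        for pos, value in calls
        if any(in_motif(seq, pos, motif, focus) for motif, focus in motifs)
    ]


# Cs at positions 1 (ACA context), 5 (GC context) and 7 (CG context).
kept = filter_calls("ACATGCACG", [(1, 250), (5, 240), (7, 10)])
```

Calls outside the trained contexts, such as position 1 here, are dropped rather than trusted, which matches the caveat about poor performance outside those contexts.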
Have you been able to make progress with this fix?
Hi Marcus,
This is still on the to-do list, unfortunately, because of other priorities in the project. But this is actually good timing: I should be able to get to it soon.
This is frustrating because the model as it is trained today could also be used with Bonito. (We are already using Bonito to call CG and GC methylation with a custom model trained with Remora 2.0 and 4KHz-LSK114 datasets.) However, Bonito 0.7.3 does not support Remora > 3.0 models...
Anyway, I'll keep you posted.
I am testing an upgrade of the remora dependency in bonito right now, but as a workaround you should be able to bump the remora version in the bonito requirements.txt file and reinstall. I do not think there are any breaking changes in remora 3.0 in terms of the interface used by bonito.
I just updated the requirements.txt file in the bonito repo with:
ont-remora==3.1.0
pod5==0.3.6
And it's working like a charm! Thank you. I am currently testing the behavior of the model on a large dataset in comparison with the AllC 5mC pretrained model.
In parallel, I finally succeeded in training the same model with the 'allC' settings and converting it for dorado usage. It's currently running. I'll keep you posted.
Thanks for your help in letting us move forward.
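The requirements.txt change described above can be sketched as a small helper that rewrites the pinned versions before reinstalling. The function name `bump_pins` is hypothetical, and the old version numbers in the example are placeholders rather than the actual contents of the bonito requirements file; only the target pins (ont-remora 3.1.0 and pod5 0.3.6) come from this thread.

```python
import re


def bump_pins(requirements_text, pins):
    """Return requirements text with the given packages pinned to new versions.

    pins: mapping of lowercase package name to the version to pin.
    Lines for packages not listed in pins are left untouched.
    """
    out = []
    for line in requirements_text.splitlines():
        # The package name ends at the first specifier, extra, marker or space.
        name = re.split(r"[=<>!~\s\[;]", line.strip(), maxsplit=1)[0]
        if name.lower() in pins:
            out.append(f"{name}=={pins[name.lower()]}")
        else:
            out.append(line)
    return "\n".join(out)


# Placeholder "before" file; the old versions here are made up for illustration.
before = "ont-remora==0.0.0\npod5==0.0.0\nnumpy\n"
after = bump_pins(before, {"ont-remora": "3.1.0", "pod5": "0.3.6"})
```

After writing the result back to requirements.txt, bonito needs to be reinstalled for the new pins to take effect.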
The model is working very nicely! You can see that it corrects the GpC bias that we identified in the 5mC allC 5kHz model. Overall, the custom model reduces the number of methylated outliers (compared to BS) in the bottom-right corner of the scatter plots.
Here is the model performance:
I am looking forward to converting this model into a Dorado-usable one when the two-context feature is supported!
Thank you again
Hi,
I got an error (see below) when running remora dataset prepare using multiple focus bases. I already trained models in the same spirit using remora 2.0, which is why I don't know whether this is expected behavior... If it is expected, how can I train models on a specific mod base while taking into account that other bases/contexts can also be methylated?
Also, a more general question: what is the best practice for training remora models and running inference with them? Should I subset 10-15% of my training data for validation (and use the rest to train), or should I use everything to train and infer with the same dataset?
Thank you for your help
Remora command:
Error log:
Remora version:
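The hold-out option raised in the question above (setting aside 10-15% of the data for validation) can be illustrated with a read-level split. This is only a sketch of that option, not a recommendation from the Remora developers; the function name `split_read_ids` and the idea of splitting by read ID before preparing datasets are assumptions.

```python
import random


def split_read_ids(read_ids, val_fraction=0.1, seed=0):
    """Split read IDs into disjoint (train, validation) lists.

    Splitting by read keeps all chunks from one read on the same side,
    so the validation set shares no reads with the training set.
    """
    ids = sorted(read_ids)  # sort first so the split is reproducible
    random.Random(seed).shuffle(ids)
    n_val = max(1, round(len(ids) * val_fraction))
    return ids[n_val:], ids[:n_val]


train_ids, val_ids = split_read_ids([f"read{i}" for i in range(100)], 0.15)
```

Evaluating on reads that were also used for training would tend to overstate accuracy, which is the main argument for keeping a held-out subset.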