Mathias-Boulanger commented 10 months ago

Hi,

I got an error (see below) when running remora dataset prepare with multiple focus bases. I have already trained models in the same spirit using Remora 2.0, which is why I don't know whether this is expected behavior.

If this is expected, how can I train a model for a specific modified base while accounting for the fact that other bases/contexts can also be methylated?

Also, a more general question: what is the best practice for training Remora models? Should I hold out 10-15% of my training data for validation (and train on the rest), or should I use everything for training and run inference on the same dataset?

Thank you for your help

Remora command:

remora dataset prepare \
    --output-path ${wd}data/0_unmeth/prepData/mock_5_CpG_6mA \
    --refine-kmer-level-table ${wd}data/ONT/9mer_levels_v1.txt \
    --refine-rough-rescale \
    --motif CG 0 --motif A 0 \
    --mod-base-control \
    --max-chunks-per-read 20 \
    --num-extract-alignment-workers 24 \
    --num-extract-chunks-workers 24 \
    ${wd}data/0_unmeth/0_unmeth.pod5 \
    ${wd}data/0_unmeth/0_unmeth.pass.bam

Error log:

[14:37:43.988] Extracting read IDs from POD5
[14:37:49.204] Found 1,242,986 valid BAM records. Found signal in POD5 for 100.00% of BAM records.
[14:37:49.302] Making reference-anchored training data
[14:37:49.302] Opening dataset for output
Traceback (most recent call last):
  File "/miniconda3/envs/remora_train/bin/remora", line 8, in <module>
    sys.exit(run())
  File "/miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/main.py", line 71, in run
    cmd_func(args)
  File "/miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/parsers.py", line 302, in run_dataset_prepare
    extract_chunk_dataset(
  File "/miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/prepare_train_data.py", line 165, in extract_chunk_dataset
    metadata=DatasetMetadata(
  File "<string>", line 23, in __init__
  File "/miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/data_chunks.py", line 847, in __post_init__
    self.check_motifs()
  File "/miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/data_chunks.py", line 824, in check_motifs
    raise RemoraError(
remora.RemoraError: Cannot create dataset with multiple motif focus bases: {'A', 'C'}

Remora version:

> remora -v
Remora version: 3.0.0

marcus1487 commented 10 months ago

The error here is the intended behavior. Remora models (and thus datasets) are linked to a single canonical base. Multiple alternatives to the canonical base are described by one model, but alternatives to multiple canonical bases should be separated into separate models (and datasets). These models can be run simultaneously in dorado so there should be no penalty at inference time for the models being separated. Hopefully this helps clear up the intentions, but please do post if you have any further questions.
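As a rough illustration of that rule (not Remora's actual code), the sketch below groups (motif, focus position) pairs by the base at the focus position, so that each group corresponds to one dataset and one model. The helper names `group_by_focus_base` and `motif_args` are hypothetical; only the `--motif` flag is taken from the command above.

```python
from collections import defaultdict


def group_by_focus_base(motifs):
    """Group (motif, focus_pos) pairs by the base at the focus position.

    Hypothetical helper for illustration; not part of Remora.
    """
    groups = defaultdict(list)
    for motif, focus_pos in motifs:
        if not 0 <= focus_pos < len(motif):
            raise ValueError(f"focus position {focus_pos} outside motif {motif!r}")
        groups[motif[focus_pos]].append((motif, focus_pos))
    return dict(groups)


def motif_args(group):
    """Build the --motif arguments for one dataset (a single focus base)."""
    args = []
    for motif, focus_pos in group:
        args += ["--motif", motif, str(focus_pos)]
    return args


# The motifs from the failing command: CG (focus base C) and A (focus base A).
groups = group_by_focus_base([("CG", 0), ("A", 0)])
for base, group in sorted(groups.items()):
    print(base, " ".join(motif_args(group)))
```

Each printed line is the motif portion of one separate remora dataset prepare call. Motifs that share a focus base, such as CG with focus 0 and GC with focus 1, fall into the same group.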

Mathias-Boulanger commented 9 months ago

That is indeed clearer. I will train a separate model for each canonical base and then use the two models simultaneously in dorado.

However, I don't understand why I cannot export the PyTorch model to dorado format when the model is trained for the same canonical base but on different motifs (CG and GC, for example). I got this:

remora model export train_results_CpG_GpC/model_best.pt dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1           
[23:10:57.468] Loading model                                                                                                                                            
[23:10:57.726] Loaded a torchscript model                                                                                                                               
[23:10:57.727] Exporting model to dorado format                                                                                                                         
[23:10:57.921] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/sig_conv1.weight.tensor                            
[23:10:57.928] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/sig_conv1.bias.tensor                              
[23:10:57.936] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/sig_conv2.weight.tensor                            
[23:10:57.942] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/sig_conv2.bias.tensor                              
[23:10:57.949] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/sig_conv3.weight.tensor                            
[23:10:57.954] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/sig_conv3.bias.tensor                              
[23:10:57.961] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/seq_conv1.weight.tensor                            
[23:10:57.966] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/seq_conv1.bias.tensor                              
[23:10:57.973] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/seq_conv2.weight.tensor                            
[23:10:57.979] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/seq_conv2.bias.tensor                              
[23:10:57.998] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/merge_conv1.weight.tensor                          
[23:10:58.004] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/merge_conv1.bias.tensor                            
[23:10:58.012] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/lstm1.weight_ih_l0.tensor                          
[23:10:58.018] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/lstm1.weight_hh_l0.tensor                          
[23:10:58.024] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/lstm1.bias_ih_l0.tensor                            
[23:10:58.030] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/lstm1.bias_hh_l0.tensor                            
[23:10:58.036] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/lstm2.weight_ih_l0.tensor                          
[23:10:58.042] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/lstm2.weight_hh_l0.tensor                          
[23:10:58.048] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/lstm2.bias_ih_l0.tensor                            
[23:10:58.054] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/lstm2.bias_hh_l0.tensor                            
[23:10:58.060] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/fc.weight.tensor                                   
[23:10:58.091] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/fc.bias.tensor                                     
[23:10:58.103] dna_r10.4.1_e8.2_400bps_sup@v4.2.0_5mC_CG_GC_krebsLab@v1/refine_kmer_levels.tensor                          
Traceback (most recent call last):                                                                                                                                      
  File "/miniconda3/envs/remora_train/bin/remora", line 8, in <module>                                                                                
    sys.exit(run())                                                                                                                                                     
  File "/miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/main.py", line 71, in run                                                   
    cmd_func(args)                                                                                                                                                      
  File "miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/parsers.py", line 939, in run_model_export                                  
    export_model_dorado(ckpt, model, args.output_path)                                                                                                                  
  File "/miniconda3/envs/remora_train/lib/python3.10/site-packages/remora/model_util.py", line 213, in export_model_dorado                            
    raise RemoraError("Dorado only supports models with a single motif")                                                                                                
remora.RemoraError: Dorado only supports models with a single motif

Does that mean I should also train two separate models? And how will the methylation be encoded in the BAM: in two separate MM/ML tags, or merged (since this is the same modified base 'm')?

Do you have any insights into best practices for training Remora models?

Thank you for your help!

marcus1487 commented 9 months ago

This is a current limitation of Dorado, not of Remora models. You can train a single model on both CG and GC contexts, but Dorado only supports a single motif per Remora model. I think Dorado also does not support multiple models against the same canonical base. These issues would be best directed to the Dorado repository if they are required for your research.

Given the current state of the software, your options are to:

- Run remora infer with this model
  - Remora does not currently support multiple modified base models, though, so you'd have to run the A-mods model separately.
  - This does not require re-basecalling, but running remora on most contexts may not be too much faster than basecalling alone.
- "Manually" convert your CG+GC-context model to an "all-context" model
  - Since calls will be made in all contexts, you may want to filter calls to those matching a basecalled motif. See the modkit repo for this type of command.
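To make the filtering idea concrete, here is a minimal sketch on a toy read. It assumes calls are given as 0-based positions on the forward basecalled sequence and ignores strand handling, BAM parsing, and MM/ML tags, all of which a real tool such as modkit has to deal with. The function names are hypothetical.

```python
def in_motif_context(seq, pos, motifs):
    """Return True if seq[pos] sits at the focus position of any motif."""
    for motif, focus_pos in motifs:
        start = pos - focus_pos
        # A slice running past the end of the read is shorter than the
        # motif and therefore never compares equal.
        if start >= 0 and seq[start:start + len(motif)] == motif:
            return True
    return False


def filter_calls(seq, call_positions, motifs):
    """Keep only the call positions whose sequence context matches a motif."""
    return [p for p in call_positions if in_motif_context(seq, p, motifs)]


# Toy example: the cytosines are at positions 0, 4 and 9.
seq = "CATGCGTAGCT"
all_context_calls = [0, 4, 9]
motifs = [("CG", 0), ("GC", 1)]  # (motif, focus position of the C)
kept = filter_calls(seq, all_context_calls, motifs)
print(kept)  # → [4, 9]
```

Position 0 is dropped because its C is followed by A and has no preceding base, while positions 4 (CG, and also GC) and 9 (GC) are kept.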

I hope this helps a bit or at least points in the right direction.

Mathias-Boulanger commented 9 months ago

Thank you for your useful insights! I transferred the issue to the Dorado repository here

I think I will try converting the model to an all-context one and then filter for my motifs of interest. To do so, do I just need to manually change the metadata of the model?

Using remora infer will work as well, but it will significantly increase the execution time of my pipeline. Still, it's worth a try :)

I'll keep you posted on which worked best.

Thank you again

marcus1487 commented 6 months ago

Converting to an all-context model would indeed be a workaround here. Note that performance outside of the trained contexts would likely be quite poor.

Have you been able to make progress with this workaround?

Mathias-Boulanger commented 6 months ago

Hi Marcus,

This is still on the to-do list, unfortunately, due to the priorities in the project. But this is actually good timing, as I should get to it soon.

This is frustrating because the model as trained today could be used with Bonito as well. (We are already using Bonito to call CG and GC methylation with a custom model trained with Remora 2.0 and 4 kHz LSK114 datasets.) However, Bonito 0.7.3 does not support Remora > 3.0 models...

Anyway, I'll keep you posted.

marcus1487 commented 6 months ago

I am testing upgrading the remora dep in bonito right now, but as a workaround you should just be able to bump the remora version in the bonito requirements.txt file and reinstall. I do not think there are any breaking changes in remora 3.0 in terms of the interface used by bonito.
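If you prefer to script that edit, something like the sketch below would do it. It is only an illustration: `bump_pin` is a hypothetical helper, the sample requirements text is made up rather than copied from Bonito, and the version shown is a placeholder for whichever Remora release you want.

```python
import re


def bump_pin(requirements_text, package, new_version):
    """Pin `package` to `new_version`, leaving every other line untouched."""
    out = []
    for line in requirements_text.splitlines():
        # The requirement name is everything before the first version
        # operator, extras bracket, environment marker or whitespace.
        name = re.split(r"[=<>!~\s\[;]", line.strip(), maxsplit=1)[0]
        if name.lower() == package.lower():
            line = f"{package}=={new_version}"
        out.append(line)
    return "\n".join(out) + "\n"


# Made-up sample file contents, for illustration only.
sample = "numpy\nont-remora==2.0.0\npod5>=0.2\n"
updated = bump_pin(sample, "ont-remora", "3.1.0")
print(updated, end="")
```

After writing the updated text back to requirements.txt, reinstall Bonito so the new pin takes effect.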

Mathias-Boulanger commented 6 months ago

I just updated the requirements.txt file in the bonito repo with:

ont-remora==3.1.0
pod5==0.3.6

And it's working like a charm! Thank you. I am currently testing the behavior of the model on a large dataset in comparison with the AllC 5mC pretrained model.

In parallel, I finally succeeded in training the same model in the 'allC' setting and converting it for dorado usage. It's currently running. I'll keep you posted.

Thanks for your help in letting us move forward.

Mathias-Boulanger commented 6 months ago

[Image: Asset 2@600x]

The model is working very nicely! You can see that it corrects the GpC bias that we identified in the 5mC allC 5kHz model. Overall, the custom model reduces the number of methylated outliers (compared to BS) in the bottom right corner of the scatter plots.

Here is the model performance: 4_Cpg_GpC

I am looking forward to converting this model into a Dorado-usable one once the two-context feature is supported!

Thank you again

nanoporetech / remora

remora 3.0: error when training on different canonical bases #140
