nanoporetech / remora

Methylation/modified base calling separated from basecalling.
https://nanoporetech.com

canonical sample vs modified sample #113

Closed BioRB closed 11 months ago

BioRB commented 11 months ago

Hello developers! We are planning a new sequencing run for epigenetic modification detection in a bacterium. I have some doubts about the protocol because it is not clear from the pipeline provided by ONT. My question is: when you say "can_reads" and "mod_reads", what exactly are you referring to?

I mean, I have my gDNA and we want to use remora for methylation detection, right? Should I generate 2 libraries in parallel, one for the native gDNA ( that holds the methylations = mod_reads) and another starting from the DNA after demethylation treatment ( or PCR amplification that "removes" the methylations?) = can_reads? In the workflow presented here: https://nanoporetech.com/sites/default/files/s3/literature/epigenetics-workflow.pdf it is not clear. Especially this sentence:"Remora models separate canonical basecalling from methylation calling,thus enabling the highest quality canonical and methylation calls from a single run, with minimal computational overhead. " So we need to do 2 libraries or just the native DNA is enough? Can you clarify this point? Can Remora work without the PCR amplified sample?

This concept is quite confusing to me. When we talked with a specialist from ONT, he told us that, in theory, you should see methylation by analyzing only the gDNA (I did this with older tools like mCaller and Nanopolish). So before starting our experiment, I would like your comments on it. Thanks for the advice. Best, RB

marcus1487 commented 11 months ago

It would be good to start with the goal of your analysis, but I will try to cover some of the direct questions and provide some general guidance.

The Remora code base is mainly geared towards training modified base models, including data preparation, training and validation. As such most of the instructions in the README are aimed at helping users train their own models. The models produced internally from the Remora code base are then shipped with our production basecaller (Dorado and previously Guppy). If you are simply looking to use our trained Remora models you should use Dorado. Some research release models can also be found in the Rerio repository. I will update the README to make this more clear at the very top.

The "can_reads" and "mod_reads" are training sets for our 5mC CG-context model. They are a PCR-amplified (canonical) sample and an enzymatically modified sample, respectively. If you are not interested in training a model this is not important.

I mean, I have my gDNA and we want to use remora for methylation detection, right?

You want to use Remora models which are built into Dorado. You should use Dorado with the --modified-bases flag set to get modified base calls output into the BAM file.
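As a rough illustration of the advice above, the following Python sketch assembles a Dorado command line that uses the `--modified-bases` flag. The model path, the input directory and the modification name `5mCG_5hmCG` are placeholder assumptions, not values taken from this thread; check the names against the models available in your Dorado installation.

```python
import shlex


def build_dorado_command(model, reads_dir, modified_bases):
    """Return the argv list for a Dorado run with modified base calling.

    model          -- basecalling model name or path (placeholder here)
    reads_dir      -- directory holding the raw signal files
    modified_bases -- list of modification names passed to --modified-bases
    """
    if not modified_bases:
        raise ValueError("at least one modified base name is required")
    return [
        "dorado", "basecaller", model, reads_dir,
        "--modified-bases", *modified_bases,
    ]


# Placeholder values (assumptions): adapt them to your own data and models.
cmd = build_dorado_command("path/to/basecall_model", "pod5_dir/", ["5mCG_5hmCG"])

# Dorado writes its BAM output to stdout, so the shell redirect captures it.
print(shlex.join(cmd) + " > calls.bam")
```

The function only builds the argument list and does not run Dorado, so it can be tried without a sequencing dataset at hand.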

Should I generate 2 libraries in parallel, one for the native gDNA ( that holds the methylations = mod_reads) and another starting from the DNA after demethylation treatment ( or PCR amplification that "removes" the methylations?) = can_reads?

For modified bases for which we have a released model (5mC, 6mA, and 5hmC in CG contexts) you only need a genomic sample. For other modified bases we do not currently have a model released. Bases coming in the near future are 5hmC all-context, 4mC all-context and m6A in RNA.

Especially this sentence:"Remora models separate canonical basecalling from methylation calling,thus enabling the highest quality canonical and methylation calls from a single run, with minimal computational overhead. "

This is mostly referring to a break from previous models where the modified base calling and canonical basecalling were completed by the same model. The Remora models attach onto the output of the canonical basecaller (hence the name Remora) and apply a second neural network model trained specifically on the modified base detection task. When the two tasks were combined in the previous models, the canonical sequence accuracy was reduced when adding modified bases. This is no longer the case with Remora models, as the canonical basecalls are completely unchanged when performing modified base calling. This allows the highest accuracy canonical and modified base calls at the same time.
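One place this separation shows up is the BAM output: the read sequence is stored as usual, and modified base calls travel alongside it in the MM and ML tags defined by the SAM optional-tags specification. The sketch below decodes those two tags for a single read. It is a simplified illustration under stated assumptions: forward-strand entries only, one modification code per MM entry, and a concrete base (not `N`) as the fundamental base. The example sequence and tag values are made up.

```python
def parse_mm_ml(seq, mm, ml):
    """Decode MM/ML tag values into {(base, mod_code): [(position, probability)]}.

    seq -- read sequence in its original (as-sequenced) orientation
    mm  -- MM tag value, e.g. "C+m?,0,1;"
    ml  -- ML tag values as a list of ints in 0..255
    """
    calls = {}
    ml_index = 0
    for entry in filter(None, mm.split(";")):
        fields = entry.split(",")
        header, deltas = fields[0], [int(x) for x in fields[1:]]
        base, strand = header[0], header[1]
        mod_code = header[2:].rstrip("?.")  # drop the optional '?' or '.' flag
        if strand != "+":
            raise NotImplementedError("only '+' strand entries are handled here")
        # Indices in seq of every occurrence of the fundamental base.
        base_positions = [i for i, b in enumerate(seq) if b == base]
        occurrence = -1
        decoded = []
        for delta in deltas:
            # Each delta counts the occurrences of `base` skipped before the call.
            occurrence += delta + 1
            # ML value v covers probabilities [v/256, (v+1)/256); use the midpoint.
            probability = (ml[ml_index] + 0.5) / 256
            ml_index += 1
            decoded.append((base_positions[occurrence], probability))
        calls[(base, mod_code)] = decoded
    return calls


# Made-up example read: C occurs at indices 1, 4, 6 and 7.
seq = "ACGTCGCCG"
calls = parse_mm_ml(seq, "C+m?,0,1;", [250, 12])
positions = [pos for pos, _ in calls[("C", "m")]]
```

Note that the parser only reads `seq` and never rewrites it, which mirrors the point above: the modified base calls are annotations layered on top of unchanged canonical basecalls. For reads aligned to the reverse strand, the stored sequence would need to be reverse-complemented first; libraries such as pysam handle these cases for real data.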

So we need to do 2 libraries or just the native DNA is enough? Can you clarify this point? Can Remora work without the PCR amplified sample?

Remora models work without a canonical sample. You do not need a canonical control sample to obtain the highest quality modified base calls.

icemduru commented 11 months ago

Hi @marcus1487, since there is no model for RNA modifications, I guess we have to have a canonical sample for modified base calling, right? And do we need a canonical sample for each organism, or is it possible to have one ultimate canonical sample and use that for all organisms?