nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Combine modified bam files rather than re-basecalling #732

Closed samuelmontgomery closed 5 months ago

samuelmontgomery commented 6 months ago

Hi,

Just wondering if it is possible to combine multiple bam files where multiple models have been used to call modifications on the same reads? I have basecalled my data using 5mC_5hmC and 6mA, but would like to add in the 4mC model from Rerio without re-basecalling with all three models (as just the two takes ~4 days to complete). Is it possible to merge them after basecalling?

Cheers

HalfPhoton commented 6 months ago

Hi @samuelmontgomery, I think you'll potentially run into issues with duplicate reads in the output. I'll ask the mods team if they have a recommendation, as there may be a way to merge mods tags assuming the canonical basecalls are identical.

HalfPhoton commented 6 months ago

After discussing this with the mods team: merging the outputs from two modbase models sharing the same underlying base (C in this case) would invalidate the output.

marcus1487 commented 6 months ago

To elaborate a bit further, modified base calls that conflict (i.e. describe modified bases against the same canonical base) cannot be logically merged given only the output probabilities of two separate models. For example, let's take the situation where you have run the 5mC+5hmC model and then the 5mC+4mC model on the same reads (assuming identical canonical basecalls, which should be true). At a base where the call in one set is 100% 5hmC and in the other set is 100% 4mC, there is nothing the merge script could do to produce a valid result. Setting 4mC and 5hmC probabilities would not likely be a useful output. Using standard filtering in modkit pileup, this would mean that a 4mC or 5hmC modified base call would never be called.
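A toy sketch of the conflict described above, in plain Python. Everything in it is an assumption for illustration: the probability values, the helper names (`merge_naively`, `renormalise`, `call`), the idea of splitting the probability evenly, and the 0.7 pass threshold. It does not reproduce what dorado or modkit do internally; it only shows that two per-model probability sets for the same canonical C cannot be combined into one valid distribution.

```python
# Illustrative only: all values and helper names are hypothetical,
# not dorado or modkit output or API.

def merge_naively(probs_a, probs_b):
    """Union of the two models' labels; on shared labels keep the larger value."""
    merged = dict(probs_a)
    for mod, p in probs_b.items():
        merged[mod] = max(p, merged.get(mod, 0.0))
    return merged

def renormalise(probs):
    """Scale probabilities down so they sum to at most 1."""
    total = sum(probs.values())
    if total <= 1.0:
        return dict(probs)
    return {mod: p / total for mod, p in probs.items()}

def call(probs, threshold):
    """Return the most likely state if it passes the threshold, else None."""
    states = dict(probs)
    # Whatever probability is left over belongs to the canonical base.
    states["C"] = max(0.0, 1.0 - sum(probs.values()))
    best = max(states, key=states.get)
    return best if states[best] >= threshold else None

# One cytosine, called by two separate models on the same read.
model_a = {"5mC": 0.0, "5hmC": 1.0}  # 5mC+5hmC model: 100% 5hmC
model_b = {"5mC": 0.0, "4mC": 1.0}   # 5mC+4mC model: 100% 4mC

merged = merge_naively(model_a, model_b)
total = sum(merged.values())          # exceeds 1, so not a valid distribution

split = renormalise(merged)           # forces 5hmC and 4mC to share the mass
HYPOTHETICAL_THRESHOLD = 0.7          # assumed pass threshold, for illustration
result = call(split, HYPOTHETICAL_THRESHOLD)

print(merged, total)
print(split, result)
```

Under these assumptions the merged probabilities sum to more than 1, and after rescaling neither 4mC nor 5hmC clears the threshold, so the position yields no modified base call at all even though each model on its own was fully confident.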

The real fix for the specific situation here would be for us to generate a new 5mC+5hmC+4mC model, which would produce valid probabilities for each modified base starting from the input signal. We have not prioritized this, as we have not observed a use case for this model to date. If this is a required model for your application, could you open an issue in Rerio describing your application, to help track progress/interest on this topic?

samuelmontgomery commented 5 months ago

Thanks very much. As we're doing bacterial genomes, the 6mA and 4mC are obviously pretty important, but I didn't want to miss out on any 5mC/5hmC. I will just run the 4mC_5mC and 6mA models in the future!