Question about pseU calling model rna004_130bps_sup@v5.0.0_pseU@v1

AnWiercze commented 3 weeks ago

Hi,

I re-basecalled two control sequences containing 100% and 0% pseU-modified U's at one position, respectively, using the latest dorado basecaller (v0.7.0) and pseU modification calling (rna004_130bps_sup@v5.0.0_pseU@v1).

With a pseU call probability threshold of > 0.7, I got ~100% and ~0% modified reads for each control sequence, respectively... as expected. However, I then realized that reads containing a pseU-specific U>C mismatch (known from both the Guppy and Dorado basecallers) are simply skipped for modification calling, as they are very likely already classified as mutations. Looking at all reads (including “mutated” or deleted bases), the proportion of pseU bases/reads classified by Dorado is much lower (down to 11% modified bases in total).

This creates a misleading representation of the total proportion of modified bases.

What would you recommend to address this problem? I was thinking of counting all mismatched U>C bases together with the bases classified as modified by Dorado to get a complete picture of modified and unmodified bases, but this is certainly not optimal!
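To make the three ways of counting concrete, here is a minimal sketch in plain Python. It is not part of Dorado or any downstream tool: the function `summarize_site`, the `(base, mod_prob)` input format, and the toy read counts are all hypothetical and chosen only for illustration. It assumes you have already extracted, for each read covering the site, the called base and (where present) the pseU probability.

```python
from collections import Counter


def summarize_site(observations, threshold=0.7):
    """Summarize pseU evidence at one reference U position.

    observations: list of (base, mod_prob) tuples, one per read (hypothetical format).
      base     -- called base at the site, e.g. "U", "C", or "-" for a deletion
      mod_prob -- pseU probability in [0, 1] when a call exists, else None
    """
    counts = Counter()
    for base, mod_prob in observations:
        if base == "U" and mod_prob is not None:
            # Read carries a modification call: apply the probability threshold.
            key = "called_mod" if mod_prob > threshold else "called_unmod"
            counts[key] += 1
        elif base == "C":
            # U>C mismatch: no modification call is available for this read.
            counts["u_to_c"] += 1
        else:
            # Deletions, other mismatches, or U without a modification call.
            counts["other"] += 1

    total = sum(counts.values())
    called = counts["called_mod"] + counts["called_unmod"]
    nan = float("nan")
    return {
        # Fraction among reads that received a modification call.
        "frac_mod_among_called": counts["called_mod"] / called if called else nan,
        # Same numerator, but over every read covering the site.
        "frac_mod_all_reads": counts["called_mod"] / total if total else nan,
        # Proposed workaround: treat every U>C mismatch as modified (assumption).
        "frac_mod_plus_u_to_c": (counts["called_mod"] + counts["u_to_c"]) / total
        if total
        else nan,
    }


# Toy data, made up for illustration only (not the control data discussed here).
toy_reads = [("U", 0.95)] * 2 + [("C", None)] * 6 + [("-", None)] * 2
print(summarize_site(toy_reads))
```

Note that the third number treats every U>C mismatch as a modified base, so ordinary basecalling errors or real variants at that position would inflate it.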

Thank you very much for your help!

Best regards, Anna

marcus1487 commented 3 weeks ago

On our ground truth strands internally we are only seeing a 6% U->C error rate at ground truth pseudouridine sites. So seeing this large a mismatch rate at pseudouridine sites is quite unexpected. Would you mind sharing a small bit of the data you are testing here?

AnWiercze commented 3 weeks ago

Sure! In which format do you need the data and how can I share it with you? As this is unpublished data, I cannot upload the files to public repos.

This is what the IGV screenshot of the modified position in a CCUAG context looks like: IGV_control_100

Did you train the model with random k-mers around the modified position? I think the U>C mismatch rate is highly k-mer specific, and maybe some k-mers that don't show a U>C mismatch were overrepresented in the training dataset?

marcus1487 commented 3 weeks ago

A sample of the POD5 and the reference would be sufficient for exploration on our end. Can you email me at marcus.stoiber[at]nanoporetech.com to discuss file transfer?

For training, we use biological samples with a spike-in IVT. So the k-mer content is likely quite biased, but the 5-mer contexts should be quite well covered with the samples chosen. This is certainly an unexpected result.

nanoporetech / dorado

Question about pseU calling model rna004_130bps_sup@v5.0.0_pseU@v1 #867