nanoporetech / medaka

Sequence correction provided by ONT Research
https://nanoporetech.com
Other
391 stars 73 forks source link

How does medaka handle duplex reads/quality scores? #463

Closed shenker closed 8 months ago

shenker commented 8 months ago

I couldn't find this documented anywhere, and looking through the code, I couldn't figure it out myself. I see a num_qstrat parameter in medaka/features.py, which looks relevant. It looks to be set to 15.

R10.4.1 non-HD flowcells produce a mix of (mostly) simplex and (~10%, in my hands) duplex reads. Since duplex reads have a much lower error rate than simplex reads, presumably they should be upweighted (or otherwise distinguished) in medaka's pileup feature matrix relative to simplex reads. I don't want to throw out simplex reads, because that's most of my data, and I want to squeeze out as much consensus accuracy as I can. Assuming dorado is outputting sensible per-base qscores, weighting based on those qscores should suffice, since duplex reads should be much higher quality than simplex reads. medaka/features.py includes some model classes which appear to stratify features by qscores (with a default stratification of 15?). So medaka may already be doing the right thing, but I wanted to check.

1) Could you clarify how medaka uses input read qscores? 2) What is the recommended medaka workflow for using a mix of duplex+simplex reads as input?

For context, I'm sequencing ~4kb PCR amplicons (which are barcoded using 10^6 custom barcodes). I'm grouping reads by barcodes, with 10-100x (duplex+simplex) reads per barcode, and feeding those read groups into medaka smolecule to get high-quality consensus sequences.

3) Speculative question: I haven't gotten my hands on R10.4.1 high duplex flowcells yet. In the regime where most of my reads are duplex, I'm curious how much improvement medaka offers over simply running duplex reads through spoa/abpoa. Do you have a guess?

cjw85 commented 8 months ago

Hi @shenker,

  1. Medaka does not use quality scores. The code you've referenced was a research idea that isn't currently used in any of the production consensus models.
  2. We have no recommendation for this: I don't believe we have released any models for handling duplex data. By the time you are handling many-fold coverage of reads the power of duplex to resolve error at the single-read level is reduced through the population observations. (If you trawl the code hard enough you will actually find that medaka has the ability to handle simultaneously different modalities of data).
  3. See 2.

medaka is at its heard very much a consensus caller, as distinct from a variant caller. Even the haploid variant calling mode (and defunct diploid calling pipeline) is implemented as a consensus call with post processing to pull out differences to a reference. Duplex is good for pulling out e.g. low frequency variants in a polulation of molecules, or consensus calling at low depth; at higher depth and wanting simply a consensus, the advantage over simplex is less clear cut.