Closed shenker closed 8 months ago
Hi @shenker,
medaka is at its heard very much a consensus caller, as distinct from a variant caller. Even the haploid variant calling mode (and defunct diploid calling pipeline) is implemented as a consensus call with post processing to pull out differences to a reference. Duplex is good for pulling out e.g. low frequency variants in a polulation of molecules, or consensus calling at low depth; at higher depth and wanting simply a consensus, the advantage over simplex is less clear cut.
I couldn't find this documented anywhere, and looking through the code, I couldn't figure it out myself. I see a
num_qstrat
parameter in medaka/features.py, which looks relevant. It looks to be set to 15.R10.4.1 non-HD flowcells produce a mix of (mostly) simplex and (~10%, in my hands) duplex reads. Since duplex reads have a much lower error rate than simplex reads, presumably they should be upweighted (or otherwise distinguished) in medaka's pileup feature matrix relative to simplex reads. I don't want to throw out simplex reads, because that's most of my data, and I want to squeeze out as much consensus accuracy as I can. Assuming dorado is outputting sensible per-base qscores, weighting based on those qscores should suffice, since duplex reads should be much higher quality than simplex reads. medaka/features.py includes some model classes which appear to stratify features by qscores (with a default stratification of 15?). So medaka may already be doing the right thing, but I wanted to check.
1) Could you clarify how medaka uses input read qscores? 2) What is the recommended medaka workflow for using a mix of duplex+simplex reads as input?
For context, I'm sequencing ~4kb PCR amplicons (which are barcoded using 10^6 custom barcodes). I'm grouping reads by barcodes, with 10-100x (duplex+simplex) reads per barcode, and feeding those read groups into
medaka smolecule
to get high-quality consensus sequences.3) Speculative question: I haven't gotten my hands on R10.4.1 high duplex flowcells yet. In the regime where most of my reads are duplex, I'm curious how much improvement medaka offers over simply running duplex reads through spoa/abpoa. Do you have a guess?