Run drprg - Githubissues

mbhall88 commented 1 year ago

Run drprg on all samples.

This is nearly sorted, but there is one sample in the initial testing set that fails due to https://github.com/rmcolq/pandora/issues/294.

mbhall88 commented 1 year ago

The tl;dr of https://github.com/rmcolq/pandora/issues/294 is that this sample has short reads (~50bp), all with Ns in the middle, so we lose a lot of minimizers. The default minimum size of a cluster of hits in pandora is 10, and we basically never get more than that on a read for this sample (https://github.com/rmcolq/pandora/pull/295#issuecomment-1244883458 sums this up).
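To put rough numbers on the "Ns in the middle" problem, here is a minimal sketch that counts how many k-mers in a read contain no N. It assumes that any k-mer overlapping an N cannot be used for minimizer hits, and it uses k = 15 purely for illustration; the function name and the toy read are hypothetical and are not taken from pandora or drprg.

```python
def n_free_kmers(read: str, k: int) -> int:
    """Count the k-mers in `read` that contain no ambiguous base (N)."""
    count = 0
    run = 0  # length of the current run of non-N bases
    for base in read.upper():
        run = 0 if base == "N" else run + 1
        if run >= k:
            # A k-mer ending at this base is entirely N-free.
            count += 1
    return count


K = 15  # assumed k-mer size, for illustration only

clean = "ACGT" * 12 + "AC"              # toy 50 bp read with no Ns
masked = clean[:25] + "N" + clean[26:]  # same read with one N in the middle

print(n_free_kmers(clean, K))   # 36
print(n_free_kmers(masked, K))  # 21
```

Under these assumptions a single central N removes 15 of the 36 possible k-mers from a 50 bp read. The number of minimizers is at most the number of usable k-mers and is normally far smaller, so the read has little chance of producing a cluster of more than 10 hits.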

So the question is (cc @iqbal-lab), do we:

1. Reduce the minimum cluster size for Illumina data (in drprg). From page 45 of Rachel's thesis:

   "When the minimum size of a cluster is set too low, we have more false positive local graphs identified as present in the dataset, and also have to handle more noise downstream when inferring a mosaic sequence and genotyping. When it is set too high, we have less sensitivity to discover loci that are present."

   For the purposes of drprg, we aren't concerned with false positive loci discovery - especially for MTB. So maybe something lower (like 5?) could be better?

2. Refuse to analyse samples with 50bp reads - this seems quite brutal, but also solves the issue (unless there are longer reads with lots of ambiguous bases).

iqbal-lab commented 1 year ago

Definitely refuse to analyse it!

mbhall88 commented 1 year ago

That does feel a bit sly though given mykrobe and tbprofiler produce good predictions for this sample...

iqbal-lab commented 1 year ago

Sorry, I don't mean reject the sample up front if it has a few short reads. But effectively ignoring short reads is fine IMO. Fine if Mykrobe and tbprofiler win on this one. The future is long reads; we shouldn't contort ourselves over tiny ones.
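One way "effectively ignoring short reads" could look in practice is a length filter applied before mapping. The sketch below is only an illustration, not how drprg or pandora handles this: `parse_fastq`, `drop_short_reads` and the `min_len` value are hypothetical, the thread does not settle on a threshold, and the parser assumes plain 4-line FASTQ records.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality)


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield records from 4-line FASTQ text (no line-wrapped sequences)."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between or after records
        seq = next(it).rstrip("\n")
        next(it)  # skip the '+' separator line
        qual = next(it).rstrip("\n")
        yield header.rstrip("\n"), seq, qual


def drop_short_reads(records: Iterable[Record], min_len: int) -> Iterator[Record]:
    """Keep only reads whose sequence has at least `min_len` bases."""
    return (rec for rec in records if len(rec[1]) >= min_len)


# Toy input: one 5 bp read and one 10 bp read.
fastq = [
    "@short\n", "ACGTN\n", "+\n", "IIIII\n",
    "@long\n", "ACGTACGTAC\n", "+\n", "IIIIIIIIII\n",
]

# min_len=8 is an arbitrary illustrative threshold, not a recommendation.
kept = list(drop_short_reads(parse_fastq(fastq), min_len=8))
print(kept)  # [('@long', 'ACGTACGTAC', 'IIIIIIIIII')]
```

A filter like this drops the short reads rather than rejecting the sample up front, so a sample with a few short reads is still analysed using whatever longer reads remain.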

mbhall88 / drprg

Run drprg #12