Closed mbhall88 closed 1 year ago
The tl;dr of https://github.com/rmcolq/pandora/issues/294 is that this sample has short reads (~50bp) and all have Ns in the middle. So we lose a lot of minimizers. The default minimum size of a cluster of hits in pandora is 10, and we basically never get more than that on a read for this sample (https://github.com/rmcolq/pandora/pull/295#issuecomment-1244883458 sums this up).
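To make the minimizer loss concrete, here is a rough sketch of how a single N in the middle of a 50bp read eats into the usable k-mers. It is a simplified model, not pandora's actual code: it assumes any k-mer containing an N is discarded, and it uses k=15 and w=14 purely as illustrative values (the thread does not state the parameters used). The helper names are hypothetical.

```python
def n_free_kmer_starts(read: str, k: int) -> list[int]:
    """Start positions of k-mers that contain no 'N' (assumed to be the only usable ones)."""
    read = read.upper()
    return [i for i in range(len(read) - k + 1) if "N" not in read[i:i + k]]


def count_full_windows(starts: list[int], w: int) -> int:
    """Number of windows made of w consecutive usable k-mer start positions."""
    usable = set(starts)
    return sum(1 for s in starts if all(s + j in usable for j in range(w)))


# Illustrative parameters (assumptions, not taken from the thread).
K, W = 15, 14

clean_read = "ACGT" * 12 + "AC"                      # 50bp, no Ns
masked_read = clean_read[:25] + "N" + clean_read[26:]  # same read, one N in the middle

for name, read in [("clean", clean_read), ("masked", masked_read)]:
    starts = n_free_kmer_starts(read, K)
    print(name, "usable k-mers:", len(starts),
          "full windows:", count_full_windows(starts, W))
```

Under these assumptions the single N removes every k-mer that overlaps it, and neither remaining flank is long enough to fill a complete window, which is consistent with such reads rarely collecting enough hits to pass the cluster-size threshold.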
So the question is (cc @iqbal-lab), do we
drprg). From page 45 of Rachel's thesis:
When the minimum size of a cluster is set too low, we have more false positive local graphs identified as present in the dataset, and also have to handle more noise downstream when inferring a mosaic sequence and genotyping. When it is set too high, we have less sensitivity to discover loci that are present.
For the purposes of drprg, we aren't concerned with false positive loci discovery, especially for MTB. So maybe something lower (like 5?) could be better?
Definitely refuse to analyse it!
That does feel a bit sly though given mykrobe and tbprofiler produce good predictions for this sample...
Sorry, I don't mean reject the sample up front if it has a few short reads. But effectively ignoring short reads is fine IMO. Fine if Mykrobe and tbprofiler win on this one. The future is long reads, we shouldn't contort ourselves over tiny ones
Run drprg on all samples. This is nearly sorted, but there is one sample in the initial testing set that fails due to https://github.com/rmcolq/pandora/issues/294.