Clarify usage of pandora compare

mbhall88 commented 4 years ago

This issue is for discussing the role of pandora compare in this paper (if it does play a role?).

@iqbal-lab will we also run the samples through compare? If so, what questions, specifically, are we trying to answer with it? Is it the same as the per-sample, but just trying to see if compare gives better results?

iqbal-lab commented 4 years ago

Yes, for the distance calculation

mbhall88 commented 4 years ago

So we expect that the distance between samples would be different using compare?

iqbal-lab commented 4 years ago

Yes, we had a long discussion about this. That's why compass uses gVCF. Our equivalent is compare.

mbhall88 commented 4 years ago

> Yes, we had a long discussion about this. That's why compass uses gVCF. Our equivalent is compare.

Could I ask you to elaborate? I appreciate we had a long discussion but the reason I made this issue was to get all of our discussions down on paper as they always leak out of my brain. And I'm struggling to come up with the reason we expect compare to produce better clusters myself... :sweat:

iqbal-lab commented 4 years ago

If I have two samples, each with a normal (ALT calls only) VCF, then I end up with some SNPs common to both, and some only in one or the other. When I calculate the distance between them, it's easy to process the SNPs they have in common. But what do I do with the SNP at position 1050, where I have a record in sample 1 but not sample 2? Either I treat sample 2 as REF here, or as NULL (i.e. ignore the position). Both of those choices impact the genetic distance one way or the other. Oxford's solution is to use a gVCF, so that a call is made at every position. This also has the benefit that you can process each sample independently and get an output that allows comparison. (There is a dirty underbelly: gVCFs cause problems at tricky places where there are multiallelics, overlapping variants, etc.)
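To make the REF-versus-NULL choice concrete, here is a minimal sketch. It is not how compass or pandora compute distance. The `snp_distance` function, the `missing` parameter, and the sample dictionaries are all made up for illustration. Each sample is modelled as a sparse map from position to ALT allele, the way an ALT-only VCF would present it.

```python
from typing import Dict

# Hypothetical sparse call sets: position -> called ALT allele.
# A position absent from the dict had no record in that sample's VCF.
Calls = Dict[int, str]


def snp_distance(a: Calls, b: Calls, missing: str = "ref") -> int:
    """Count differing sites between two ALT-only call sets.

    missing="ref":  a position absent from one sample is assumed REF there,
                    so it counts as a difference against the other's ALT.
    missing="null": a position absent from one sample is skipped entirely.
    """
    if missing not in ("ref", "null"):
        raise ValueError("missing must be 'ref' or 'null'")
    distance = 0
    for pos in set(a) | set(b):
        if pos in a and pos in b:
            # Record in both samples: compare the called alleles directly.
            if a[pos] != b[pos]:
                distance += 1
        elif missing == "ref":
            # Record in only one sample: ALT versus assumed REF.
            distance += 1
    return distance


# Illustrative data only; position 1050 has a record in sample 1 but not sample 2.
sample1 = {200: "A", 1050: "T", 3000: "G"}
sample2 = {200: "A", 3000: "C", 4100: "T"}

dist_ref = snp_distance(sample1, sample2, missing="ref")
dist_null = snp_distance(sample1, sample2, missing="null")
print(dist_ref, dist_null)
```

On this toy input the two policies disagree (3 versus 1), which is the problem a call at every position is meant to avoid.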

Pandora has two ways of getting comparable VCFs across all samples:

1. Don't do de novo, and then all VCFs are the same; they just represent the graph. This will work of course, but will miss outbreak-specific SNPs.
2. Do de novo, and use compare.

(Note the third option, of doing de novo, not using compare, but calling REF or NULL for SNPs that are in one VCF but not another, will always ignore the de novo SNPs, so they'll never contribute to the distance.)

We have to run compare, for this reason. The incremental benefit of compare is reduced if you have a bigger/denser PRG of course because there is less for de novo to find. But we have to run it and see - we need the data. If it turns out, with data in hand, that you can use pandora map VCFs with no de novo and get pretty good results, you can try and make the case that those results are good enough and ignoring compare makes pandora a lot more lightweight/easy. One option would be to make a pretty dense PRG, and then argue that distance purely using PRG SNPs is a lower bound on the true distance (the PRG misses some SNPs, which might increase the true distance), so that if the PRG-based distance was >12 (or whatever) then the true distance is definitely >12. It's a good argument. But we should run both and compare results.
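The lower-bound argument can be illustrated with a small sketch. It assumes both samples are genotyped at the same set of sites and that the calls at PRG sites are correct. The `distance_over_sites` helper, the site lists, and the genotypes are hypothetical and not taken from pandora.

```python
from typing import Dict, Iterable

# Hypothetical dense genotypes: position -> allele, defined at every variable site.
Genotypes = Dict[int, str]


def distance_over_sites(a: Genotypes, b: Genotypes, sites: Iterable[int]) -> int:
    """Count sites in `sites` where the two samples carry different alleles."""
    return sum(1 for pos in sites if a[pos] != b[pos])


# Illustrative data only. Position 1050 stands in for a SNP missing from the PRG.
sample1 = {100: "A", 250: "C", 900: "G", 1050: "T", 4000: "A"}
sample2 = {100: "A", 250: "T", 900: "G", 1050: "C", 4000: "G"}

all_sites = sorted(sample1)
prg_sites = [100, 250, 900, 4000]  # subset of all_sites represented in the PRG

prg_distance = distance_over_sites(sample1, sample2, prg_sites)
true_distance = distance_over_sites(sample1, sample2, all_sites)

# Restricting to a subset of sites can only drop differences, never add them.
assert prg_distance <= true_distance
print(prg_distance, true_distance)
```

Because the PRG sites are a subset of all variable sites, any threshold exceeded by the PRG-based distance is also exceeded by the true distance, under the assumptions stated above.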

One last point: with real nanopore data, we will often have to contend with lower coverage, because of multiplexing, low load, etc. Compare allows you to recover SNP calls that you might not otherwise make because of coverage drops. This will make a difference on 10x, 20x, and probably also 30x genomes.

mbhall88 / head_to_head_pipeline

Clarify usage of pandora compare #50