Closed ihodes closed 9 years ago
So I actually think we should get rid of this feature soon/replace it with a better, clearer concordance system. This wasn't really a bug (strictly speaking, I'd call it a bug); the two counts of true positives were both somewhat valid—one looked for a match on the sample name, as well, which prevents duplicate counts of true positives on multi-sample VCF comparisons. I've switched the other join to also only count true positives with the same sample name, for now. Ideally, we'll improve our comparison system a bit, but that'll take a lot of work (@jaclynperrone & I chatted about this some more today).
I didn't include a test, because I think this whole thing needs to be changed significantly. Will leave it to the reviewer if you think it needs a regression test anyway!.
This looks good to me. I think as long as we make it clear how we calculate those values in the user documentation, either way should be fine. Reading your last comment, I thought that you changed another join to start matching with sample names, but this change set features a change in a single join statement. So just to make sure: you didn't forget to push another change out, right?
Also, just out of curiosity: how does the comparison numbers look after this change? Do we have 3 true positives and 11 false positives?
Ah sorry, I wasn't clear. Here (https://github.com/hammerlab/cycledash/blob/master/cycledash/api/genotypes.py#L170) we already join on sample_name as well, so I just made sure both joins join on the sample_name in addition to the other fields.
thanks for the clarification! Good to go, then :+1:
Also, just out of curiosity: how does the comparison numbers look after this change? Do we have 3 true positives and 11 false positives?
0 true positives; as the VCF being compared to doesn't have a sample with the same name as the examined VCF (this is why this is a problematic way of calculating true/false/pos/negs.) Needs a serious rework in light of how this feature is likely to be used; we're punting on it until we know who needs to be using this feature next.
This means we don't use the sample name to test concordance, possibly leading to weird numbers when looking at multi-sample VCFs.
fixes #822