jmonlong / sveval

Functions to compare a SV call sets against a truth set.
MIT License
25 stars 6 forks source link

Could sveval be less strict on variation in insertion/ref calls #11

Open glennhickey opened 4 years ago

glennhickey commented 4 years ago

If I have I compare a 0/1 insertion of TTTTTT with 0/1 TTTTTTA sveval will call it a match because the sequences are almost the same.

But if I have a 1/2 insertion of TTTTTT, TTTTTTA and compare that to a 1/1 insertion of TTTTTT then sveval will call it an error.

Similarly, I'm seeing cases of something like of a 0/1 TTTTTTTT -> T deletion being called wrong when its compared to something to something like a 1/2 TTTTTTT ->T,TTTTAT` (ie the correct SV deletion with a SNP on the other haplotype.

Would it be possible to treat all these cases equivalently leniently? I think these types of things are why the new caller (designed for augmented grpahs, and much more sensitive to calling small variants) https://github.com/vgteam/vg/pull/2641 is getting worse performance on the usual SV benchmarks. `

jmonlong commented 4 years ago

I'm surprised the insertion example didn't work, even in "genotype" mode. When using all the recommended parameters, the alleles are supposed to be split but then they should be merged back together into an hom insertion. Might be a problem with the hets merging, I'll investigate.

For the deletions, I can imagine that the situation you describe doesn't work yet. After splitting the alleles, one should be filtered out because not really a large deletion (TTTTTTT -> TTTTAT) but the size filtering comes at the end. Also the code is written with simple DEL/INS in mind, not multi-bases like this. I'll update the code to take the size into account when comparing deletions instead of the region they span. That should help a bit.

In any case if you have call/truth VCFs with these kind of calls, could you send it to me?

glennhickey commented 4 years ago

I've put two vcfs on courtyard here /public/home/hickey/dev/work/compare

here's a 0/1 deletion that gets counted as an error

bcftools view HG002_15_aug.q5.decon.vcf.gz -Hr 15:24107723

15  24107723    75760639_73182181   ATATATATATTATATATATATATATATATATAATATATATATATAAAATATATATTTTATATATATATATATATAATATATATATATAAAATATATATATTATATATATATATTATATATATATATATAAAAAATATATATTTTACAGTGGTGC  A,ATATATATATTATATATATATATATATATAATATATATATATAAAATATATATTTTATATATATATATAATATATATATATAAAATATATATATTATATATATATATAATATATATATATATATAAAAAATATATATTTTACAGTGGTGC    894.882 PASS    DP=39   GT:DP:AD:GL:GQ:GP:XD:MAD    1/2:39:0,21,18:-90.95,-38.2007,-44.1942,-32.4246,-2.23981,-39.5074:5:-5.73203:29.2005:18

bcftools view giab-15.vcf.gz -Hr 15:24107723

15  24107723    HG2_Ill_GATKHCSBGrefine_10102   ATATATATATTATATATATATATATATATATAATATATATATATAAAATATATATTTTATATATATATATATATAATATATATATATAAAATATATATATTATATATATATATTATATATATATATATAAAAAATATATATTTTACAGTGGTGC  A20 PASS    AC=4;AF=0.666667;AN=6;NS=3  GT:GTcons1:PB_GT:PB_REF:PB_ALT:PBHP_GT:PB_REF_HP1:PB_ALT_HP1:PB_REF_HP2:PB_ALT_HP2:10X_GT:10X_REF_HP1:10X_ALT_HP1:10X_REF_HP2:10X_ALT_HP2:ILL250bp_GT:ILL250bp_REF:ILL250bp_ALT:ILLMP_GT:ILLMP_REF:ILLMP_ALT:BNG_LEN_DEL:BNG_LEN_INS:nabsys_svm 0/1:0/1:0/1:46:44:0|1:42:4:1:40:1|0:0:23:17:0:0/1:28:16:./.:.:.:.:.:.   1/1:1/1:1/1:1:47:./.:0:2:0:6:1/1:0:12:0:11:1/1:0:37:./.:.:.:.:.:.   0/1:0/1:0/1:15:15:0|1:15:0:0:15:1|0:0:21:5:0:0/1:33:17:./.:.:.:.:.:.

likewise for insertion

bcftools view HG002_15_aug.q5.decon.vcf.gz -Hr 15:94933278
bcftools view HG002_15_aug.q5.decon.vcf.gz -Hr 15:94933278

I could not find an example of slightly different homozygous insertions getting counted wrong, so please disregard that complaint -- sorry!

jmonlong commented 4 years ago

Thanks for the VCFs, it helped a lot when exploring the behavior in these types of VCF. I used it to debug a new way of matching variants when testing for exact genotype (using bipartite clustering).

I just merged the PR #12 and it addresses the situations you described (as far as I could see). Let me know if you see something else that's not working as we'd want.

glennhickey commented 4 years ago

I've been relying entirely on Docker for running sveval. Do you mind making an image for this version? The latest one I'm seeing on https://hub.docker.com/r/jmonlong/sveval/tags is from 3 months ago. Thanks!