Open glennhickey opened 4 years ago
I'm surprised the insertion example didn't work, even in "genotype" mode. When using all the recommended parameters, the alleles are supposed to be split but then they should be merged back together into an hom insertion. Might be a problem with the hets merging, I'll investigate.
For the deletions, I can imagine that the situation you describe doesn't work yet. After splitting the alleles, one should be filtered out because not really a large deletion (TTTTTTT -> TTTTAT
) but the size filtering comes at the end. Also the code is written with simple DEL/INS in mind, not multi-bases like this. I'll update the code to take the size into account when comparing deletions instead of the region they span. That should help a bit.
In any case if you have call/truth VCFs with these kind of calls, could you send it to me?
I've put two vcfs on courtyard here /public/home/hickey/dev/work/compare
here's a 0/1 deletion that gets counted as an error
bcftools view HG002_15_aug.q5.decon.vcf.gz -Hr 15:24107723
15 24107723 75760639_73182181 ATATATATATTATATATATATATATATATATAATATATATATATAAAATATATATTTTATATATATATATATATAATATATATATATAAAATATATATATTATATATATATATTATATATATATATATAAAAAATATATATTTTACAGTGGTGC A,ATATATATATTATATATATATATATATATAATATATATATATAAAATATATATTTTATATATATATATAATATATATATATAAAATATATATATTATATATATATATAATATATATATATATATAAAAAATATATATTTTACAGTGGTGC 894.882 PASS DP=39 GT:DP:AD:GL:GQ:GP:XD:MAD 1/2:39:0,21,18:-90.95,-38.2007,-44.1942,-32.4246,-2.23981,-39.5074:5:-5.73203:29.2005:18
bcftools view giab-15.vcf.gz -Hr 15:24107723
15 24107723 HG2_Ill_GATKHCSBGrefine_10102 ATATATATATTATATATATATATATATATATAATATATATATATAAAATATATATTTTATATATATATATATATAATATATATATATAAAATATATATATTATATATATATATTATATATATATATATAAAAAATATATATTTTACAGTGGTGC A20 PASS AC=4;AF=0.666667;AN=6;NS=3 GT:GTcons1:PB_GT:PB_REF:PB_ALT:PBHP_GT:PB_REF_HP1:PB_ALT_HP1:PB_REF_HP2:PB_ALT_HP2:10X_GT:10X_REF_HP1:10X_ALT_HP1:10X_REF_HP2:10X_ALT_HP2:ILL250bp_GT:ILL250bp_REF:ILL250bp_ALT:ILLMP_GT:ILLMP_REF:ILLMP_ALT:BNG_LEN_DEL:BNG_LEN_INS:nabsys_svm 0/1:0/1:0/1:46:44:0|1:42:4:1:40:1|0:0:23:17:0:0/1:28:16:./.:.:.:.:.:. 1/1:1/1:1/1:1:47:./.:0:2:0:6:1/1:0:12:0:11:1/1:0:37:./.:.:.:.:.:. 0/1:0/1:0/1:15:15:0|1:15:0:0:15:1|0:0:21:5:0:0/1:33:17:./.:.:.:.:.:.
likewise for insertion
bcftools view HG002_15_aug.q5.decon.vcf.gz -Hr 15:94933278
bcftools view HG002_15_aug.q5.decon.vcf.gz -Hr 15:94933278
I could not find an example of slightly different homozygous insertions getting counted wrong, so please disregard that complaint -- sorry!
Thanks for the VCFs, it helped a lot when exploring the behavior in these types of VCF. I used it to debug a new way of matching variants when testing for exact genotype (using bipartite clustering).
I just merged the PR #12 and it addresses the situations you described (as far as I could see). Let me know if you see something else that's not working as we'd want.
I've been relying entirely on Docker for running sveval. Do you mind making an image for this version? The latest one I'm seeing on https://hub.docker.com/r/jmonlong/sveval/tags is from 3 months ago. Thanks!
If I have I compare a 0/1 insertion of
TTTTTT
with 0/1TTTTTTA
sveval will call it a match because the sequences are almost the same.But if I have a 1/2 insertion of
TTTTTT
,TTTTTTA
and compare that to a 1/1 insertion ofTTTTTT
then sveval will call it an error.Similarly, I'm seeing cases of something like of a 0/1
TTTTTTTT -> T
deletion being called wrong when its compared to something to something like a 1/2TTTTTTT ->
T,TTTTAT` (ie the correct SV deletion with a SNP on the other haplotype.Would it be possible to treat all these cases equivalently leniently? I think these types of things are why the new caller (designed for augmented grpahs, and much more sensitive to calling small variants) https://github.com/vgteam/vg/pull/2641 is getting worse performance on the usual SV benchmarks. `