[Open] LiangdeLI opened this issue 2 years ago
I'd guess it's something to do with automatic filtering in plink or something to do with missing data. How different are the results? If sgkit and hail are agreeing, then I'd be happy enough that we're doing something sensible, so unless you'd like to do more diagnostics here I think we can close the issue?
> How different are the results?
Here are some examples, with the values on each line ordered sgkit, Hail, plink:
0.00039936 0.00039936102236421724 0.001599
0.19388978 0.19388977635782748 0.321
0.18150958 0.18150958466453673 0.3095
0.0009984 0.000998402555910543 0.01401
0.0009984 0.000998402555910543 0.002201
0.00079872 0.0007987220447284345 0.3536
0.19848243 0.19848242811501599 0.3886
0.04792332 0.04792332268370607 0.3636
0.00339457 0.0033945686900958465 0.0768
0.110623 0.11062300319488817 0.323
0.09085463 0.09085463258785942 0.09212
0.00039936 0.00039936102236421724 0.01759
0.00119808 0.0011980830670926517 0.258
0.39157348 0.391573482428115 0.4239
0.16653355 0.1665335463258786 0.1847
0.32647764 0.3264776357827476 0.3787
0.04992013 0.04992012779552716 0.06501
0.21964856 0.21964856230031948 0.4947
0.00039936 0.00039936102236421724 0.3054
0.05391374 0.05391373801916933 0.0607
0.02875399 0.02875399361022364 0.4343
0.00139776 0.0013977635782747603 0.1406
0.00039936 0.00039936102236421724 0.003197
0.00319489 0.003194888178913738 0.1355
0.00119808 0.0011980830670926517 0.002402
0.00758786 0.0075878594249201275 0.01764
0.00079872 0.0007987220447284345 0.0022
> unless you'd like to do more diagnostics here
Some diagnostics are probably necessary for my paper on this benchmark, because I need to show that the results of these tools agree for each method.
Eric suggested half-missing calls in the last meeting. I tried the Perl program solution in that post, and since the 1000 Genomes VCF uses "0|0" instead of "0/0" to represent calls, I also tried a "|" version of that Perl command. So far neither seems to work. I have written a program to scan chr21.vcf to check whether there actually are half-missing calls in it.
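A minimal sketch of such a scan (the file layout follows the standard VCF format with GT as the first FORMAT field; the regex and function name are illustrative, not the actual program mentioned above):

```python
# Hypothetical sketch: scan VCF lines for "half-missing" genotype calls,
# i.e. calls where exactly one allele is "." (e.g. "./0" or ".|1"),
# covering both the unphased "/" and phased "|" separators.
import re

# One side is ".", the other an allele index; anchored to the whole GT value.
HALF_CALL = re.compile(r"^(\.[/|][0-9]+|[0-9]+[/|]\.)$")

def count_half_calls(vcf_lines):
    """Count half-missing GT calls across all samples in an iterable of VCF lines."""
    n = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        for sample in fields[9:]:       # sample columns start at index 9
            gt = sample.split(":")[0]   # GT is conventionally the first FORMAT field
            if HALF_CALL.match(gt):
                n += 1
    return n
```

For example, `count_half_calls(open("chr21.vcf"))` would return 0 if the file contains no half-missing calls at all, which would rule out this explanation.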
I will probably put a bit more effort into exploring the problem here. The plan, then, is to find those 3350 variants with differing AFs and explore what is special about their original VCF records.
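The first step of that plan can be sketched as follows (array names are illustrative; the actual AF arrays would come from the three tools' outputs):

```python
# Hypothetical sketch: given per-variant allele frequencies from two tools,
# return the indices of variants where they disagree beyond floating-point
# tolerance. Those indices can then be used to look up the corresponding
# records in the original VCF for inspection.
import numpy as np

def differing_variant_indices(af_a, af_b, rtol=1e-5, atol=1e-8):
    """Indices where the two AF arrays disagree beyond tolerance."""
    af_a = np.asarray(af_a)
    af_b = np.asarray(af_b)
    return np.flatnonzero(~np.isclose(af_a, af_b, rtol=rtol, atol=atol))
```

Applied to the sgkit and plink AF arrays for chr21, this should return about 3350 indices.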
I doubt there's anything interesting going on @LiangdeLI, and definitely not from the perspective of presenting benchmark results. Plink is doing something very slightly different on a tiny fraction of the sites, and this won't affect timings in any way at all. Even if we figured out what plink is doing, I doubt we'd be able to make it do the same things as sgkit/Hail.
I would just note in the text when reporting these results that plink computed slightly different values to sgkit/Hail at 0.3% of sites, and leave it at that.
Sure, thanks @jeromekelleher! That makes sense. I will stop exploring this problem here and just briefly mention this small difference in the paper.
FWIW I’m pretty interested in the difference! I’d prefer to keep this open and figure it out, even if it’s not relevant for the benchmarking exercise.
> Eric suggested half missing calls in the last meeting. I tried the Perl program solution in that post. And it seems that the 1000 genome vcf using "0|0" instead of "0/0" to represent calls
By half call, I mean situations like "./0" (the "." means that allele is missing). The "|" instead of "/" just indicates phasing. Hail's import_vcf seems to default to that now btw (i.e. the "PGT" VCF field instead of "GT"), but I doubt that's the source of the problem since sgkit uses GT and the two agree. Either way, you probably want to make sure Hail is using GT in the future.
> I mean situations like "./0" actually (the "." means the call is missing).
Yeah, I know. The solution in the biostars link you gave me changes "./0" or "0/." to "./.". I was just saying that I also tried the "|" version, since the 1000 Genomes VCF uses it. Thanks @eric-czech for the extra explanation of how Hail and sgkit import VCFs.
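For reference, that normalisation (the biostars Perl one-liner, extended to the phased "|" separator) can be sketched in Python like this; the function name is illustrative, and the sketch assumes diploid GT values:

```python
# Hypothetical sketch: rewrite any half-missing diploid call such as
# "./0", "0/.", ".|1" or "1|." to a fully missing call, preserving the
# original separator. Fully called and fully missing GTs pass through.
def normalise_half_calls(gt):
    sep = "/" if "/" in gt else "|"
    left, right = gt.split(sep)
    if (left == ".") != (right == "."):  # exactly one side is missing
        return f".{sep}."
    return gt
```

So `normalise_half_calls("./0")` gives `"./."` and `normalise_half_calls(".|1")` gives `".|."`, while `"0|0"` is returned unchanged.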
@hammer I'm happy to explore this once I have a bit more free time after the school applications and the paper, unless it becomes urgent.
I'm writing code to verify that the benchmark results are the same for the three tools, and found this problem.
The allele frequency computation is run on chromosome 21, which has 1105538 variants. I found that sgkit and Hail produce the same results, but 3350 of them differ from plink's, so 0.3% of the results disagree. I have ruled out precision as the cause.
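Precision can be ruled out with a quick tolerance check; a minimal sketch using the first row of the table above (values copied from that row):

```python
# Sketch: sgkit prints fewer digits than Hail, so a small absolute
# tolerance absorbs the printed rounding. plink's value differs by far
# more than any rounding could explain.
import math

sgkit_af = 0.00039936             # sgkit's printed value
hail_af = 0.00039936102236421724  # Hail's full-precision value
plink_af = 0.001599               # plink's value

# sgkit and Hail agree to within printing precision...
assert math.isclose(sgkit_af, hail_af, abs_tol=1e-8)
# ...while plink differs by ~1.2e-3, orders of magnitude above rounding error.
assert abs(sgkit_af - plink_af) > 1e-4
```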
I remember being told that correctness had been tested and verified; did you test against plink?