[Open] LiangdeLI opened this issue 2 years ago
I'd guess it's something to do with automatic filtering in plink or something to do with missing data. How different are the results? If sgkit and hail are agreeing, then I'd be happy enough that we're doing something sensible, so unless you'd like to do more diagnostics here I think we can close the issue?
> How different are the results?
Here are some examples, with the values on each line ordered sgkit, Hail, plink:
0.00039936 0.00039936102236421724 0.001599
0.19388978 0.19388977635782748 0.321
0.18150958 0.18150958466453673 0.3095
0.0009984 0.000998402555910543 0.01401
0.0009984 0.000998402555910543 0.002201
0.00079872 0.0007987220447284345 0.3536
0.19848243 0.19848242811501599 0.3886
0.04792332 0.04792332268370607 0.3636
0.00339457 0.0033945686900958465 0.0768
0.110623 0.11062300319488817 0.323
0.09085463 0.09085463258785942 0.09212
0.00039936 0.00039936102236421724 0.01759
0.00119808 0.0011980830670926517 0.258
0.39157348 0.391573482428115 0.4239
0.16653355 0.1665335463258786 0.1847
0.32647764 0.3264776357827476 0.3787
0.04992013 0.04992012779552716 0.06501
0.21964856 0.21964856230031948 0.4947
0.00039936 0.00039936102236421724 0.3054
0.05391374 0.05391373801916933 0.0607
0.02875399 0.02875399361022364 0.4343
0.00139776 0.0013977635782747603 0.1406
0.00039936 0.00039936102236421724 0.003197
0.00319489 0.003194888178913738 0.1355
0.00119808 0.0011980830670926517 0.002402
0.00758786 0.0075878594249201275 0.01764
0.00079872 0.0007987220447284345 0.0022
> unless you'd like to do more diagnostics here
Some diagnostics are probably necessary for my paper on this benchmark, because I need to show that the results of these tools agree for each method.
Eric suggested half-missing calls in the last meeting. I tried the Perl program solution in that post, and since the 1000 Genomes VCF uses "0|0" instead of "0/0" to represent calls, I also tried a "|" version of that Perl command. So far neither seems to work. I have written a program to scan chr21.vcf to check whether there actually are half-missing calls in it.
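A minimal sketch of such a scan (the file layout follows the standard VCF format with GT as the first FORMAT field; the regex and function name are illustrative, not the actual program mentioned above):

```python
# Hypothetical sketch: scan VCF lines for "half-missing" genotype calls,
# i.e. calls where exactly one allele is "." (e.g. "./0" or ".|1"),
# covering both the unphased "/" and phased "|" separators.
import re

# One side is ".", the other an allele index; anchored to the whole GT value.
HALF_CALL = re.compile(r"^(\.[/|][0-9]+|[0-9]+[/|]\.)$")

def count_half_calls(vcf_lines):
    """Count half-missing GT calls across all samples in an iterable of VCF lines."""
    n = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        for sample in fields[9:]:       # sample columns start at index 9
            gt = sample.split(":")[0]   # GT is conventionally the first FORMAT field
            if HALF_CALL.match(gt):
                n += 1
    return n
```

For example, `count_half_calls(open("chr21.vcf"))` would return 0 if the file contains no half-missing calls at all, which would rule out this explanation.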
I will probably put a bit more effort into exploring the problem here. The plan, then, is to find those 3350 variants with differing AFs and explore what is special about their original VCF records.
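The first step of that plan can be sketched as follows (array names are illustrative; the actual AF arrays would come from the three tools' outputs):

```python
# Hypothetical sketch: given per-variant allele frequencies from two tools,
# return the indices of variants where they disagree beyond floating-point
# tolerance. Those indices can then be used to look up the corresponding
# records in the original VCF for inspection.
import numpy as np

def differing_variant_indices(af_a, af_b, rtol=1e-5, atol=1e-8):
    """Indices where the two AF arrays disagree beyond tolerance."""
    af_a = np.asarray(af_a)
    af_b = np.asarray(af_b)
    return np.flatnonzero(~np.isclose(af_a, af_b, rtol=rtol, atol=atol))
```

Applied to the sgkit and plink AF arrays for chr21, this should return about 3350 indices.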
I doubt there's anything interesting going on @LiangdeLI, and definitely not from the perspective of presenting benchmark results. Plink is doing something very slightly different on a tiny fraction of the sites, and this won't affect timings in any way at all. Even if we figured out what plink is doing, I doubt we'd be able to make it do the same things as sgkit/Hail.
I would just note in the text when reporting these results that plink computed slightly different values to sgkit/Hail at 0.3% of sites, and leave it at that.
Sure, thanks @jeromekelleher! That makes sense. I will stop exploring this problem here and just briefly mention this small difference in the paper.
FWIW I’m pretty interested in the difference! I’d prefer to keep this open and figure it out, even if it’s not relevant for the benchmarking exercise.
> Eric suggested half missing calls in the last meeting. I tried the Perl program solution in that post. And it seems that the 1000 genome vcf using "0|0" instead of "0/0" to represent calls
By half call, I mean situations like "./0" (the "." means that allele is missing). The "|" instead of "/" just indicates phasing. Hail's import_vcf seems to default to that now btw (i.e. the "PGT" VCF field instead of "GT"), but I doubt that's the source of the problem since sgkit uses GT and the two agree. Either way, you probably want to make sure Hail is using GT in the future.
> I mean situations like "./0" actually (the "." means the call is missing).
Yeah, I know. The solution in the biostars link you gave me changes "./0" or "0/." to "./.". I was just saying that I also tried the "|" version, since the 1000 Genomes VCF uses it. Thanks @eric-czech for the extra explanation of how Hail and sgkit import VCFs.
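For reference, that normalisation (the biostars Perl one-liner, extended to the phased "|" separator) can be sketched in Python like this; the function name is illustrative, and the sketch assumes diploid GT values:

```python
# Hypothetical sketch: rewrite any half-missing diploid call such as
# "./0", "0/.", ".|1" or "1|." to a fully missing call, preserving the
# original separator. Fully called and fully missing GTs pass through.
def normalise_half_calls(gt):
    sep = "/" if "/" in gt else "|"
    left, right = gt.split(sep)
    if (left == ".") != (right == "."):  # exactly one side is missing
        return f".{sep}."
    return gt
```

So `normalise_half_calls("./0")` gives `"./."` and `normalise_half_calls(".|1")` gives `".|."`, while `"0|0"` is returned unchanged.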
@hammer I'm happy to explore this once I have a bit more free time after the school applications and the paper, unless it becomes urgent.
I'm writing code to verify that the benchmark results are the same for the three tools, and found this problem.
The allele frequency computation is run on chromosome 21, which has 1105538 variants. I found that sgkit and Hail produce the same results, but 3350 of them differ from plink's, so 0.3% of the results disagree. I have ruled out precision as the cause.
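Precision can be ruled out with a quick tolerance check; a minimal sketch using the first row of the table above (values copied from that row):

```python
# Sketch: sgkit prints fewer digits than Hail, so a small absolute
# tolerance absorbs the printed rounding. plink's value differs by far
# more than any rounding could explain.
import math

sgkit_af = 0.00039936             # sgkit's printed value
hail_af = 0.00039936102236421724  # Hail's full-precision value
plink_af = 0.001599               # plink's value

# sgkit and Hail agree to within printing precision...
assert math.isclose(sgkit_af, hail_af, abs_tol=1e-8)
# ...while plink differs by ~1.2e-3, orders of magnitude above rounding error.
assert abs(sgkit_af - plink_af) > 1e-4
```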
I remember being told that correctness had been tested and verified; did you test against plink?