tskit-dev / tskit

Population-scale genomics
MIT License
147 stars 70 forks source link

Should 2-locus ratio statistics return NaN on 0/0? #2907

Closed lkirk closed 4 months ago

lkirk commented 5 months ago

When we encounter situations in $r$, $r^2$, and $D^{\prime}$ where we have $\frac{0}{0}$, should we return $0$ or NaN? @apragsdale brings up a good point: What is the limit behavior as we approach $\frac{0}{0}$?

Let's follow up on this to ensure correctness of the API.

jeromekelleher commented 4 months ago

I think returning NaN is likely the simplest by just letting the IEEE double mechanisms handle the division. That's the approach we took the the one-locus stats api. See here and here for how we did this in the Python examples to avoid division by zero warnings.

lkirk commented 4 months ago

I talked to @apragsdale and we want to move forward with returning NaNs.

There's one lingering issue in my head: Technically dividing by zero in C (according to the c99 standard) is undefined behavior (see section 6.5.5.5 in this document -- quoted below):

The result of the / operator is the quotient from the division of the first operand by the second; the result of the % operator is the remainder. In both operations, if the value of the second operand is zero, the behavior is undefined.

We can observe that the behavior is only slightly different in different mainstream compilers, I was testing to see if hardcoded vs dynamic values would be any different... though I wonder if the string conversion is doing something here... anyways, maybe a bit too far in the weeds, I just wanted to point out that it's technically UB.

lkirk@skittles ~/repo/tskit/python/divzero-tests (git)-[ld-matrix-nan-ratio-statistics] % cat main.c
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    printf("stat 1/0: %f\n", 1.0 / 0.0);
    printf("stat 0/0: %f\n", 0.0 / 0.0);
    if (argc == 2) {
        printf("dyn  1/0: %f\n", 1.0 / atof(argv[1]));
        printf("dyn  0/0: %f\n", 0.0 / atof(argv[1]));
    }
}
lkirk@skittles ~/repo/tskit/python/divzero-tests (git)-[ld-matrix-nan-ratio-statistics] % gcc -Wall -Wextra -Werror -Wpedantic -W -Wmissing-prototypes -Wstrict-prototypes -Wconversion -Wshadow -Wpointer-arith -Wcast-align -Wcast-qual -Wwrite-strings -Wnested-externs -fshort-enums -fno-common main.c -o gcc
lkirk@skittles ~/repo/tskit/python/divzero-tests (git)-[ld-matrix-nan-ratio-statistics] % clang -Wall -Wextra -Werror -Wpedantic -W -Wmissing-prototypes -Wstrict-prototypes -Wconversion -Wshadow -Wpointer-arith -Wcast-align -Wcast-qual -Wwrite-strings -Wnested-externs -fshort-enums -fno-common main.c -o clang
lkirk@skittles ~/repo/tskit/python/divzero-tests (git)-[ld-matrix-nan-ratio-statistics] % ./gcc 0                              
stat 1/0: inf
stat 0/0: -nan
dyn  1/0: inf
dyn  0/0: -nan
lkirk@skittles ~/repo/tskit/python/divzero-tests (git)-[ld-matrix-nan-ratio-statistics] % ./clang 0                            
stat 1/0: inf
stat 0/0: nan
dyn  1/0: inf
dyn  0/0: -nan
lkirk@skittles ~/repo/tskit/python/divzero-tests (git)-[ld-matrix-nan-ratio-statistics] % python -c 'import numpy as np;print("1/0", np.float64(1)/np.float64(0));print("0/0", np.float64(0)/np.float64(0))'                                       
<string>:1: RuntimeWarning: divide by zero encountered in scalar divide
1/0 inf
<string>:1: RuntimeWarning: invalid value encountered in scalar divide
0/0 nan

Though, it appears when the -nan and nan values get passed into numpy, the sign is stripped off, so these compilers essentially implement the same thing.

lkirk commented 4 months ago

I think what I'm suggesting is that the code explicitly return NaN values instead of relying on the compiler.

However, I think that the rest of the stats C code does division without explicitly returning NaN values, so I'm happy to just do that.

jeromekelleher commented 4 months ago

That's interesting - I'm pretty sure IEEE 754 defines division by zero behaviour, so I thought it was the hardware that decided this rather than compiler. I do remember having some different results on old 32 bit hardware without a proper FPU.

We fretted quite a bit about this IIRC and did try to avoid division by zeros initially. However, it turns out to be surprisingly tricky to avoid completely and you end up with quite subtle and complicated numerical code around the special cases. So, the cure is worse than the disease here I think - in practise the majority of platforms will return NaN as required, and we have less code to maintain.