stdlib-js / stdlib

✨ Standard library for JavaScript and Node.js. ✨
https://stdlib.io
Apache License 2.0

Hypergeometric cdf function doesn't match R after many decimal points? #288

Closed vincerubinetti closed 4 years ago

vincerubinetti commented 4 years ago

Take the following javascript code:

// https://stdlib.io/docs/api/v0.0.90/@stdlib/stats/base/dists/hypergeometric/cdf
import cdf from '@stdlib/stats/base/dists/hypergeometric/cdf';

// the stdlib cdf function calculates P(X <= x), but what we really want is
// P(X >= k). therefore, pass k - 1 as x to get P(X <= k - 1) = P(X < k), then
// subtract the result from 1 to get P(X >= k).
// parameters: k = observed successes, K = successes in the population,
// n = number of draws, N = population size
const hyperGeometricTest = (k, K, n, N) => 1 - cdf(k - 1, N, K, n);
console.log(hyperGeometricTest(1, 71, 86, 2443));

And the following R code:

# computed/run with https://rdrr.io/snippets/
k <- 1; K <- 71; n <- 86; N <- 2443
sum <- 0
for (i in k:K) { sum <- sum + dhyper(i, K, N - K, n) }
print(sum, digits = 20)

which, according to my best understanding of statistics, should produce the same number. However, here are the results, respectively:

0.9243998310893528
0.92439983108944501211

As you can see, they're almost identical, agreeing up to about the 12th decimal place. It seems like some kind of numerical error in the cdf function. Although it's possible I'm misusing the function and the two snippets shouldn't actually produce exactly the same value.

Planeshifter commented 4 years ago

This discrepancy is likely not an error on our part. Statistical libraries in different programming languages use different implementations of certain mathematical functions, or evaluate a given function via a different formula. Both may be mathematically correct yet return different results due to the limitations of floating-point numbers, which can only approximate the exact values.

For example, with Python's scipy package you obtain 0.9243998310892152 for your desired quantity:

from scipy.stats import hypergeom

# scipy's parameterization is hypergeom(M, n, N): population size M,
# number of success states n, and number of draws N
rv = hypergeom(2443, 71, 86)
1 - rv.cdf(0)
Planeshifter commented 4 years ago

For your new example, a similar pattern is observed.

Evaluated in the stdlib REPL:

In [1]: 1-base.dists.hypergeometric.cdf( 6, 2443, 71, 224 )
Out[1]: 0.47990933315654394

Your manual calculation in R (you could also use the phyper function):

> k <- 7; K <- 71; n <- 224; N <- 2443; sum <- 0; for(i in k:K) { sum <- sum + dhyper(i, K, N-K, n); }; print(sum, digits = 20);
[1] 0.47990933315812023263

and again Python for comparison:

>>> from scipy.stats import hypergeom
>>> rv = hypergeom(2443, 71, 224)
>>> 1 - rv.cdf(6)
0.4799093331576778

Still doesn't look like an error to me. But please let us know should you encounter a set of values for which the results are markedly off and I will take a deeper look.

Thanks for investigating and filing this issue!

vincerubinetti commented 4 years ago

~Sorry, I deleted that middle comment because I originally observed a large difference in stdlib (like 0.02), but realized I had typed in the parameters wrong. Feel free to delete your response comment.~

I understand it might not be a "bug" per se, but perhaps an enhancement or something to look into? In this issue, I compare the results of a few different methods. And the "old ADAGE" was an old version of my project that calculated everything manually -- that is, it didn't outsource any functions and thus reduced middle-man steps and rounding errors -- and its output more closely matches the output of R. Also, my understanding was that R is sort of the gold standard, and its results should be the source of truth for tests like these.

I would think that even with floating-point errors, it should still be accurate to many more than 10 decimal places.

Planeshifter commented 4 years ago

No worries.

While R might be perceived as the "gold standard", that is only because it is written by statisticians. There is nothing that makes its results automatically more correct than those of, say, Python, which in the given example also differs from R after the 12th decimal place, just like stdlib.

It's probably not just floating point errors but also a different implementation. If you look at the source code, we use a recurrence relation to evaluate the CDF.
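To illustrate the general idea, here is a minimal, self-contained sketch (not our actual source; the helper names are hypothetical, and the parameter order simply mirrors cdf( x, N, K, n )):

// Rough sketch: evaluate the hypergeometric CDF P(X <= x) by summing PMF
// terms linked by a recurrence relation, so only the first term has to be
// computed "from scratch". N = population size, K = subpopulation size,
// n = number of draws.

// ln of the binomial coefficient C(n, k), accumulated in log space.
// (A production implementation would use a log-gamma function instead.)
function lnBinom(n, k) {
  let s = 0.0;
  for (let i = 1; i <= k; i++) {
    s += Math.log((n - k + i) / i);
  }
  return s;
}

function hypergeometricCDF(x, N, K, n) {
  const lo = Math.max(0, n + K - N); // smallest attainable count
  const hi = Math.min(n, K);         // largest attainable count
  if (x < lo) return 0.0;
  if (x >= hi) return 1.0;

  // First term, P(X = lo), evaluated directly in log space:
  let pmf = Math.exp(lnBinom(K, lo) + lnBinom(N - K, n - lo) - lnBinom(N, n));
  let sum = pmf;

  // Each subsequent term follows from the recurrence
  // P(X = k+1) = P(X = k) * (K - k)(n - k) / ((k + 1)(N - K - n + k + 1)):
  for (let k = lo; k < x; k++) {
    pmf *= ((K - k) * (n - k)) / ((k + 1) * (N - K - n + k + 1));
    sum += pmf;
  }
  return sum;
}

// Example from this thread:
console.log(1 - hypergeometricCDF(0, 2443, 71, 86)); // ~0.92439983...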

If R were licensed under a permissive open-source license, we probably would have just ported their code to JavaScript when implementing these packages as part of stdlib back in the day. However, R is licensed under the GPL 2, so given our permissive licensing we couldn't do that; we had to re-implement all functions using other reference implementations or write them from scratch.

Hence, it's not likely that we will change the implementation, as the only way to exactly mirror R's output would be to actually port their implementation of the hypergeometric CDF and PMF.

vincerubinetti commented 4 years ago

I see, that makes sense to me. I should note that I'm not a statistician; I'm a front-end developer who took one stats class in college and temporarily has to wear the stat hat :x

I thought there would be rigid formulas for these things, meaning you could only implement them in so many ways. Or at least, any implementation should always tend toward the same exact solution, given enough floating-point precision.

But it seems like this stuff might be more like numerical analysis than exact formulas. I suppose you can close the issue, and my team will have to decide whether they're okay with the "inaccuracy".

Planeshifter commented 4 years ago

Okay, closing this issue. Thank you for considering stdlib for your project(s)!

If you have further questions, please feel free to reach out either on our Gitter channel or via email.

kgryte commented 4 years ago

@vincerubinetti Thanks for reaching out.

For your use case in ADAGE, I can see why you'd want to have the exact same result as R. I am guessing that the original ML work for the Greene Lab was written in R?

However, as @Planeshifter mentions, no one "right" answer exists. While our implementations are by no means infallible, we do test against reference implementations. For this particular function, we test against Julia. You can find the tests for evaluating the CDF here and for the PMF here. Included in the test code is the delta, which provides a sense of how well our implementations track a particular reference implementation.

In this particular case, the PMF evaluates the natural log of the factorial function, which, in turn, evaluates the natural log of the gamma function. Our reference implementation tracking error can be found here.
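To give a rough sense of how those special functions enter the picture, here is an illustrative sketch (not our actual implementation; the package path below is assumed to follow stdlib's usual naming) of the log-PMF written directly in terms of ln-factorials:

import factorialln from '@stdlib/math/base/special/factorialln';

// Illustrative only: the hypergeometric log-PMF expressed via ln-factorials.
// factorialln(x) evaluates ln(x!), which is typically implemented as the
// log-gamma function ln Γ(x+1); any tracking error in that special function
// propagates into the PMF and, from there, into the CDF.
const lnPMF = (x, N, K, n) =>
  (factorialln(K) - factorialln(x) - factorialln(K - x)) +
  (factorialln(N - K) - factorialln(n - x) - factorialln(N - K - n + x)) -
  (factorialln(N) - factorialln(n) - factorialln(N - n));

// P(X = 1) for the parameters in the original post:
console.log(Math.exp(lnPMF(1, 2443, 71, 86)));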

In short, not only can different approximation formulas differ (here, we use a recursive formula), but also the underlying special functions can differ. In stdlib, the underlying special function tracks well with the reference implementation (here, Julia). Given that the CDF formula, while one among many, is valid, we are not particularly surprised that we might differ from R for this particular function. In fact, we expect to differ more often than not, given the fact that errors (rounding and approximation) in underlying implementations propagate. We see that not only for stdlib and R, but also for SciPy and R, and Julia and R, etc.

My advice for your use case is to worry less about exactly equal numerical results and more about whether the overall conclusions derived from your application remain the same. If the conclusions of an analysis hinge on differences in the 12th decimal place, something is probably flawed in the analysis, as experiments are rarely so precise and experimental error is rarely so small. My hunch is that significant-digit calculations based on experimental error would reveal that the numerical differences observed here are simply noise.

That said, should you have an alternative (non-GPL) algorithm which demonstrates "better" behavior, we're more than willing to consider updating. But, for now, we don't see the observed differences as cause for concern.

Thanks again for reaching out!

vincerubinetti commented 4 years ago

Hi @kgryte, Thank you for taking the time to explain that to me in great detail. You certainly didn't have to do that.

I have a somewhat different but related question now. In our p-value calculations, we are sometimes getting values of 0 (0.0 x 10^0), presumably due to some kind of numerical error. I'm not doing any other transformations to the p-values that would cause this (I am applying FDR correction afterward, but removing it doesn't seem to change the presence of the 0's). So I believe the zeros come straight from the stdlib CDF function.

My team wants to show something like < [some epsilon value] instead, for all values less than a certain threshold (including the 0's), to indicate to the user that it is beyond the threshold of precision.

Given what I said above, what would be the most appropriate value to display here? The epsilon value for JavaScript floats? The Julia deltas you mentioned above? The deltas I observed between stdlib and R?

Also, this might be impossible given what you've told me. But since the CDF is a mathematical formula, there has to be an exact, true answer (even if it is irrational). Is there a way to know for sure how many decimal places a particular implementation will always be accurate to? If such a value were known, I would probably want to use that as the epsilon.

Planeshifter commented 4 years ago

Hi @vincerubinetti,

Observing p-values of zero does not indicate that something is awry. It just means that the observed data is wholly inconsistent with the distribution stipulated under your null hypothesis. If you observe very small p-values, you can safely reject your null hypothesis (recall that in many published scientific papers p-values of < 0.05 or < 0.01 are considered sufficient for rejection). Ronald Fisher, the father of much of modern statistics, said the following with regard to the p-value in the context of a chi-squared independence test: "Provided the deviation is clearly significant, it is of no practical importance whether P is .01 or .000,001" (see: https://psychclassics.yorku.ca/Fisher/Methods/chap4.htm).

I would not be concerned with numerical deviations in the result of a statistical test. One should be far more worried about whether the specified probability model is appropriate, issues of multiple testing, etc.

However, if you believe there to be an error in the CDF implementation, could you please provide an example set of values for which a result of zero is returned in stdlib but not in R or Python and we will have a look?

In general, I would say it's a good idea to display not the exact value but < [some epsilon value], as it de-emphasizes the actual value and does not project certainty when there is none to be had with hypothesis testing in the first place. The New England Journal of Medicine, for example, has the following style guideline:

In general, P values larger than 0.01 should be reported to two decimal places, and those between 0.01 and 0.001 to three decimal places; P values smaller than 0.001 should be reported as P<0.001. Notable exceptions to this policy include P values arising from tests associated with stopping rules in clinical trials or from genome-wide association studies. (https://www.nejm.org/author-center/new-manuscripts)
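If it helps, that guideline translates into a small formatting helper, roughly like this hypothetical sketch (not something stdlib ships):

// Rough sketch of the NEJM-style reporting rule quoted above.
function formatPValue(p) {
  if (p < 0.001) {
    return 'P<0.001';
  }
  if (p < 0.01) {
    return 'P=' + p.toFixed(3); // three decimal places for values between 0.001 and 0.01
  }
  return 'P=' + p.toFixed(2); // two decimal places for larger values
}

console.log(formatPValue(0));      // 'P<0.001'
console.log(formatPValue(0.0042)); // 'P=0.004'
console.log(formatPValue(0.27));   // 'P=0.27'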

There is no standard way to tell how many decimal places an implementation will be accurate to. There is also no basis to assume that R, Julia, or Python would provide you with a ground truth against which the stdlib implementation should be compared in order to answer this question.

vincerubinetti commented 4 years ago

Okay, I understand that there is no implementation that provides a ground truth to arbitrary precision. I'll need my team (who actually understand statistics) to help me. But let me try to clarify my situation.

I think my users expect and understand that the p values shown shouldn't be exactly accurate to many decimal places. They're not using the values in any kind of calculations. They're using them to get a general feel for which items in a table are more significant than others.

See this screenshot (website here):

image

I think this is more about handling the very specific case of a p-value of 0, which would indicate infinite significance. Rather than not including those values at all, our idea is to do this:

image

Where the value in that case is simply the JavaScript floating-point epsilon. Based on my best understanding of the situation, that is probably the most appropriate value to put there.
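For reference, the display rule I have in mind is roughly this (a hypothetical helper; the threshold is just JavaScript's Number.EPSILON):

// Show a floor instead of a raw 0 for p-values below the chosen threshold.
const THRESHOLD = Number.EPSILON; // ~2.22e-16

const displayPValue = (p) =>
  p < THRESHOLD ? `< ${THRESHOLD.toExponential(2)}` : p.toExponential(2);

console.log(displayPValue(0));      // '< 2.22e-16'
console.log(displayPValue(3.4e-7)); // '3.40e-7'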

kgryte commented 4 years ago

@vincerubinetti You could also follow the suggestion of @Planeshifter in which the 3rd column would be

P < 0.001
P < 0.001
P < 0.001
...
kgryte commented 4 years ago

...or even just

< 0.001
< 0.001
< 0.001
...

In short, just forgo displaying double-precision floating-point p-values altogether.

kgryte commented 4 years ago

...but...if you feel the need to display a double, then using floating-point epsilons should be fine, as you have done.

vincerubinetti commented 4 years ago

Thank you both again for taking time to discuss this. I'll talk to my team and see what they think is appropriate.

It makes sense to me that showing ultra-small p-values is problematic, but maybe my team feels it's still beneficial to show the high precision, even if it's not completely accurate. If we do end up capping the display at P < 0.001, that would end up covering something like half of our table.

kgryte commented 4 years ago

No worries! Thank you for using the project!