gnu-octave / statistics

The Statistics package for GNU Octave
GNU General Public License v3.0
24 stars 22 forks source link

Median method of BinomialDistribution object fails on i386 architecture #140

Closed rlaboiss closed 5 months ago

rlaboiss commented 5 months ago

On an i386 Debian system, one of the BISTs for BinomialDistribution fails with the following error message:

***** shared pd, t
 pd = BinomialDistribution (5, 0.5);
 t = truncate (pd, 2, 4);
***** assert (cdf (pd, [0:5]), [0.0312, 0.1875, 0.5, 0.8125, 0.9688, 1], 1e-4);
***** assert (cdf (pd, [1.5, 2, 3, 4, NaN]), [0.1875, 0.5, 0.8125, 0.9688, NaN], 1e-4);
***** assert (icdf (pd, [0:0.2:1]), [0, 2, 2, 3, 3, 5], 1e-4);
***** assert (icdf (pd, [-1, 0.4:0.2:1, NaN]), [NaN, 2, 3, 3, 5, NaN], 1e-4);
***** assert (iqr (pd), 1);
***** assert (mean (pd), 2.5, 1e-10);
***** assert (median (pd), 3);
!!!!! test failed
ASSERT errors for:  assert (median (pd),3)

  Location  |  Observed  |  Expected  |  Reason
     ()           2            3         Abs err 1 exceeds tol 0 by 1
shared variables     pd =

      BinomialDistribution object with properties:

              CensoringAllowed: 0
              DistributionCode: bino
              DistributionName: BinomialDistribution
                     InputData: [0x0 double]
                   IsTruncated: 0
                             N: [1x1 double]
                 NumParameters: [1x1 double]
                   ParameterCI: [0x0 double]
           ParameterCovariance: [2x2 double]
          ParameterDescription: [1x2 cell]
              ParameterIsFixed: error: octave_base_value::bool_value(): wrong type argument 'bool matrix'

It succeeds on an amd64 Debian system.

Ultimately, this is caused by a numerical computation difference between the architectures i386 and amd64.

On an amd64 system, we have (the code below is the crucial line in binoinv.m, which is called by BinomialDistribution.median):

> cdf = binocdf  (0 : 9, 5, 0.5);
> bsxfun (@le, 0.5, cdf)
        ans =

          0  0  0  1  1  1  1  1  1  1
> cdf (3) - 0.5
        ans = -1.1102e-16

while, on i386, we have the following:

> cdf = binocdf  (0 : 9, 5, 0.5);
> bsxfun (@le, 0.5, cdf)
        ans =

          0  0  1  1  1  1  1  1  1  1
> cdf (3) - 0.5
        ans = 0

Clearly, this is not a bug in the statistics packages, per se. However, the function binoinv is apparently relying on an unreliable algorithm to get its result.

pr0m1th3as commented 5 months ago

The binoinv function has quite some complexity to deal with various cases in an efficient manner. Honestly, I haven't dig into the maths it's using to compute the inverse. My contribution was merely adding input validation, fixing the coding style, the docstring format and the demo code. If you can work on a fix for this issue, a PR would be welcome, but since Octave has dropped support for 32bit systems, I really don't see the necessity to put the required effort for this. Btw, where did you find an operational i386 system to run Octave? :smile: Note: since Octave 9.1 is around the corner, the next statistics major release (1.7; expected by the end of summer) will drop support for Octave 7.x and if I am not mistaken the 8 series do not have 32 bit bundles for windows systems

rlaboiss commented 5 months ago

Btw, where did you find an operational i386 system to run Octave? 😄

It is pretty straightforward to set up a 32-bit, fully operational Debian system on a 64-bit Debian system using schroot.

If you can work on a fix for this issue, a PR would be welcome, but since Octave has dropped support for 32bit systems, I really don't see the necessity to put the required effort for this.

I looked deeper in the issue and I think that the BIST in binoinv hit a corner case. Indeed, the median value of a binomial distribution when n is odd and p = ½ is not unique (Nowakowski, 2021). For instance, the median values of BinomialDistribution(5,0.5) are 2 and 3. These are the values obtained independently on the i386 and the amd64 systems, respectively. In this case, the correct value for the median would be 2.5 (which is also the mean, BTW). In sum, the returned value is incorrect on both architectures!

A simple fix to the problem would be to treat p = ½ as a special case in binoinv.m, by returning the mean value, instead of doing the computations.

pr0m1th3as commented 5 months ago

That's a relatively easy fix, the only think that is of concern is that (once again) MATLAB produces the wrong(?) results.

>> pd = makedist ('Binomial', 'N', 5, 'p', 0.5)

pd = 

  BinomialDistribution

  Binomial distribution
    N =   5
    p = 0.5

>> median(pd)

ans =

     3

>> mean(pd)

ans =

    2.5000

>> icdf(pd, 0.5)

ans =

     3
rlaboiss commented 5 months ago

If you are concerned with bug-compatibility with MATLAB, you might call __traditional__() in binoinv.m, using the current code if the returned value is 1 (true) or using the correct code (the on that returns the mean value as the median when n is odd and p = ½) if the returned value is 0 (false).

The behavior of __traditional__() is controlled by the use of the command line option --braindead (or --traditional, to be more polite).

pr0m1th3as commented 5 months ago

I looked deeper in the issue and I think that the BIST in binoinv hit a corner case. Indeed, the median value of a binomial distribution when n is odd and p = ½ is not unique (Nowakowski, 2021). For instance, the median values of BinomialDistribution(5,0.5) are 2 and 3. These are the values obtained independently on the i386 and the amd64 systems, respectively. In this case, the correct value for the median would be 2.5 (which is also the mean, BTW). In sum, the returned value is incorrect on both architectures!

A simple fix to the problem would be to treat p = ½ as a special case in binoinv.m, by returning the mean value, instead of doing the computations.

Can you explain the required changes in a bit more detail, please? The binoinv function's identity is x = binoinv (p, n, ps). Which variable you refer to as p, p or ps?

rlaboiss commented 5 months ago

Can you explain the required changes in a bit more detail, please? The binoinv function's identity is x = binoinv (p, n, ps). Which variable you refer to as p, p or ps?

As p, I am referring to ps.

pr0m1th3as commented 5 months ago

So, what about p then? I am sorry for the series of questions but I can't really follow through the math in the Nowakowski, 2021 paper. Somehow it does not make sense to me that x = n/2 when n is odd and ps = 0.5 for any p-value in p. I would assume that this edge case only applies when p is also 0.5. After all, the median is computed at p=0.5 as well. But I would really like some verification for this.

rlaboiss commented 5 months ago

Indeed, it is confusing and it's my fault. I apologize for that. When I wrote:

A simple fix to the problem would be to treat p = ½ as a special case in binoinv.m, by returning the mean value, instead of doing the computations.

I was meaning the median method of BinomialDistribution and not really binoinv.

I just opened MR #142 for fixing the issue. Note that it is a partial fix, because it does not cope with the truncated case.

pr0m1th3as commented 5 months ago

Thank you for the clarrification and the PR of course. FYI, the methods for the truncated binomial distribution object do not work yet, see issue #128. PR #142 has been merged so I close this issue as done.