scipy / scipy

SciPy library main repository
https://scipy.org
BSD 3-Clause "New" or "Revised" License

ENH: special: add ufunc for logarithm of the binomial coefficient #19473

Open · dschmitz89 opened 1 year ago

dschmitz89 commented 1 year ago

Is your feature request related to a problem? Please describe.

In many discrete distributions, the logarithm of the binomial coefficient appears in the logpmf calculation. As the binomial coefficient is fundamentally important elsewhere as well, I think it is worth adding a dedicated function for it in special. The binomial coefficient also quickly overflows in double precision, while its logarithm does not.
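For example (a quick illustration; gammaln is just one way to get the log value directly, and is not the proposed implementation):

from scipy import special

# The coefficient itself overflows double precision quickly...
print(special.binom(2000, 1000))                          # inf
# ...while its logarithm stays modest:
print(special.gammaln(2001) - 2 * special.gammaln(1001))  # ~1382.27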

Describe the solution you'd like.

A function special.logbinom(x, y).

CC @steppi , as you are working on binom in #19471 , do you have a strong opinion here by any chance?

aadya940 commented 1 year ago

Hey, if we're implementing this, I have a small implementation here:

cimport cython
from cython.parallel import prange

cdef extern from "math.h":
    double log(double x)

# Use Stirling's approximation if approximate is true,
# i.e. log(n!) ≈ n*log(n) - n

cdef double _factorial(double x) nogil:
    cdef Py_ssize_t i
    cdef double _fact = 1
    # the in-place product makes _fact a prange reduction variable
    for i in prange(1, <Py_ssize_t>x + 1):
        _fact *= i
    return _fact

cdef double _stirlings_approximation(double x) nogil:
    return x * log(x) - x

@cython.ufunc
cdef double _logbinomial(double x, double y, bint approximate):
    cdef double result
    if approximate:
        result = (_stirlings_approximation(x)
                  - _stirlings_approximation(y)
                  - _stirlings_approximation(x - y))
    else:
        # log turns the quotient of factorials into a difference of logs
        result = log(_factorial(x)) - log(_factorial(y) * _factorial(x - y))
    return result

steppi commented 1 year ago

Using Stirling's isn't a great approach. I think for the general case you'd want to rely on

$${n \choose k} = \frac{1}{(n+1)\mathrm{B}(n - k + 1, k + 1)}$$

and use betaln.
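A quick numerical check of that identity after taking logs (a sketch, using math.comb as ground truth for a modest case):

import math
from scipy import special

n, k = 50, 20
ref = math.log(math.comb(n, k))                        # exact integer path
via_betaln = -special.betaln(n - k + 1, k + 1) - math.log(n + 1)
print(ref, via_betaln)                                 # both ≈ 31.48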

steppi commented 1 year ago

@dschmitz89, I have no objections to adding this; it's just a matter of bandwidth for someone to implement it properly and someone else to review.

@aadya940, thanks for your interest, but beyond your use of Stirling's approximation, it looks like you're just computing integer factorials. That won't work for general double x and y, does a lot of duplicate work even in the non-negative integer cases it can handle, and will overflow in intermediate results. You also seem to be using prange for a calculation that is inherently sequential. If you're interested in learning more about numerical computation of special functions, I can recommend books or other learning materials.

steppi commented 1 year ago

@dschmitz89, if you want to submit a PR, the way to go would be to compute log(binom) directly in the integer case from the binom source code (the branch if (k == kx && (std::fabs(n) > 1E-8 || n == 0)) {), and otherwise compute -cephes::lbeta(1 + n - k, 1 + k) - std::log(n + 1). You could just add your implementation to _binom.h from https://github.com/scipy/scipy/pull/19471, once that PR gets in. But it would have to go through the mailing list first.
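In Python pseudocode, that branching might look like the sketch below (logbinom_sketch is a hypothetical mock-up for illustration, not the eventual C++ implementation):

import numpy as np
from scipy import special

def logbinom_sketch(n, k):
    # Hypothetical mock-up of the branching described above; the real
    # implementation would be C++ in _binom.h.
    kx = np.floor(k)
    if k == kx and (abs(n) > 1e-8 or n == 0):
        # Integer-k branch. The real code would compute log(binom)
        # directly to avoid overflow; log of binom is only a stand-in.
        return np.log(np.abs(special.binom(n, k)))
    # General case: the betaln identity from the earlier comment.
    return -special.betaln(1 + n - k, 1 + k) - np.log(n + 1)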

aadya940 commented 1 year ago

Hey @steppi, I would love to read more about this topic, so please do recommend some material. I'll also rewrite my code based on the points you mentioned! Thanks!

ilayn commented 1 year ago

Just a word of caution: we cannot, and actually do not, use Numerical Recipes, because its license is not compatible with distributing code under open-source licenses. Even looking at the code for inspiration is debatable in this context, hence overall we completely avoid NR.

And I think that link leads to an unofficial copy of the book.

steppi commented 1 year ago

@aadya940, I was actually planning to warn you not to read Numerical Recipes. Beyond the unusual and restrictive licensing, the code quality isn't great either, though the explanations are good. (I've heard the original Fortran edition did feature idiomatic Fortran, which might have been considered good for its time.)

If you can get hold of a copy, Nico M. Temme's Special Functions: An Introduction to the Classical Functions of Mathematical Physics is excellent, but it will require some familiarity with calculus and undergraduate-level complex analysis. You might want to take or sit in on an intro complex analysis course at your university if you don't have experience with the subject.

I've only skimmed through them in the past, but these freely available lecture notes by John Michael Finn seem good, and are more gentle since they walk through a lot of the prerequisite material Temme assumes you already know. You won't get the same level of depth here though. Your goal should be to build yourself up to being able to grapple with Temme.

If you haven't already, it would be good to become familiar with the content of the classic article What Every Computer Scientist Should Know About Floating-Point Arithmetic. The Boost math documentation is also a great resource.

If you want to contribute to SciPy, make sure you check the license before studying any relevant source code. Oddball things like NR or GPL'd things like the GNU Scientific Library or most R source code are out of bounds.

dschmitz89 commented 1 year ago

Thanks for your suggestions @steppi. This might be a good opportunity for me to dig into C++ land again, but I am rather swamped at the moment, so if anyone has time before then, by all means write to the mailing list and submit a PR :).

aadya940 commented 1 year ago

Thank you so much @steppi. I skimmed through the material and it looks really interesting; I'll try to complete it as soon as possible. I've also added Complex Analysis for next semester. Hoping to contribute something good to SciPy soon. Thanks :-)

steppi commented 1 year ago

Sounds good @aadya940. Complex Analysis is a really beautiful subject. Since the complex derivative is required to be the same along all directions, complex differentiable functions have really nice structure and properties. Looking forward to seeing your first contribution.

aadya940 commented 12 months ago

Thank you, @steppi!

mdhaber commented 2 months ago

Re: https://github.com/scipy/scipy/issues/19473#issuecomment-1793467395: this is quite good for $k > 1$.

[figure: relative error of the betaln-based formula for log binom(n, k) over the (n, k) plane; accurate for k > 1 apart from a band where k is close to n]

The inaccurate band where k is close to n is mostly due to the inaccuracy of betaln itself. For instance, even when there is no loss of precision due to subtraction in the input:

from scipy import special
import mpmath as mp
mp.mp.dps = 1000

def betaln_mp(x, y):
    # arbitrary-precision reference for betaln
    x, y = mp.mpf(float(x)), mp.mpf(float(y))
    return float(mp.log(mp.beta(x, y)))

n = 10000000
k = 10
x = n - k + 1
y = k + 1
assert int(x) == float(x) and int(y) == float(y)  # inputs exactly representable
res = special.betaln(x, y)
ref = betaln_mp(x, y)
print(abs(res - ref)/abs(ref))
# 2.9874447614386427e-11

Yet if n is increased by a factor of 10, the error drops back to ~1e-16, so this must be near some threshold where the algorithm switches.
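Re-running the same check with the larger n (a sketch; the printed magnitude is per the observation above):

from scipy import special
import mpmath as mp
mp.mp.dps = 1000

n, k = 10**8, 10                        # n increased tenfold
x, y = n - k + 1, k + 1
res = special.betaln(x, y)
ref = float(mp.log(mp.beta(mp.mpf(x), mp.mpf(y))))
print(abs(res - ref) / abs(ref))        # ~1e-16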

However, for k << 1, the binomial coefficient is very close to 1, so its logarithm is very close to zero, and the relative error of the result is poor. Part of this is due to the explicit addition of 1 to small k: if $\log_{10} k < -16$ or so, then k + 1 == 1 in double precision, the result is exactly 0, and we get a relative error of 100%. This can be avoided if you use the relationship:

$${n \choose k} = \frac{n}{k\,(n - k)\,\mathrm{B}(n - k, k)}$$

and take the log of both sides. But even then the error is not good, because there is subtractive cancellation between special.betaln(n - k, k) and np.log(k) for small k. (There may also be loss of precision in computing n - k, and subtractive cancellation between np.log(n - k) and np.log(n), but the difference between those two is easier to approximate closely.) So really, to get good relative error for small $k$, I think we need another relationship, such as an asymptotic approximation for $k \approx 0$. Wolfram Alpha can give a Taylor series expansion, but there might be something better out there.
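A minimal demonstration of the small-k failure, together with a hypothetical first-order term (the coefficient $\psi(n+1) + \gamma$ comes from differentiating the lgamma form of $\log {n \choose k}$ at $k = 0$; it is an illustration, not a proposal from this thread):

import numpy as np
from scipy import special

n, k = 100.0, 1e-17
print(k + 1 == 1)                       # True: k vanishes entirely in k + 1
res = -special.betaln(n - k + 1, k + 1) - np.log(n + 1)
print(res)                              # 0.0, i.e. 100% relative error

# First-order expansion around k = 0 (an assumption for illustration):
# log binom(n, k) ≈ k * (digamma(n + 1) + euler_gamma)
print(k * (special.digamma(n + 1) + np.euler_gamma))   # ~5.2e-17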

If anyone cares about the log of binomial coefficients extremely close to but not equal to 1, that is.