mdhaber / scipy

Scipy library main repository
http://scipy.org/scipylib/
BSD 3-Clause "New" or "Revised" License
1 stars 5 forks source link

WIP: Boost stats distributions #48

Closed mckib2 closed 2 years ago

mckib2 commented 3 years ago

@mdhaber @WarrenWeckesser

Reference issue

Beta Distribution:

Binomial Distribution / Negative Binomial Distribution

adding more as we find them...

What does this implement/fix?

Additional information

Boost distributions can be used like this:

from scipy.stats.boost import beta # or binom, nbinom currently

Function the same as their scipy.stats counterparts.

mckib2 commented 3 years ago

I've converted the "demos" from the stats-boost-deliverables repo to "tests". They may or may not stay in this form, it's mainly a way to keep track of which issues are resolved using boost. Run like this:

pytest scipy/stats/boost/tests/test_boost_stats.py
mdhaber commented 3 years ago

Cool, so this is building in SciPy now?

mckib2 commented 3 years ago

@mdhaber I built from clean just now. If you have a minute, could you pull and see if it builds for you?

mdhaber commented 3 years ago

Looks easy:

Error compiling Cython file:
------------------------------------------------------------
...
cdef extern from "C:\Users\matth\scipy\scipy\stats\boost\include\func_defs.hpp" namespace "" nogil:
                   ^
------------------------------------------------------------

func_defs.pxd:1:20: Invalid unicode escape '\U'

Error compiling Cython file:
------------------------------------------------------------
...
cdef extern from "C:\Users\matth\scipy\scipy\stats\boost\include\func_defs.hpp" namespace "" nogil:
                   ^
------------------------------------------------------------

func_defs.pxd:1:20: Invalid unicode escape '\U'

Error compiling Cython file:
------------------------------------------------------------
...
cdef extern from "C:\Users\matth\scipy\scipy\stats\boost\include\func_defs.hpp" namespace "" nogil:
                   ^
------------------------------------------------------------

func_defs.pxd:1:20: Invalid unicode escape '\U'

Error compiling Cython file:
------------------------------------------------------------
...
cdef extern from "C:\Users\matth\scipy\scipy\stats\boost\include\func_defs.hpp" namespace "" nogil:
                   ^
------------------------------------------------------------

func_defs.pxd:1:20: Invalid unicode escape '\U'

Error compiling Cython file:
------------------------------------------------------------
...
cdef extern from "C:\Users\matth\scipy\scipy\stats\boost\include\func_defs.hpp" namespace "" nogil:
                   ^
------------------------------------------------------------

func_defs.pxd:1:20: Invalid unicode escape '\U'
Traceback (most recent call last):
  File "C:\Users\matth\scipy\tools\cythonize.py", line 311, in <module>
    main()
  File "C:\Users\matth\scipy\tools\cythonize.py", line 307, in main
    find_process_files(root_dir)
  File "C:\Users\matth\scipy\tools\cythonize.py", line 296, in find_process_files
    for result in pool.imap_unordered(lambda args: process(*args), jobs):
  File "C:\Users\matth\Anaconda3\envs\scipydev\lib\multiprocessing\pool.py", line 748, in next
    raise value
  File "C:\Users\matth\Anaconda3\envs\scipydev\lib\multiprocessing\pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "C:\Users\matth\scipy\tools\cythonize.py", line 296, in <lambda>
    for result in pool.imap_unordered(lambda args: process(*args), jobs):
  File "C:\Users\matth\scipy\tools\cythonize.py", line 230, in process
    processor_function(fromfile, tofile, cwd=path)
  File "C:\Users\matth\scipy\tools\cythonize.py", line 95, in process_pyx
    raise Exception('Cython failed')
Exception: Cython failed
Traceback (most recent call last):
  File "setup.py", line 588, in <module>
    setup_package()
  File "setup.py", line 572, in setup_package
    generate_cython()
  File "setup.py", line 334, in generate_cython
    raise RuntimeError("Running cythonize failed!")
RuntimeError: Running cythonize failed!
mckib2 commented 3 years ago

Windows, my old nemesis... It does look like an easy fix. Will look at a little later or tomorrow, please feel free to patch before then if you can't wait

mdhaber commented 3 years ago

seems ok on mac. i'll work with that.

mdhaber commented 3 years ago

Built and the boost tests passed!

mdhaber commented 3 years ago

This is awesome. Can you see if these pass the appropriate tests in test_continuous_basic.py, test_distributions.py (and test_discrete_basic.py / test_discrete_distns.py)?

mckib2 commented 3 years ago

Windows build should be working (fingers crossed). Progress on continuous distributions, haven't touched tests for discrete distributions yet.

mdhaber commented 3 years ago

Sorry but same error : /

mdhaber commented 3 years ago

Good news on times, though. Boost is faster than SciPy for several tests.

1.74s call     scipy/stats/tests/test_continuous_basic.py::test_cont_basic[ncx2-arg75]
0.99s call     scipy/stats/tests/test_continuous_basic.py::test_cont_basic[boost.ncx2-arg105]

1.37s call     scipy/stats/tests/test_continuous_basic.py::test_cont_basic[boost.beta-arg104]
0.67s call     scipy/stats/tests/test_continuous_basic.py::test_cont_basic[beta-arg4]

1.22s call     scipy/stats/tests/test_continuous_basic.py::test_moments[ncx2-arg76-True-True-False]
0.39s call     scipy/stats/tests/test_continuous_basic.py::test_moments[boost.ncx2-arg107-True-True-False]

0.51s call     scipy/stats/tests/test_continuous_basic.py::test_moments[beta-arg4-True-True-False]
0.48s call     scipy/stats/tests/test_continuous_basic.py::test_moments[boost.beta-arg106-True-True-False]

I'm asking @swallan to make it easier to add new distributions to the benchmarks so we can get a more accurate assessment.

mckib2 commented 3 years ago

@mdhaber Thanks for taking a look! I'm working towards conference submission deadline, but after this coming Wednesday I should have some time to resume work on this

mckib2 commented 3 years ago

Sorry but same error : /

I've tried converting to PosixPath then calling str(). Hopefully that fixes it -- I don't have scipy building on Windows right now, but if the issue persists I may try to set it up again

mckib2 commented 3 years ago

RE https://github.com/scipy/scipy/issues/7406: boost.binom.ppf does the right thing at q=0, 0.5, and 1

mdhaber commented 3 years ago

Thanks. I'll try compiling again tonight. Can you patch the wrapper for Beta so it works at x=1?

mdhaber commented 3 years ago

Now working on Windows! W00t! Next I'll take a closer look at the "interesting" Python code, like tests. I'll also look out for other issues this might solve. @swallan even though your SciPy stats benchmarking PR isn't merged yet, can you:

@WarrenWeckesser do you have thoughts on the lower level things?

mckib2 commented 3 years ago

Can you patch the wrapper for Beta so it works at x=1?

Now patched to return inf when x=1, beta<1 (I think that's the right answer, happy to change if not) EDIT: returns 0, not inf

swallan commented 3 years ago

I have created a new branch called benchmark_dist_compare to work with discrete distributions and to show the times for the scipy and boost methods:

Screen Shot 2020-12-22 at 12 39 55 AM
mdhaber commented 3 years ago

Thanks @swallan! I'll review that PR shortly but these results look great.

@mckib2 I think there's one more thing we want to do before we post this PR on SciPy's GitHub. I checked with @WarrenWeckesser, and ultimately, we're not going to want to have separate scipy.stats and scipy.stats.boost versions of these distributions; we'll be replacing SciPy's implementations with the underlying Boost code. So I think the thing to do is to "drop in" your replacements from beta.py and binom.py/nbinom.py into _continuous_distns.py and _discrete_distns.py, respectively.

For beta, we should preserve the code that is not upgraded by Boost, like _fitstart and _fit.

mckib2 commented 3 years ago

So I think the thing to do is to "drop in" your replacements from beta.py and binom.py/nbinom.py into _continuous_distns.py and _discrete_distns.py, respectively.

They should be "dropped in" now, all the tests are passing on my machine and I'm curious to know if that holds on other machines as well.

For beta, we should preserve the code that is not upgraded by Boost, like _fitstart and _fit.

I kept all the fitting code as well as logpdf implementations (see reviews). I'm not sure whether it's better to use the existing logpdf implementations or take the log of the boost pdf implementation -- I'll let the council decide

mdhaber commented 3 years ago

Thanks @mckib2. On my mac, there's one failure that appears to be "real"; the other three seem to be due to a bad merge (and easily fixed).

=================================== FAILURES ===================================
___________________________ TestBinom.test_warns_p0 ____________________________
scipy/stats/tests/test_distributions.py:233: in test_warns_p0
    assert_equal(stats.binom(n=2, p=0).mean(), 0)
        self       = <scipy.stats.tests.test_distributions.TestBinom object at 0x11d0cb390>
scipy/stats/_distn_infrastructure.py:488: in mean
    return self.dist.mean(*self.args, **self.kwds)
        self       = <scipy.stats._distn_infrastructure.rv_frozen object at 0x11d0e63d0>
scipy/stats/_distn_infrastructure.py:1321: in mean
    res = self.stats(*args, **kwds)
        args       = ()
        kwds       = {'moments': 'm', 'n': 2, 'p': 0}
        self       = <scipy.stats._discrete_distns.binom_gen object at 0x11d0e6090>
scipy/stats/_distn_infrastructure.py:1121: in stats
    mu, mu2, g1, g2 = self._stats(*goodargs)
        args       = (array(2), array(0))
        cond       = True
        default    = array(nan)
        goodargs   = [array([2]), array([0])]
        kwds       = {'moments': 'm', 'n': 2, 'p': 0}
        loc        = array([0])
        moments    = 'm'
        output     = []
        scale      = array([1])
        self       = <scipy.stats._discrete_distns.binom_gen object at 0x11d0e6090>
scipy/stats/_discrete_distns.py:80: in _stats
    _boost._binom_skewness(n, p),
E   RuntimeWarning: divide by zero encountered in _binom_skewness
        n          = array([2])
        p          = array([0])
        self       = <scipy.stats._discrete_distns.binom_gen object at 0x11d0e6090>

It's not building on Windows anymore:

building 'boost_test_build' library
compiling C sources
creating build\temp.win-amd64-3.7\scipy\_build_utils
creating build\temp.win-amd64-3.7\scipy\_build_utils\boostinator
creating build\temp.win-amd64-3.7\scipy\_build_utils\boostinator\boost_test_build
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\matth\scipy\scipy\_build_utils\boostinator -IC:\Users\matth\Anaconda3\envs\scipydev\lib\site-packages\numpy\core\include -IC:\Users\matth\Anaconda3\envs\scipydev\include -IC:\Users\matth\Anaconda3\envs\scipydev\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt /EHsc /Tpscipy\_build_utils\boostinator\boost_test_build\boost_test_build.cpp /Fobuild\temp.win-amd64-3.7\scipy\_build_utils\boostinator\boost_test_build\boost_test_build.obj
boost_test_build.cpp
scipy\_build_utils\boostinator\boost_test_build\boost_test_build.cpp(4): fatal error C1083: Cannot open include file: 'boost/math/distributions.hpp': No such file or directory
error: Command "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\matth\scipy\scipy\_build_utils\boostinator -IC:\Users\matth\Anaconda3\envs\scipydev\lib\site-packages\numpy\core\include -IC:\Users\matth\Anaconda3\envs\scipydev\include -IC:\Users\matth\Anaconda3\envs\scipydev\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\include -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\ucrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\shared -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\winrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.19041.0\cppwinrt /EHsc /Tpscipy\_build_utils\boostinator\boost_test_build\boost_test_build.cpp /Fobuild\temp.win-amd64-3.7\scipy\_build_utils\boostinator\boost_test_build\boost_test_build.obj" failed with exit status 2
mckib2 commented 3 years ago

It's not building on Windows anymore

Well at least we know the boost build test works...

I'll check out these and other issues either today or this weekend.

EDIT: merge and test errors should be fixed, not sure what happened to the Windows build

mckib2 commented 3 years ago

@mdhaber I used this guide to build on Windows and I'm not getting the error you reported. Can you try building again from clean to see if it's a stale build file breaking something?

mdhaber commented 3 years ago

I does build when I clone from this repo. Not sure why removing the build folder and git clean -fdx isn't enough. Update: I just tried git clean -fdx again. Worked this time. Whatever.

mdhaber commented 3 years ago

@WarrenWeckesser The Python code looks good to me. There are no errors on my machine. I helped write the test code, and I just double-checked that this closes the issues that it claims to close, so I think I'm done. Did you want to look over the lower-level stuff before this PR is opened on the main repo?

mdhaber commented 3 years ago

Now patched to return inf when x=1, beta<1 (I think that's the right answer, happy to change if not) EDIT: returns 0, not inf

Are you sure about that? I think inf is right for beta < 1. In SciPy currently:

from scipy.stats import beta
beta(a = 2, b = 0.5).pdf(1)

Also see the plots on Wikipedia. And from the definition, (1- x)^(beta - 1) is 0 to a negative power, which is divide by zero.

Were any other fixes like this needed? Also, it doesn't look like it, but were there any modifications (besides the additions) to the tests?

mckib2 commented 3 years ago

Did you want to look over the lower-level stuff before this PR is opened on the main repo?

I'm making some of the cython generation code a little nicer, but other than that it should be good to go on my end. EDIT: done

Are you sure about that? I think inf is right for beta < 1

I can change back to inf, I thought the Boost-side upstream fix used 0 but I may be mistaken. EDIT: changed to INF

Were any other fixes like this needed?

Beta was the only distribution we needed to patch, everything else should be using "normal" wrappers.

Also, it doesn't look like it, but were there any modifications (besides the additions) to the tests?

We only added tests, we didn't modify any. I think there may be some modifications I made to make it easier to include boost as another module that are no longer necessary since we're replacing the distributions, so I can scan through and remove those. EDIT: done

WarrenWeckesser commented 3 years ago

@mdhaber, I'll get back to this tomorrow morning.

mdhaber commented 3 years ago

I can change back to inf, I thought the Boost-side upstream fix used 0 but I may be mistaken.

Well, you're right, it does appear to use 0. I will look closer.

It seems to be a matter of whether the support is (0, 1) or [0, 1]. Wikipedia shows both possibilities; Mathematica (referenced by the Boost dev) uses (0, 1). But SciPy says: image

and the current implementation does use that formula at the endpoints, so I think we would need to return infinity for backwards compatibility, unless we consider this to be a mistake, in which case we'd also need to change the docs. I would actually prefer the latter, because then we wouldn't have to modify the behavior of the boost function.

Wikipedia says:

This definition includes both ends x = 0 and x = 1, which is consistent with definitions for other continuous distributions supported on a bounded interval which are special cases of the beta distribution, for example the arcsine distribution, and consistent with several authors, like N. L. Johnson and S. Kotz.[1][2][3][4] However, the inclusion of x = 0 and x = 1 does not work for α, β < 1; accordingly, several other authors, including W. Feller,[5][6][7] choose to exclude the ends x = 0 and x = 1, (so that the two ends are not actually part of the domain of the density function) and consider instead 0 < x < 1.

@WarrenWeckesser?

Also, seems to me that beta should implement _get_support, but it doesn't (even in SciPy master).

mckib2 commented 3 years ago

@mdhaber @WarrenWeckesser Is this ready to open in scipy?

mdhaber commented 3 years ago

I thought so here. I think we're waiting for the OK from @WarrenWeckesser.

WarrenWeckesser commented 3 years ago

Is this ready to open in scipy?

Yes, go ahead.