scipy / scipy

SciPy library main repository
https://scipy.org
BSD 3-Clause "New" or "Revised" License
13.06k stars 5.18k forks source link

stats.scipy.multinomial is not vectorized? #11422

Open alejandroschuler opened 4 years ago

alejandroschuler commented 4 years ago

It appears that stats.scipy.multinomial does not support vectorized operations in a way that is consistent with other scipy.stats distributions that I have used:

from scipy.stats import binom, multinomial

# 3 binomial RVs, each with n=1, but with different probabilities of success
pbinom = np.array([0, 0, 1]) 
xbinom = binom(n=1,p=pbinom)
xbinom.rvs()
>>> array([0, 0, 1]) # returns the result of the 3 "coin flips"

# 3 multinomial RVs, each with n=1, but with different probabilities of the (3) classes
pmulti = np.array([[0,1,0],[1,0,0],[0,0,1]])
xmulti = multinomial(n=1,p=pmulti)
xmulti.rvs()
>>> array([[0, 1, 0]]) # returns the result of only the first "die roll"

# expected: 
# array([[0,1,0],
#            [1,0,0],
#            [0,0,1]])

It also seems that n can't be vectorized:

xmulti = multinomial(n=np.array([1,2]), p=np.array([[0,0,1],[1,0,0]]))
xmulti.rvs()

returns an error, whereas

pbinom = np.array([0, 0, 1])
xbinom = binom(n=np.array([1,1,5]),p=pbinom)
xbinom.rvs()
>>> array([0, 0, 5])

does exactly what one would expect.

pvanmulbregt commented 4 years ago

multinomial.rvs() calls random_state.multinomial(n, p, size) https://github.com/scipy/scipy/blob/7e3cda70f5448bc2afa019d7b49533eaba33090e/scipy/stats/_multivariate.py#L3218-L3220

random_state.multinomial doesn't appear to deal with the multi-dimensional pmulti. I.e. It returns only the values corresponding to the first row of pmulti

pmulti = np.array([[0,1,0],[1,0,0],[0,0,1]])
random_state = np.random.mtrand._rand
random_state.multinomial(1, pmulti, None) 
#  array([0, 1, 0])
random_state.multinomial(1, pmulti[0], None) 
#  array([0, 1, 0])
random_state.multinomial(1, pmulti[2], None) 
#  array([0, 0, 1])

# What you would like is the equivalent of
np.array([rs2.multinomial(1, _p, None)  for _p in pmulti])
# array([[0, 1, 0],
#        [1, 0, 0],
#        [0, 0, 1]])

The multinomial docstring is https://github.com/scipy/scipy/blob/7e3cda70f5448bc2afa019d7b49533eaba33090e/scipy/stats/_multivariate.py#L2949-L2952

It gives a list of some methods that support broadcasting and rvs is not on that list. There is a comment in the original PR https://github.com/scipy/scipy/pull/6375#issuecomment-233080752 that suggests this issue was known at the time.

alejandroschuler commented 4 years ago

@pvanmulbregt thanks for looking into it! I didn't assume this was a bug per se, but I wanted to point out that regardless of the reason, the behavior of .rvs is inconsistent in this case in a way that makes building generic packages on top of scipy.stats difficult. It's especially weird now that I know that .pmf actually does support broadcasting! :)

ev-br commented 4 years ago

@alejandroschuler would you be interested in submitting a PR with a fix?

andyfaff commented 4 years ago

FYI: in one of the examples presented np.random.Generator appears to broadcast. Is this the correct behaviour?

>>> rg = np.random.default_rng()
>>> rg.multinomial(n=np.array([1, 2]), pvals=np.array([[0, 0, 1], [1, 0, 0]]))
array([[0, 1],
       [0, 2]])
pvanmulbregt commented 4 years ago

Upon further investigation, multinomial is not taking the first row. When passed an array of shape (m,p), it decides the number of possible values is m (not p), and takes the first (m-1) elements of the array and uses those. This selection may cross rows. E.g.

In[86]: pmulti = np.array([[0,1,0],[0.5,0,0.5],[0,0,1], [0.1, 0.1, 0.8], [0.2, 0.2, 0.6]])
In [87]: pmulti                                                                                                                                                                                                     
Out[87]: 
array([[0. , 1. , 0. ],
       [0.5, 0. , 0.5],
       [0. , 0. , 1. ],
       [0.1, 0.1, 0.8],
       [0.2, 0.2, 0.6]])

In [88]: np.shape(pmulti)                                                                                                                                                                                           
Out[88]: (5, 3)

In [89]: rng.multinomial(20, pmulti)                                                                                                                                                                                
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-85-541ce7a1994a> in <module>
----> 1 rng.multinomial(20, pmulti)

generator.pyx in numpy.random.generator.Generator.multinomial()

ValueError: sum(pvals[:-1]) > 1.0

The error occurs because it treated the (5,3) array as an array of length 5, and used 3 values from row 0, and 2 values from row 1. It summed the first 4 elements of [0, 1, 0, 0.5, 0, 0.5, ..., 0.2, 0.2, 0.6] to get 1.5 which is bigger than 1, so raised an Exception.

alejandroschuler commented 4 years ago

@ev-br I'm not at all familiar with the scipy.stats codebase so I'm probably not the right person. It looks like another issue has been raised (https://github.com/numpy/numpy/issues/15492) that's more specific so hopefully folks will narrow in on the issue and resolve it there.

The reason I bring this up is because I'm one of the authors of the newly-popular NGBoost package. Right now, the process for developing a new distribution to add to NGBoost is not as user-friendly as I'd like it to be. In my ideal world, it would be as easy as pointing to an implementation in scipy.stats, but I can't quite get there because of some inconsistencies and my unfamiliarity with some necessary internals.

By any chance do you know if there someone who is an involved contributor to scipy.stats that might be willing to email or chat with me about a tighter integration between NGBoost and scipy.stats?