statsmodels / statsmodels

Statsmodels: statistical modeling and econometrics in Python
http://www.statsmodels.org/devel/
BSD 3-Clause "New" or "Revised" License

dtypes, check for object arrays in endog #1210

Open josef-pkt opened 10 years ago

josef-pkt commented 10 years ago

patsy converts a string endog stored as an object array into dummy variables, which yields a 2-D endog.

Reported on the mailing list in a request for help with a textbook ANOVA example: https://groups.google.com/forum/#!topic/pystatsmodels/_rnHIUnx5dM
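To illustrate the conversion described above, here is a minimal sketch. It uses `pandas.get_dummies` as a stand-in for patsy's categorical coding (an assumption, made so the example needs only pandas and NumPy; patsy's own output may differ in column naming and dtype), and the data are made up.

```python
import numpy as np
import pandas as pd

# Made-up data: a binary response stored as strings (object dtype).
df = pd.DataFrame({"y": ["0", "1", "1", "0", "1"],
                   "x": [0.1, -0.4, 1.2, 0.7, -0.3]})

# Dummy-code the string column, one indicator column per level.
# pandas.get_dummies stands in for patsy here (an assumption).
endog = pd.get_dummies(df["y"]).to_numpy(dtype=float)

print(endog.shape)  # (5, 2): two columns instead of a 1-D response
```

A model that expects a 1-D response then receives a two-column array without any indication that the original column was a string.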

jseabold commented 10 years ago

Replication script. This isn't going to be a simple, one-line fix. We don't check anywhere that endog is 1d and we may want string endog for some models (logit/probit/etc.), though we don't handle it yet in from_formula. That's a separate issue. Punting to 0.7.

from statsmodels.formula.api import ols
import pandas as pd
import numpy as np

df = pd.DataFrame.from_dict(dict(y=np.random.randint(2, size=10),
                                 x=np.random.randn(10)))
df['y'] = df.y.astype(str)

ols('y ~ x', data=df).fit().params

jankatins commented 9 years ago

What's left to do here? After #2013, it seems that this results in an error, which looks fine?

josef-pkt commented 9 years ago

This is still open. The problem is that patsy converts the endog to 2d, which the models (e.g. OLS) cannot handle. (OLS is able to calculate the params but not most of the other results if we have multivariate endog.)

>>> res = ols('y ~ x', data=df).fit()
>>> res.model.endog
array([[ 0.,  1.],
       [ 1.,  0.],
       [ 0.,  1.],
       [ 0.,  1.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 0.,  1.],
       [ 1.,  0.],
       [ 1.,  0.]])
>>> print(res.summary())
C:\programs\WinPython-64bit-3.4.3.1\python-3.4.3.amd64\lib\site-packages\scipy\stats\stats.py:1233: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=10
  int(n))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "m:\josef_new\eclipse_ws\statsmodels\statsmodels_py34\statsmodels\regression\linear_model.py", line 2030, in summary
    top_right = [('R-squared:', ["%#8.3f" % self.rsquared]),
  File "m:\josef_new\eclipse_ws\statsmodels\statsmodels_py34\statsmodels\tools\decorators.py", line 97, in __get__
    _cachedval = self.fget(obj)
  File "m:\josef_new\eclipse_ws\statsmodels\statsmodels_py34\statsmodels\regression\linear_model.py", line 1234, in rsquared
    return 1 - self.ssr/self.centered_tss
  File "m:\josef_new\eclipse_ws\statsmodels\statsmodels_py34\statsmodels\tools\decorators.py", line 97, in __get__
    _cachedval = self.fget(obj)
  File "m:\josef_new\eclipse_ws\statsmodels\statsmodels_py34\statsmodels\regression\linear_model.py", line 1206, in ssr
    return np.dot(wresid, wresid)
ValueError: shapes (10,2) and (10,2) not aligned: 2 (dim 1) != 10 (dim 0)
>>> 
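
The final `ValueError` follows directly from the 2-D endog: the residuals inherit its shape, and `np.dot` applied to two `(10, 2)` arrays is treated as a matrix product whose inner dimensions do not match. A minimal NumPy-only sketch with made-up residuals (the name `wresid` is borrowed from the traceback; everything else is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# 1-D residuals: np.dot is the inner product, i.e. the sum of squares.
wresid_1d = rng.standard_normal(10)
ssr = np.dot(wresid_1d, wresid_1d)

# 2-D residuals, as produced by a two-column endog: np.dot attempts
# a (10, 2) by (10, 2) matrix product, which is not defined.
wresid_2d = rng.standard_normal((10, 2))
try:
    np.dot(wresid_2d, wresid_2d)
    failed = False
except ValueError as exc:
    failed = True
    print(exc)
```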

#2013 doesn't address object arrays that patsy converts to numeric, so our model.__init__ only ever sees the numeric values.

(Aside: #2013 might be too strict and may raise an unintended exception for Binomial and Multinomial. Need to check.)

The fix for the issue here needs to work around patsy converting an endog that it interprets as a categorical variable.
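
One possible shape for such a workaround, sketched under stated assumptions: inspect the endog column before it reaches patsy, then either convert numeric-looking strings or raise. The helper name `coerce_endog` is hypothetical and is not part of statsmodels or patsy.

```python
import pandas as pd


def coerce_endog(data, name):
    """Return ``data`` with column ``name`` numeric.

    Hypothetical helper, not statsmodels API. A copy is made only when
    a conversion is needed. Raises ValueError if the column holds
    values that cannot be parsed as numbers.
    """
    col = data[name]
    if pd.api.types.is_numeric_dtype(col):
        return data
    try:
        converted = pd.to_numeric(col)
    except (ValueError, TypeError) as exc:
        raise ValueError(
            f"endog {name!r} has dtype {col.dtype} and cannot be "
            "converted to numeric"
        ) from exc
    out = data.copy()
    out[name] = converted
    return out


# Made-up data mirroring the replication script: y stored as strings.
df = pd.DataFrame({"y": ["0", "1", "1", "0"], "x": [0.5, 1.5, -0.2, 0.3]})
clean = coerce_endog(df, "y")
print(clean["y"].dtype)
```

The cleaned frame could then be passed on as `ols('y ~ x', data=clean)`. Whether silent conversion or an explicit error is preferable is a design question this sketch does not settle.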

jankatins commented 9 years ago

Shouldn't that then become a patsy parameter?

josef-pkt commented 9 years ago

https://github.com/pydata/patsy/issues/62

I talked with @njsmith at PyCon about a few issues where we would need more options, either to handle existing cases better or to make it possible to handle new models better.

(I haven't followed up on these yet, because I have too many things already on my priority list, and it's not my area when I just want to have fun with coding.)

jseabold commented 3 years ago

This is also now addressed by #6017. We have a very informative error message about what's going on.