Open josef-pkt opened 10 years ago
Replication script. This isn't going to be a simple, one-line fix. We don't check anywhere that endog is 1d and we may want string endog for some models (logit/probit/etc.), though we don't handle it yet in from_formula. That's a separate issue. Punting to 0.7.
from statsmodels.formula.api import ols
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict(dict(y=np.random.randint(2, size=10),
                                 x=np.random.randn(10)))
df['y'] = df.y.astype(str)
ols('y ~ x', data=df).fit().params
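For illustration only, the missing 1d check mentioned above could look roughly like the sketch below. `require_1d_endog` is a hypothetical helper name, not something that exists in statsmodels, and where such a check would belong in the model classes is left open here.

```python
import numpy as np


def require_1d_endog(endog):
    """Hypothetical helper: return endog as a 1d array or raise.

    Illustrative sketch only, not statsmodels code.
    """
    arr = np.asarray(endog)
    # Accept a single column passed as shape (n, 1) by dropping the axis.
    if arr.ndim == 2 and arr.shape[1] == 1:
        arr = arr[:, 0]
    if arr.ndim != 1:
        raise ValueError(
            "endog must be 1d, got shape %s; a string endog in a formula "
            "is expanded into one dummy column per level" % (arr.shape,)
        )
    return arr
```

Models that legitimately accept a 2d endog would have to opt out of a check like this.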
What's left to do here? After #2013, it seems that this results in an error, which looks fine?
This is still open. The problem is that patsy converts the endog to 2d, which the models (e.g. OLS) cannot handle. (OLS is able to calculate the params but not most of the other results if we have multivariate endog.)
>>> res = ols('y ~ x', data=df).fit()
>>> res.model.endog
array([[ 0.,  1.],
       [ 1.,  0.],
       [ 0.,  1.],
       [ 0.,  1.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 1.,  0.],
       [ 0.,  1.],
       [ 1.,  0.],
       [ 1.,  0.]])
>>> print(res.summary())
C:\programs\WinPython-64bit-3.4.3.1\python-3.4.3.amd64\lib\site-packages\scipy\stats\stats.py:1233: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=10
int(n))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "m:\josef_new\eclipse_ws\statsmodels\statsmodels_py34\statsmodels\regression\linear_model.py", line 2030, in summary
    top_right = [('R-squared:', ["%#8.3f" % self.rsquared]),
  File "m:\josef_new\eclipse_ws\statsmodels\statsmodels_py34\statsmodels\tools\decorators.py", line 97, in __get__
    _cachedval = self.fget(obj)
  File "m:\josef_new\eclipse_ws\statsmodels\statsmodels_py34\statsmodels\regression\linear_model.py", line 1234, in rsquared
    return 1 - self.ssr/self.centered_tss
  File "m:\josef_new\eclipse_ws\statsmodels\statsmodels_py34\statsmodels\tools\decorators.py", line 97, in __get__
    _cachedval = self.fget(obj)
  File "m:\josef_new\eclipse_ws\statsmodels\statsmodels_py34\statsmodels\regression\linear_model.py", line 1206, in ssr
    return np.dot(wresid, wresid)
ValueError: shapes (10,2) and (10,2) not aligned: 2 (dim 1) != 10 (dim 0)
>>>
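The ValueError in the traceback can be reproduced with numpy alone. In this small sketch the arrays are made-up stand-ins for `wresid` (assumed values, not taken from the model above); it shows that `np.dot(wresid, wresid)` is an inner product for 1d residuals but a misaligned matrix product once the residuals have two columns.

```python
import numpy as np

# Stand-ins for wresid; the values are arbitrary and only the shapes matter.
wresid_1d = np.arange(10.0)      # shape (10,), as with a 1d endog
wresid_2d = np.ones((10, 2))     # shape (10, 2), as with a two-column endog

# 1d case: np.dot is the inner product, i.e. the sum of squared residuals.
ssr_1d = np.dot(wresid_1d, wresid_1d)

# 2d case: np.dot attempts (10, 2) times (10, 2), whose inner dimensions
# 2 and 10 do not match, so numpy raises ValueError.
try:
    np.dot(wresid_2d, wresid_2d)
    failed = False
except ValueError:
    failed = True
```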
model.__init__ only gets the numeric values. (Aside: #2013 might be too strong and raise an unintended exception for Binomial and Multinomial. Need to check.)
The fix for the issue here needs to work around patsy's conversion of an endog that it interprets as a categorical variable.
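Independent of any fix inside statsmodels, a user-side workaround is to turn the string column back into numbers before building the formula. This is a minimal sketch, assuming a binary endog stored as the strings "0" and "1". The data values are made up for illustration, and the `ols` call is only indicated in a comment.

```python
import numpy as np
import pandas as pd

# Made-up data in the same layout as the replication script.
df = pd.DataFrame({"y": ["0", "1", "1", "0"],
                   "x": [0.5, -1.2, 0.3, 2.0]})

# Undo the astype(str) step so that the left-hand side is numeric again
# and is no longer treated as a categorical variable.
df["y"] = df["y"].astype(int)

# The endog column is now an ordinary 1d numeric array.
endog = np.asarray(df["y"])

# It can then be used as in the replication script, e.g.
# ols('y ~ x', data=df).fit()
```

This only applies when the string labels really encode numbers. For true string labels a model-specific mapping would be needed.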
Shouldn't that then become a patsy parameter?
https://github.com/pydata/patsy/issues/62
I talked with @njsmith at PyCon about a few issues where we would need more options, either to handle existing cases better or to support new models better.
(I haven't followed up on these yet, because I have too many things already on my priority list, and it's not my area when I just want to have fun with coding.)
This is also now addressed by #6017. We have a very informative error message about what's going on.
patsy converts a string (object array) endog into dummy variables, which makes the endog 2d.
Reported on the mailing list in a request for help with a textbook ANOVA example: https://groups.google.com/forum/#!topic/pystatsmodels/_rnHIUnx5dM
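To make the dummy-variable conversion concrete, here is a numpy-only sketch of one-column-per-level coding of a string array. It imitates the shape of the result shown earlier in the thread and is not patsy's implementation. The sample labels are made up.

```python
import numpy as np

# Made-up string endog, stored as an object array.
y = np.array(["0", "1", "1", "0", "1"], dtype=object)

# Find the distinct levels and, for each observation, the index of its level.
levels, codes = np.unique(y, return_inverse=True)

# One indicator column per level: row i has a 1 in the column of y[i]'s level.
dummies = np.eye(len(levels))[codes]

# A 1d array of 5 labels has become a (5, 2) array of indicators.
print(dummies.shape)
```

With two levels, each row is either `[1., 0.]` or `[0., 1.]`, which is the pattern visible in `res.model.endog` above.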