t0mst0ne closed this 5 years ago
Thanks for doing some legwork here! It seems that scipy.stats is suboptimal for several reasons. First, it doesn't compute confidence intervals for most (any?) of the tests using the built-in functions. Second, stats.linregress does not seem to work for categorical predictors.
I'm currently looking into statsmodels, which uses patsy, which is really just the R formula syntax. This may cover all the functions that we want, but because of the large similarity to R, I'm not sure whether a Python version adds enough to offset the cost of writing it. For example, the cheat sheet would be virtually the same.
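For reference, here is a minimal sketch of what the statsmodels route could look like. The data frame, the column names (y, x, group) and the simulated effect sizes are made up for illustration; only the formula interface and conf_int() are the point. It fits a model with one continuous and one categorical predictor and asks for 95% confidence intervals, the two things scipy.stats was missing above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (hypothetical columns): one continuous and one categorical predictor.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 20),
    "x": rng.normal(size=60),
})
df["y"] = (
    1.0
    + 0.5 * df["x"]
    + 0.8 * (df["group"] == "b")
    + rng.normal(scale=0.5, size=60)
)

# R-style formula; C(group) dummy-codes the categorical predictor.
model = smf.ols("y ~ 1 + x + C(group)", data=df).fit()

# 95% confidence intervals for every coefficient (columns 0 and 1 are lower/upper).
ci = model.conf_int(alpha=0.05)

print(model.params)
print(ci)
```

The formula string is essentially what one would pass to lm() in R, which is why the cheat sheet would look nearly identical.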
But I would be VERY happy if someone did it anyway! Whether in this repo (with full credit, of course) or somewhere else. What do you think?
@lindeloev @t0mst0ne I'd be interested in taking this on!
I'm planning on porting this to Python as a side-project, purely for my own education. However, I think there's also value in having a Python port of this project: it will increase the audience of this post, and if nothing else, it will highlight the shortcomings with the Python statistical environment (at least compared to R).
I think it makes most sense to have the port in a separate repo: this resource is already very long, and mixing the code would make the exposition much messier. I've already started work here: https://github.com/eigenfoo/tests-as-linear, and the result is hosted on my blog here: https://eigenfoo.xyz/tests-as-linear. Obviously, still a work in progress!
@lindeloev if you don't mind, I am sure I will eventually have some questions or need some feedback. Would you be willing to take a look at the port when the time comes?
I'm thrilled to see this, @eigenfoo! Yes, I'd be very happy to give some feedback on your version. Looks like you're off to a good start. I am on holiday most of July, but until then (and afterwards), just mention me in a comment, and I'll take a look.
Thanks very much for the R code and the explanation of the GLM! I think it's pretty cool to let people understand all those statistics in the GLM way. Though the R code is easy and clear, is it possible to add Python code for reference?
Below I list those I can find (mostly scipy), but the syntax is NOT as beautiful as R ...
Y ~ continuous x
[P] : Pearson correlation : scipy.stats.pearsonr(x, y)
https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pearsonr.html
[N] Spearman correlation : scipy.stats.spearmanr
https://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html
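A small sketch of how the two correlation entries above relate to a linear model, using only scipy. The data are simulated and the variable names are just for illustration. The idea, as I understand it, is that Pearson is the simple regression y ~ x, and Spearman is the same thing computed on ranks.

```python
import numpy as np
from scipy import stats

# Simulated data, for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.6 * x + rng.normal(size=50)

# Pearson correlation vs. simple linear regression of y on x.
r, p = stats.pearsonr(x, y)
fit = stats.linregress(x, y)

# Spearman correlation vs. Pearson correlation on rank-transformed data.
rho, p_rho = stats.spearmanr(x, y)
r_ranks, p_ranks = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))

print(r, fit.rvalue)    # same correlation coefficient
print(p, fit.pvalue)    # same p-value for the slope
print(rho, r_ranks)     # Spearman's rho equals Pearson's r on ranks
```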
Y ~ discrete x
[P] Two-sample t test : scipy.stats.ttest_ind
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
[P] Welch's t-test
Need DIY: https://pythonfordatascience.org/welch-t-test-python-pandas/
[N] Mann-Whitney rank test : scipy.stats.mannwhitneyu
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html
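As a possible alternative to a hand-rolled Welch test: scipy.stats.ttest_ind accepts equal_var=False, which performs Welch's t-test. The sketch below (simulated data, made-up variable names) shows both variants, and also that the equal-variance t-test matches a regression on a 0/1 dummy variable. So stats.linregress can handle a two-level categorical predictor if it is dummy-coded by hand.

```python
import numpy as np
from scipy import stats

# Two simulated groups with unequal sizes and variances (illustration only).
rng = np.random.default_rng(1)
a = rng.normal(loc=0.0, scale=1.0, size=30)
b = rng.normal(loc=0.5, scale=2.0, size=40)

# Student's t-test (pooled variance) and Welch's t-test (unequal variances).
student = stats.ttest_ind(a, b, equal_var=True)
welch = stats.ttest_ind(a, b, equal_var=False)

# The same Student's t-test as a linear model: y ~ 1 + is_b,
# where is_b is a hand-made dummy variable (0 for group a, 1 for group b).
y = np.concatenate([a, b])
is_b = np.concatenate([np.zeros(a.size), np.ones(b.size)])
fit = stats.linregress(is_b, y)

print(student.pvalue, fit.pvalue)   # identical p-values
print(fit.slope, b.mean() - a.mean())  # slope is the difference in group means
print(welch.pvalue)                 # differs, since variances are not pooled
```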
Multiple regression : lm(y ~ 1 + x1 + x2 + ...)
[P] : One-way ANOVA : scipy.stats.f_oneway
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html
[N] : Kruskal-Wallis : scipy.stats.kruskal
https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.stats.kruskal.html
[P] One-way ANCOVA : smf.ols(formula='y ~ a + b + c' , data=df).fit()
[P] Two-way ANOVA : smf.ols(formula='y ~ C(a)*C(b)', data=df).fit()
example https://pythonfordatascience.org/anova-2-way-n-way/
[N] Chi-squared test : scipy.stats.chi2_contingency
https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.chi2_contingency.html
example https://pythonfordatascience.org/chi-square-test-of-independence-python/
[N] Goodness of fit