raphaelvallat / pingouin

Statistical package in Python based on Pandas
https://pingouin-stats.org/
GNU General Public License v3.0
1.61k stars 138 forks source link

Two-Way-ANOVA results do not show p-values. #418

Closed FriedrichLidenbach closed 1 month ago

FriedrichLidenbach commented 5 months ago

Sup. I was trying to perform a Two Way Anova on this very simple .CSV file:

block,color,response
a,red,1.9
b,red,2.6
c,red,3.4
(And so on...)
v,red,4.5
w,red,3.4
x,red,1.1
a,green,9.1
b,green,6.4
c,green,1.2
(And so on...)
v,green,17
w,green,9.8
x,green,6.8
a,blue,7.6
b,blue,5.6
c,blue,14
(And so on...)
v,blue,19
w,blue,9
x,blue,6.8

Of course, the (And so on...) entries are mine for brevity and not in the actual file. I can provide the entire file if needed, althought the values follow the pattern shown above. It is saved in this directory: '/home/f/Downloads/color-anova-example.csv . I tried:

import pingouin as pg
import pandas as pd
df = pd.read_csv('/home/f/Downloads/color-anova-example.csv')
anova = pg.anova(dv='response', between=['color', 'block'], data=df)
print(anova)

And got:

/home/f/.local/lib/python3.10/site-packages/pingouin/parametric.py:1093: RuntimeWarning: invalid value encountered in scalar divide
  ms_resid = ss_resid / df_resid

           Source           SS                   DF          MS                 np2
0         color               857.199103     2            428.599551   1.0
1         block              1481.825128   23           64.427179     1.0
2         color * block   514.571297     46           11.186333     1.0
3         Residual         0.000000         0             NaN              NaN

So, no p-unc and that weird warning. I thought maybe there was something wrong with the file, but then I tried running it on Rstudio and using statsmodels and it worked just fine:

R-Studio: Code

data = read.csv('/home/f/Downloads/color-anova-example.csv')
modelo = lm(data$response ~ data$block + data$color)
anova = aov(modelo)
summary(anova)

R-Studio result:

> summary(anova)
            Df Sum Sq Mean Sq F value   Pr(>F)    
data$block  23 1481.8    64.4   5.759 2.42e-07 ***
data$color   2  857.2   428.6  38.315 1.61e-10 ***
Residuals   46  514.6    11.2                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Statsmodels code:

import pingouin as pg
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv('/home/f/Downloads/color-anova-example.csv')
model = ols('response ~ C(color) + C(block)', data=df).fit()
anova = sm.stats.anova_lm(model, typ=2)
print(anova)

Statsmodels result:

               sum_sq    df          F        PR(>F)
C(color)   857.199103   2.0  38.314573  1.606109e-10
C(block)  1481.825128  23.0   5.759455  2.415486e-07
Residual   514.571297  46.0        NaN           NaN

[Process exited 0]

What is perplexing is that the ANOVA result doesn't show any errors, a NAN or a nonsensical value, it just omits the p-unc column all together. Weirder is that this problem only presents itself on the Two Way ANOVA, if I remove either of the factors and do a One Way ANOVA, it works just fine:

Code for Pingouin One Way ANOVA:

import pingouin as pg
import pandas as pd
df = pd.read_csv('/home/f/Downloads/color-anova-example.csv')
anova = pg.anova(dv='response', between=['color'], data=df)
print(anova)

Result for Pingouin One Way ANOVA

  Source  ddof1  ddof2          F     p-unc       np2
0  color      2     69  14.813375  0.000004  0.300393

[Process exited 0]

I am using Pingouin 0.5.4 installed trough pip on Linux Mint 21.3 and Kernel 6.5.0-27, on a Dell Latitude Laptop using with Python version 3.10.12.

Besides the error I just wanted to thank you guys fro your work on the project. Pingouin is by far my favorite statistics library. Thanks!

raphaelvallat commented 5 months ago

Hi,

Besides the error I just wanted to thank you guys fro your work on the project. Pingouin is by far my favorite statistics library.

Thanks for the kind words, appreciate it! I only have time to do basic maintenance and Q&A these days unfortunately, so the library hasn't really evolved or improved in two years.

That is such a weird error — I've never seen it. Could you actually share a more complete dataset here? I'd like to see if I can fully reproduce your error. Thanks

FriedrichLidenbach commented 5 months ago

Sure, it is actually a public dataset available in this link:

https://researchguides.library.vanderbilt.edu/ld.php?content_id=19195103

From this site:

https://researchguides.library.vanderbilt.edu/c.php?g=156859&p=1171834

And just in case, this are the complete contents of the file:

block,color,response a,red,1.9 b,red,2.6 c,red,3.4 d,red,0.8 e,red,5.3 f,red,1.5 g,red,4.5 h,red,2.6 i,red,1.16 j,red,1.3 k,red,2 l,red,2 m,red,6.7 n,red,1.44 o,red,2 p,red,2 q,red,1 r,red,1 s,red,1.5 t,red,2 u,red,4.1 v,red,4.5 w,red,3.4 x,red,1.1 a,green,9.1 b,green,6.4 c,green,1.2 d,green,5.7 e,green,17.7 f,green,6.4 g,green,16.6 h,green,8.3 i,green,4.9 j,green,1 k,green,1 l,green,3 m,green,23 n,green,6.13 o,green,9 p,green,2 q,green,4 r,green,4.5 s,green,10 t,green,4 u,green,27.2 v,green,17 w,green,9.8 x,green,6.8 a,blue,7.6 b,blue,5.6 c,blue,14 d,blue,6.8 e,blue,18.5 f,blue,7.2 g,blue,19.5 h,blue,10.5 i,blue,5.27 j,blue,6 k,blue,8 l,blue,7.5 m,blue,23 n,blue,5.8 o,blue,11 p,blue,9 q,blue,6 r,blue,6 s,blue,9.5 t,blue,8 u,blue,25.6 v,blue,19 w,blue,9 x,blue,6.8

raphaelvallat commented 4 months ago

Hey @FriedrichLidenbach

Apologies for the slow reply. The issue comes from the fact that the two-way ANOVA includes an interaction term (color * block) which cannot be calculated in this dataset since there is only one observation for each level of the interaction term (e.g. the combination (a, blue) only has one unique value).

The R and statsmodels code that you provided are additive linear regression models that do not match the standard definition of the two-way factorial anova (see here and here)

Changing the statsmodels code to a factorial model (* instead of +) leads to an error:

image

Thanks, Raphael