raphaelvallat / pingouin

Statistical package in Python based on Pandas
https://pingouin-stats.org/
GNU General Public License v3.0

Update pairwise.py to function with Pandas >= 2.0.0 #394

Closed Cortexan closed 9 months ago

Cortexan commented 9 months ago

In the current version of pairwise.py, pandas.core.groupby.DataFrameGroupBy.mean is utilized with default parameters:

# on line 471:
        stats = pd.DataFrame()
        for i, f in enumerate(factors):
            # Introduced in Pingouin v0.3.2
            # Note that it only has an impact in the between test of mixed
            # designs. Indeed, a similar groupby is applied by default on
            # each within-subject factor of a two-way repeated measures design.
            if all([agg[i], marginal]):
                tmp = data.groupby([subject, f], as_index=False, observed=True, sort=True).mean()
# ...

However, the defaults of pandas.core.groupby.DataFrameGroupBy.mean changed in Pandas 2.0.0: 'numeric_only' now defaults to 'False', so non-numeric columns are no longer dropped silently before the mean is computed.

If Pandas >= 2.0.0 is used, this results in dtype errors when applying pingouin.pairwise_tests to a pandas.DataFrame that contains non-numeric columns (for example, strings).
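The behaviour change can be seen without Pingouin. The snippet below is a minimal sketch on a made-up three-column frame (the column names and values are illustrative only). It assumes that the failure on Pandas >= 2.0.0 surfaces as a TypeError; on older versions the default call is expected to succeed and drop the string column.

```python
import pandas as pd

# Illustrative data: one grouping key, one string column, one numeric column.
df = pd.DataFrame({
    "subj": [1, 1, 2, 2],
    "strings": ["a", "b", "a", "b"],
    "score": [20.0, 30.0, 40.0, 50.0],
})

grouped = df.groupby("subj", as_index=False, observed=True, sort=True)

# Default call: behaviour depends on the installed Pandas version.
try:
    default_result = grouped.mean()
    default_error = None
except TypeError as exc:  # assumption: Pandas >= 2.0.0 raises TypeError here
    default_result = None
    default_error = exc

# Explicit call: the string column is excluded before averaging.
numeric_result = grouped.mean(numeric_only=True)

print("pandas version:", pd.__version__)
print("default call failed:", default_error is not None)
print(numeric_result)
```

With 'numeric_only=True' the result should contain only the grouping key and the numeric column, whichever Pandas version is installed.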

This is solved by setting the 'numeric_only' parameter of pandas.core.groupby.DataFrameGroupBy.mean to 'True':

# on line 471:
        stats = pd.DataFrame()
        for i, f in enumerate(factors):
            # Introduced in Pingouin v0.3.2
            # Note that it only has an impact in the between test of mixed
            # designs. Indeed, a similar groupby is applied by default on
            # each within-subject factor of a two-way repeated measures design.
            if all([agg[i], marginal]):
                tmp = data.groupby([subject, f], as_index=False, observed=True, sort=True).mean(numeric_only=True)
# ...
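For comparison, here is a rough sketch of the patched aggregation next to one possible alternative: selecting the dependent variable explicitly before averaging. The helper name 'marginal_means' and the toy data are hypothetical and are not part of Pingouin. This is only meant to show that both routes give the same per-subject means for the score column.

```python
import pandas as pd


def marginal_means(data, subject, factor, dv):
    """Hypothetical helper (not Pingouin code): average per subject and factor level."""
    grouped = data.groupby([subject, factor], as_index=False, observed=True, sort=True)
    # Option A: mirrors the patched line; every non-numeric column is dropped.
    patched = grouped.mean(numeric_only=True)
    # Option B: average only the dependent variable, chosen explicitly.
    explicit = grouped[[dv]].mean()
    return patched, explicit


# Toy data: 2 subjects x 2 'strings' levels x 2 'ints' levels.
toy = pd.DataFrame({
    "subj": [1, 1, 1, 1, 2, 2, 2, 2],
    "strings": ["a", "a", "b", "b"] * 2,
    "ints": [1, 2, 1, 2] * 2,
    "score": [20, 22, 30, 32, 40, 42, 50, 52],
})

patched, explicit = marginal_means(toy, subject="subj", factor="ints", dv="score")
print(patched)
print(explicit)
```

Option A keeps the change to a single keyword argument, which is what the issue proposes. Option B makes the averaged column explicit, but the extra columns that Option A carries along would no longer be present.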

To reproduce the error, run the snippet below first with Pandas <= 1.5.3, then with Pandas >= 2.0.0:

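import random

import pandas as pd
import pingouin as pg

# Toy repeated-measures data: 4 subjects x 2 'strings' levels x 2 'ints' levels.
test = pd.DataFrame()
test['subj'] = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4]
test['strings'] = ['a', 'a', 'b', 'b'] * 4  # non-numeric within-subject factor
test['string_ints'] = [0, 0, 1, 1] * 4
test['ints'] = [1, 2, 1, 2] * 4
# Scores are random; the error does not depend on their values.
test['score'] = [random.randint(20, 40) for _ in range(16)]

# Fails on Pandas >= 2.0.0 because the groupby mean hits the 'strings' column.
t_test = pg.pairwise_tests(data=test,
                           dv='score',
                           subject='subj',
                           within=['strings', 'ints'])

Cortexan commented 9 months ago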

I notice now that you've already made similar changes to all DataFrame.corr() and DataFrame.cov() calls, so this is the same root issue, and it also affects calls to DataFrame.mean() and DataFrame.groupby(...).mean().
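As a quick illustration of that shared pattern, the sketch below passes 'numeric_only=True' to DataFrame.mean(), DataFrame.corr() and DataFrame.cov() on a made-up frame with one string column. It assumes a Pandas version in which corr() and cov() accept this keyword (1.5 or later, as far as I know).

```python
import pandas as pd

# Illustrative frame: one string column next to two numeric columns.
df = pd.DataFrame({
    "strings": ["a", "b", "a", "b"],
    "ints": [1, 2, 3, 4],
    "score": [2.0, 4.0, 6.0, 8.0],
})

# Each call skips the string column instead of trying to convert it.
means = df.mean(numeric_only=True)
corr = df.corr(numeric_only=True)
cov = df.cov(numeric_only=True)

print(means)
print(corr)
print(cov)
```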

raphaelvallat commented 9 months ago

Thank you! Indeed this needs to be updated as well. Please feel free to submit a PR if you'd like.