raphaelvallat / pingouin

Statistical package in Python based on Pandas
https://pingouin-stats.org/
GNU General Public License v3.0

add a "sample_weights" parameter to pingouin.chi2_independence #213

Closed jeanbaptisteb closed 2 years ago

jeanbaptisteb commented 2 years ago

Hi,

It would be great to add the sample weights as a parameter to https://pingouin-stats.org/generated/pingouin.chi2_independence.html.

Currently, it's possible to use a workaround if the sample weights are integers (1, 2, 3, etc.), by simply duplicating rows based on their weight values (see https://stackoverflow.com/questions/32792263/duplicate-row-based-on-value-in-different-column for an example), and then using pingouin.chi2_independence() on the newly generated dataframe.
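As a minimal sketch of that integer-weight workaround (the toy data and the column names `var1`, `var2` and `weight` are placeholders, not taken from a real dataset), the rows can be repeated with `Index.repeat`:

```python
import pandas as pd

# Toy data; "var1", "var2" and "weight" are placeholder column names.
df = pd.DataFrame({
    "var1": ["a", "a", "b", "b"],
    "var2": ["x", "y", "x", "y"],
    "weight": [3, 1, 2, 4],  # integer weights only
})

# Repeat each row `weight` times, then drop the now-redundant weight column.
expanded = (
    df.loc[df.index.repeat(df["weight"])]
    .drop(columns="weight")
    .reset_index(drop=True)
)

# Counting rows in the expanded frame gives the same table as summing weights.
counts = pd.crosstab(expanded["var1"], expanded["var2"])
weighted = pd.crosstab(df["var1"], df["var2"], values=df["weight"], aggfunc="sum")
```

The `expanded` frame can then be passed on, e.g. `pingouin.chi2_independence(expanded, x="var1", y="var2")`.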

However, this workaround doesn't work if the weights are floats (e.g. 0.7, 2.8, 13.1). You could multiply the weights by 10 or 100 to turn them into integers, but this can quickly create memory issues and, more importantly, it artificially inflates the sample size and therefore the statistical power of the test. So that's probably not a sensible solution.
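To illustrate the effect on the test, here is a small sketch with a made-up 2x2 table (the numbers are arbitrary). Without the continuity correction, the Pearson chi-square statistic scales linearly with the cell totals, so multiplying every weight by 10 multiplies the statistic by 10 and shrinks the p-value, even though the proportions are unchanged:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Arbitrary weighted cell totals, for illustration only.
table = np.array([[12.0, 8.0], [9.0, 11.0]])

# Same proportions, but every weight multiplied by 10.
chi2_1, p_1, dof, _ = chi2_contingency(table, correction=False)
chi2_10, p_10, _, _ = chi2_contingency(table * 10, correction=False)

print(f"original: chi2={chi2_1:.3f}, p={p_1:.4f}")
print(f"x10:      chi2={chi2_10:.3f}, p={p_10:.4f}")
```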

Scipy's chi2_contingency and pandas' crosstab functions can easily deal with floats in contingency tables, so currently I use the following workflow to compute the chi2 and Cramer's V values, when I have to deal with non-integer weights:

import pandas
from scipy.stats import chi2_contingency

# Weighted contingency table: sum the weights within each (var1, var2) cell.
# `df` holds the columns "var1", "var2" and "weight".
crosstab = pandas.crosstab(
    df["var1"],
    df["var2"],
    values=df["weight"],
    aggfunc="sum",
)
chi2, p, dof, expected = chi2_contingency(crosstab)
V = cramersV(crosstab.values, bias_correction=True)  # cramersV() is a custom function

I'd love to simplify this workflow without having to rely on my own custom functions.
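The custom `cramersV()` function is not shown in this thread. A hypothetical stand-in might look like the sketch below; the name `cramers_v` is made up for illustration, and using the Bergsma (2013) bias correction is an assumption. With non-integer weights, `n` is the sum of the weights rather than a count of respondents, so the result should be read with care.

```python
import numpy as np
from scipy.stats import chi2_contingency


def cramers_v(table, bias_correction=True):
    """Cramér's V for an r x c contingency table (hypothetical helper)."""
    table = np.asarray(table, dtype=float)
    n = table.sum()  # total count, or total weight for a weighted table
    r, c = table.shape

    # Pearson chi-square without Yates' continuity correction.
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    phi2 = chi2 / n

    if bias_correction:
        # Bias correction following Bergsma (2013): shrink phi^2 and the
        # effective table dimensions, truncating phi^2 at zero.
        phi2 = max(0.0, phi2 - (r - 1) * (c - 1) / (n - 1))
        r = r - (r - 1) ** 2 / (n - 1)
        c = c - (c - 1) ** 2 / (n - 1)

    return float(np.sqrt(phi2 / min(r - 1, c - 1)))
```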

Thanks.

Great library by the way, it's a nice alternative to statsmodels.

jeanbaptisteb commented 2 years ago

@raphaelvallat As an additional comment, if you have nothing in principle against adding this feature, I'll have a look at the code and try to make a pull request.

raphaelvallat commented 2 years ago

Hi @jeanbaptisteb!

Sorry for the delayed response. This sounds great to me, please do feel free to work on a PR. Do you know any R / Matlab / JASP / SPSS implementation that we could compare Pingouin against? Second, how will this affect the calculation of the Cramer's V and statistical power of the test?

Thanks, Raphael

jeanbaptisteb commented 2 years ago

Hi @raphaelvallat. I dug a bit to see if there was an implementation of this in R. There is at least one, in the survey package, as described here: https://www.displayr.com/the-correct-treatment-of-sampling-weights-in-statistical-tests/ The R implementation is quite straightforward to use; it looks something like this:

svychisq(~var1 + var2, design = svydesign(id = ~1, weights = ~wgt1, data = df))

However, I'm not super familiar with working with weighted surveys, and it turns out that calculating the chi-square statistic with sample weights is more complicated than I initially thought. According to the link above, when working with weighted samples, you have to apply a correction to the chi-square statistic, called the Rao-Scott correction (some gory mathematical details here or here).
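For orientation, the first-order form of the correction is usually written along these lines. This is a sketch of the standard textbook presentation, not something taken from the links above, and the notation is an assumption:

```latex
X^2_{RS} = \frac{X^2}{\hat{\delta}_{\cdot}},
\qquad
\hat{\delta}_{\cdot} = \frac{1}{\nu} \sum_{i=1}^{\nu} \hat{\delta}_i,
\qquad
\nu = (r - 1)(c - 1)
```

Here $X^2$ is the Pearson statistic computed from the weighted $r \times c$ table, the $\hat{\delta}_i$ are the estimated generalized design effects, and $X^2_{RS}$ is compared against a chi-square distribution with $\nu$ degrees of freedom.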

So the Python code I gave in my initial comment is certainly the wrong way of doing it, and implementing this in Python might not be as easy as I thought. I also have to dig a bit to see the impact that the Rao-Scott correction certainly has on the Cramer's V statistic.

I'll try to work on an implementation when I have enough time on my hands (not before February, I'm afraid). I'm not sure whether implementing this kind of thing is within the scope of Pingouin (in particular if you want to keep things simple), but it would be valuable if a Python package somewhere offered the Rao-Scott correction for weighted surveys (neither statsmodels nor scipy seems to have it).

raphaelvallat commented 2 years ago

Hi @jeanbaptisteb,

Thanks for the detailed explanation! I agree that this might be a little bit too specific for Pingouin, especially if it requires creating a whole new function (and thus cannot be combined with the pg.chi2_independence function).

Keeping this issue open for now in case you want to work on a draft implementation later, but feel free to close anytime.

Thanks, Raphael

jeanbaptisteb commented 2 years ago

Ok, thanks, I'll check this after the winter holidays!

Just as additional info (as a reminder to myself or for other interested people), I stumbled upon the samplics Python package, which implements the Rao-Scott correction for calculating the chi-square statistic (see the "Two-way tabulation (cross-tabulation)" section of the notebook). However, it doesn't include a corrected version of the Cramer's V statistic.

I (superficially) searched online stats Q&A websites and scholarly publications to see how to proceed (e.g. here, here, or here for an example of a scholarly paper reporting corrected chi2 tests alongside Cramer's V, p. 61). To me, it's unclear whether it actually makes sense to use the corrected chi2 to compute the Cramer's V statistic.

raphaelvallat commented 2 years ago

Closing this but feel free to reopen!