Open adamrossnelson opened 3 weeks ago
Thanks for the request. There are many statistical tests, and for many tests, there are also many variations of that test. It seems to me it would not be maintainable for pandas to add statistical tests to its API; rather, pandas should provide the functionality that allows users or third-party packages to implement their own tests. As such, I'm negative on adding this.
However if there are operations that would make implementing statistical tests easier / more performant, I think it could be considered.
# Use the nanchi2 function from _libs.algos for efficient chi-square calculation
chi2_matrix = nanchi2(data, max_categories=max_categories, output=output)
Just to be sure, this function does not yet exist and would also need to be added. I did not see it in your implementation above.
I hope to save this issue from closure. A .chi2() method seems like a natural extension to the already available .corr() method. The analyses available through the .corr() method are rudimentary and among the most fundamental statistical analyses across all of statistics, and .corr() is heavily relied upon by many scientific and analytical professionals.
Before the .corr() method became an important method on the Series and DataFrame objects in Pandas we could have objected to its inclusion also. Multiple implementations... multiple variations... etc. Today though, it seems inconceivable that Pandas shouldn't include a .corr() method.
Similarly, chi2 analysis is also a widely utilized and fundamental statistical analysis. While Pandas excels at providing analytical options for continuous variables, it has room for growth with regard to categorical variables. Not including a .chi2() method seems like an oversight and/or a missed opportunity. Just as .corr() provides a first-pass look at relationships between continuous variables, .chi2() would offer an equivalent for categorical data. This consistency aligns with Pandas’ goal of providing a comprehensive exploratory data analysis toolkit.
I see your point. Pandas probably can't, and arguably shouldn't, strive to provide every conceivable statistical analysis. At present there are only a few statistical methods beyond .mean(), .std(), etc. The available methods (once again focused on continuous data) include, for example, .skew(), .kurt(), and .sem(). And they're invaluable.
We can also read in the Pandas documentation that the goal of the project is to "becom[e] the most powerful and flexible open source data analysis/manipulation tool available in any language" (cite). For these reasons I sincerely hope that there may be room for further discussion here. Even the cousins to .corr(), such as .cov() and .corrwith(), are for continuous data. Useful. But they do very little for folks who need or want a quick look at how or if categorical columns may be related.
Also - the code I proposed doesn't provide a full solution. It is a proposed starting point. So if this idea does proceed the code would need additional review.
As such, I hope there may be further discussion and review of this suggestion.
A .chi2() method seems like a natural extension to the already available .corr() method.
At what point does this line of thinking end?
Similarly chi2 analysis is also a widely utilized and fundamental statistical analysis.
I think this is not the metric that should be utilized when determining whether a method should be included in pandas.
I've been thinking about these questions for the past few weeks. I'm not sure I have good answers.
I'm not a maintainer. So the decision to move forward with this proposal is not mine. In the spirit of added conversation and deliberation I would ask if there is any history on how, why, or when Pandas added the .corr() method? If a .chi2() method doesn't belong... how did the community decide the .corr() method does belong? All rhetorical I suppose.
My feelings won't be hurt if this idea gets shelved (for now, or even indefinitely). It also seems that the suggestion hasn't inspired comments from any others (either in support or against). Perhaps that lack of discussion means there is a lack of enthusiasm for the idea, which, on balance, weighs in favor of putting this on the "not right now" list.
Thanks to @rhshadrach for all the work in grooming this list of issues! Also to all the others who perform similar and contributing work on Pandas!
Feature Type
[X] Adding new functionality to pandas
[ ] Changing existing functionality in pandas
[ ] Removing existing functionality in pandas
Problem Description
Currently, Pandas does not offer a method for calculating pairwise chi-square tests between columns in a DataFrame or between two Series objects. Chi-square tests are useful for understanding associations between categorical variables. While correlation methods like .corr() serve to evaluate relationships among continuous data, there is no equivalent method for categorical data.
Researchers and data analysts who work with categorical data currently need to rely on external libraries or custom code to perform chi-square tests across columns in a DataFrame or between two Series.
Potential Benefits May Include
.corr(), the .chi2() method will feel intuitive.
Potential Use Cases + Target Users
Using the Titanic data, ideal output could be as follows:
Feature Description
Solution
A .chi2() method for both the DataFrame and Series classes would provide efficient and consistent code options for performing pairwise chi-square tests (and producing a correlation-matrix-like output that we could call a chi2-matrix):
DataFrame.chi2(): Performs pairwise chi-square tests for all categorical or integer columns within a DataFrame, returning a symmetric matrix similar to the DataFrame.corr() method. Users can choose to output either p-values or chi-square statistics, and an adjustable max_categories parameter limits the inclusion of columns with too many unique values.
Series.chi2(other_series): Performs a chi-square test between two Series objects. It returns either the p-value or chi-square statistic.
Both would have optional verbose modes to include degrees of freedom values in the output.
Potential Code
pandas/core/frame.py
Potential Code
pandas/core/series.py
Potential Code
doc/source/reference/api/pandas.DataFrame.chi2.rst
Potential Code
doc/source/reference/api/pandas.Series.chi2.rst
Potential Code
pandas/tests/frame/methods/test_chi2.py
Potential Code
pandas/tests/frame/methods/test_chi2.py
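To make the behavior described under Feature Description more concrete, here is a rough sketch written as a standalone function. It is not the code originally attached to this issue and not the contents of any of the files listed above. The name pairwise_chi2, its defaults, and the column-selection rule (non-float columns with between 2 and max_categories distinct values) are assumptions for illustration; the test itself is delegated to scipy.stats.chi2_contingency, and the diagonal is left as NaN.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def pairwise_chi2(df, max_categories=20, output="pvalue"):
    """Hypothetical stand-in for the proposed DataFrame.chi2()."""
    if output not in ("pvalue", "statistic"):
        raise ValueError("output must be 'pvalue' or 'statistic'")

    # Assumption: treat non-float columns with a small number of distinct
    # values as categorical; skip everything else.
    cols = [
        c
        for c in df.columns
        if not pd.api.types.is_float_dtype(df[c])
        and 2 <= df[c].nunique() <= max_categories
    ]

    # Symmetric result matrix; the diagonal is left as NaN in this sketch.
    result = pd.DataFrame(np.nan, index=cols, columns=cols)

    for a, b in combinations(cols, 2):
        table = pd.crosstab(df[a], df[b])  # observed counts for this pair
        stat, pvalue, dof, expected = chi2_contingency(table)
        value = pvalue if output == "pvalue" else stat
        result.loc[a, b] = result.loc[b, a] = value

    return result
```

Calling pairwise_chi2(df, output="statistic") would return the chi-square statistics instead of p-values. A verbose mode that also reports degrees of freedom is omitted here to keep the sketch short.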
Alternative Solutions
Currently, to perform chi-square tests on pairs of categorical columns in a DataFrame, users can rely on a combination of the following libraries and approaches:
Using Scipy’s chi2_contingency Function
chi2_contingency from scipy.stats and compute chi-square values using a contingency or pd.crosstab() for each pair of categorical columns.
Example:
pd.crosstab tables for each pair of columns (or doing so in a loop), making it cumbersome for pairwise analysis across multiple columns. It also lacks an optimized and integrated way to produce pairwise matrices directly within Pandas.
Other Third-Party Libraries:
seaborn or statsmodels facilitate chi-square tests and visualizations, which may weigh against implementing this in Pandas. However, the same can be said for correlation, which is available in many other libraries.
Built-in functionality would streamline categorical data analysis within Pandas, aligning with the goal of being a comprehensive tool for data manipulation and analysis.
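For reference, the single-pair workaround mentioned in this section (pd.crosstab combined with scipy.stats.chi2_contingency) might look roughly like the following. The column names and values are made-up toy data, used only to show the shape of the calls.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up toy data, for illustration only.
df = pd.DataFrame(
    {
        "group": ["a", "a", "b", "b", "a", "b", "a", "b"],
        "outcome": ["yes", "no", "yes", "no", "yes", "no", "yes", "no"],
    }
)

# Build the contingency table of observed counts for one pair of columns.
table = pd.crosstab(df["group"], df["outcome"])

# chi2_contingency returns the statistic, the p-value, the degrees of
# freedom, and the expected counts under independence.
stat, pvalue, dof, expected = chi2_contingency(table)
print(f"chi2={stat:.3f}, p={pvalue:.3f}, dof={dof}")
```

Repeating this for every pair of columns is the loop that the proposal aims to replace with a single method call.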
Additional Context
Searched for related issues, found none. However I may have missed them. Thanks to all in the world of Pandas for consideration, review, and efforts.