pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: A .chi2() method on the DataFrame and Series class that will resemble the .corr() methods #60111

Open adamrossnelson opened 3 weeks ago

adamrossnelson commented 3 weeks ago

Feature Type

Problem Description

Currently, Pandas does not offer a method for calculating pairwise chi-square tests between columns in a DataFrame or between two Series objects. Chi-square tests are useful for understanding associations between categorical variables. While correlation methods like .corr() evaluate relationships among continuous data, there is no equivalent method for categorical data.

Researchers and data analysts who work with categorical data currently need to rely on external libraries or custom code to perform chi-square tests across columns in a DataFrame or between two Series.

Potential Benefits May Include

  1. Swifter Categorical Data Analysis: Enable exploration of associations within categorical data directly within Pandas.
  2. Consistent API: By mimicking the structure and options of .corr(), the .chi2() method will feel intuitive.
  3. Enhanced Efficiency: Avoids the need to transfer data between Pandas and other libraries.
  4. Optimized for Large Datasets: Uses Cython to improve performance, making it feasible to compute pairwise chi-square tests even on large datasets.

Potential Use Cases + Target Users

Using the Titanic data, the ideal output could look as follows:

import pandas as pd
import numpy as np
import seaborn as sns

df = sns.load_dataset('titanic')
df.chi2()

              sex  embarked    class      who     deck
sex       0.00000   0.00126  0.00021  0.00000  0.00774
embarked  0.00126   0.00000  0.00000  0.00440  0.05592
class     0.00021   0.00000  0.00000  0.00000  0.00000
who       0.00000   0.00440  0.00000  0.00000  0.00003
deck      0.00774   0.05592  0.00000  0.00003  0.00000
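
A matrix of this shape can already be approximated in user code. The following is a minimal sketch, assuming SciPy is installed; `pairwise_chi2_pvalues` and the `toy` frame are hypothetical names used only for illustration, and the diagonal is left as NaN because the sketch does not test a column against itself.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def pairwise_chi2_pvalues(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: symmetric matrix of pairwise chi-square p-values."""
    cols = list(df.columns)
    # Start from an all-NaN matrix; only off-diagonal cells are filled in.
    out = pd.DataFrame(np.nan, index=cols, columns=cols, dtype=float)
    for a, b in combinations(cols, 2):
        # crosstab builds the contingency table; rows with NA in either
        # column are dropped by default.
        table = pd.crosstab(df[a], df[b])
        p_value = chi2_contingency(table)[1]
        out.loc[a, b] = p_value
        out.loc[b, a] = p_value
    return out


# Small illustrative frame (not the Titanic data).
toy = pd.DataFrame({
    "sex": ["m", "f", "m", "f", "m", "f", "m", "f"],
    "class": ["1st", "1st", "2nd", "2nd", "3rd", "3rd", "3rd", "1st"],
    "who": ["man", "woman", "man", "woman", "man", "child", "man", "woman"],
})
pvals = pairwise_chi2_pvalues(toy)
print(pvals.round(5))
```

The loop visits each unordered pair once and mirrors the result, which is the same symmetry the proposed .chi2() output relies on.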

Feature Description

Solution

A .chi2() method for both the DataFrame and Series classes would provide efficient and consistent code options for performing these pairwise chi-square tests, producing a correlation-matrix-like output that we could call a chi2-matrix.

Both would have optional verbose modes to include degrees of freedom values in the output.

Potential Code pandas/core/frame.py

from pandas._libs.algos import nanchi2
import numpy as np
import pandas as pd

class DataFrame:
    # Other methods ...

    def chi2(
        self,
        output: str = "p-value",
        max_categories: int = 40,
        verbose: bool = False
    ) -> pd.DataFrame:
        """
        Compute pairwise chi-square analysis of categorical columns, excluding NA/null values.

        Parameters
        ----------
        output : {'p-value', 'chi2stat'}, default 'p-value'
            Determines output format:
            * 'p-value': returns a matrix of p-values from chi-square tests.
            * 'chi2stat': returns a matrix of chi-square statistics. If `verbose=True`,
              each entry is a tuple (chi2_statistic, degrees_of_freedom, p-value).
        max_categories : int, default 40
            Maximum number of unique values allowed for `object` and `int` data types to be included
            in the chi-square calculations. Columns with more than `max_categories` unique values are excluded.
        verbose : bool, default False
            If True and `output="chi2stat"`, each entry in the matrix contains (chi2_statistic, degrees_of_freedom, p-value).

        Returns
        -------
        DataFrame
            Chi-square matrix with pairwise comparisons between columns.

        Raises
        ------
        ValueError
            If the DataFrame contains no columns meeting the criteria for chi-square analysis.

        Notes
        -----
        Only categorical data and `int`/`object` data types with fewer than `max_categories` unique
        values are used. Identical columns return p-value=1.0 and chi2stat=0.0 for optimization.

        Examples
        --------
        >>> import pandas as pd
        >>> df = pd.DataFrame({
        ...     "A": ["dog", "dog", "cat", "dog"],
        ...     "B": ["apple", "orange", "apple", "orange"],
        ...     "C": [1, 2, 1, 2]
        ... })
        >>> df.chi2(output="p-value")
        """

        # Filter columns by dtype and unique values
        valid_columns = [
            col for col in self.columns
            if (
                pd.api.types.is_object_dtype(self[col])
                or pd.api.types.is_integer_dtype(self[col])
                or isinstance(self[col].dtype, pd.CategoricalDtype)  # is_categorical_dtype is deprecated
            )
            and self[col].nunique(dropna=True) <= max_categories
        ]
        if not valid_columns:
            raise ValueError(
                "No columns meet the criteria for chi-square analysis. "
                "Ensure categorical, `int`, or `object` columns with fewer than "
                f"{max_categories} unique values are present."
            )

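        # Prepare data array with valid columns. String columns cannot be cast to
        # float directly, so encode each column as integer codes first
        # (factorize marks missing values with -1, converted to NaN here).
        codes = self[valid_columns].apply(lambda col: pd.factorize(col)[0])
        data = codes.where(codes >= 0).to_numpy(dtype=float)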

        # Use the nanchi2 function from _libs.algos for efficient chi-square calculation
        chi2_matrix = nanchi2(data, max_categories=max_categories, output=output)

        # Handle verbose output for chi2stat
        if output == "chi2stat" and verbose:
            # chi2_matrix holds statistics here, so p-values need a second pass
            p_matrix = nanchi2(data, max_categories=max_categories, output="p-value")
            result = pd.DataFrame(index=valid_columns, columns=valid_columns, dtype=object)
            for i, col1 in enumerate(valid_columns):
                for j, col2 in enumerate(valid_columns):
                    if i == j:
                        result.at[col1, col2] = (0.0, 0, 1.0)  # Identical columns
                    else:
                        dof = (self[col1].nunique() - 1) * (self[col2].nunique() - 1)
                        # .at stores the tuple as a single cell value
                        result.at[col1, col2] = (chi2_matrix[i, j], dof, p_matrix[i, j])
            return result

        # Convert result to DataFrame for standard output
        result = pd.DataFrame(chi2_matrix, index=valid_columns, columns=valid_columns)
        return result

Potential Code pandas/core/series.py

from pandas._libs.algos import nanchi2
import numpy as np
import pandas as pd

class Series:
    # Other methods ...

    def chi2(
        self,
        other: pd.Series,
        output: str = "p-value",
        max_categories: int = 40,
        verbose: bool = False
    ) -> float:
        """
        Compute chi-square association between this Series and another Series, excluding NA/null values.

        Parameters
        ----------
        other : Series
            The other Series with which to compute the chi-square statistic.
        output : {'p-value', 'chi2stat'}, default 'p-value'
            Determines output format:
            * 'p-value': returns the p-value from the chi-square test.
            * 'chi2stat': returns the chi-square statistic.
        max_categories : int, default 40
            Maximum number of unique values allowed for `object` and `int` data types to be included
            in the chi-square calculations. Series with more than `max_categories` unique values are excluded.
        verbose : bool, default False
            If True, returns a tuple with (chi2_statistic, degrees_of_freedom, p-value). 
            Ignored if `output` is 'p-value'.

        Returns
        -------
        float or tuple
            Chi-square test result. If `output="p-value"`, returns the p-value. 
            If `output="chi2stat"`, returns the chi-square statistic. If `verbose=True`, 
            returns a tuple with (chi2_statistic, degrees_of_freedom, p-value).

        Raises
        ------
        ValueError
            If the Series have incompatible lengths, unsupported data types, or excessive unique values.

        Notes
        -----
        Only categorical data and `int`/`object` data types with fewer than `max_categories` unique
        values are used. Identical Series return p-value=1.0 and chi2stat=0.0 for optimization.

        Examples
        --------
        >>> s1 = pd.Series(["dog", "dog", "cat", "dog"])
        >>> s2 = pd.Series(["apple", "orange", "apple", "orange"])
        >>> s1.chi2(s2, output="p-value")
        """

        # Ensure the other input is a Series and has compatible length
        if not isinstance(other, pd.Series):
            raise TypeError("`other` must be a Series.")
        if len(self) != len(other):
            raise ValueError("Both Series must have the same length.")

        # Check if both Series meet unique value criteria and have supported dtypes
        if (self.nunique(dropna=True) > max_categories or other.nunique(dropna=True) > max_categories):
            raise ValueError(
                "Both Series must have fewer than `max_categories` unique values for chi-square analysis."
            )
        # is_categorical_dtype is deprecated; check the dtype instance instead
        if not (
            isinstance(self.dtype, pd.CategoricalDtype) or
            pd.api.types.is_integer_dtype(self) or
            pd.api.types.is_object_dtype(self)
        ):
            raise ValueError("Series must be of type 'int', 'object', or 'category'.")

        if not (
            isinstance(other.dtype, pd.CategoricalDtype) or
            pd.api.types.is_integer_dtype(other) or
            pd.api.types.is_object_dtype(other)
        ):
            raise ValueError("`other` must be of type 'int', 'object', or 'category'.")

        # Check if the Series are identical and optimize by returning expected values
        if self.equals(other):
            return 1.0 if output == "p-value" else 0.0

        # Prepare the data as a 2D array for nanchi2 function
        data = np.vstack([self.fillna(np.nan), other.fillna(np.nan)]).T
        chi2_matrix = nanchi2(data, max_categories=max_categories, output=output)

        # Retrieve the appropriate output format
        if output == "p-value":
            return chi2_matrix[0, 1]

        chi2_stat = chi2_matrix[0, 1]
        if not verbose:
            return chi2_stat

        # chi2_matrix holds the statistic here, so the p-value needs a separate call
        dof = (self.nunique() - 1) * (other.nunique() - 1)
        p_val = nanchi2(data, max_categories=max_categories, output="p-value")[0, 1]
        return (chi2_stat, dof, p_val)
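
The Notes in both docstrings state that identical columns return p-value=1.0 and chi2stat=0.0. As a point of comparison, here is a minimal sketch, assuming SciPy is available, of what a standard test of independence reports when a column is tested against a copy of itself. The data and variable names are illustrative only.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# A balanced two-category column with 100 rows (illustrative data).
s = pd.Series(["dog", "cat"] * 50)

# Contingency table of the column against an identical copy:
# all counts fall on the diagonal.
table = pd.crosstab(s.rename("left"), s.rename("right"))

# correction=False disables Yates' continuity correction so the
# statistic is the plain Pearson chi-square.
stat, p_value, dof, expected = chi2_contingency(table, correction=False)
print(stat, dof, p_value)
```

In this sketch the statistic equals the number of rows and the p-value is very small, so the shortcut values may deserve a second look if the proposal moves forward.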

Potential Code doc/source/reference/api/pandas.DataFrame.chi2.rst

.. _pandas.DataFrame.chi2:

pandas.DataFrame.chi2
=====================

DataFrame.chi2(output='p-value', max_categories=40, verbose=False) -> DataFrame

Compute pairwise chi-square analysis of categorical columns, excluding NA/null values.

This method calculates the chi-square association between pairs of columns in a DataFrame, comparing categorical columns or those with a limited number of unique values (default: 40). The output can either be a matrix of p-values or chi-square statistics.

Parameters
----------
output : {'p-value', 'chi2stat'}, default 'p-value'
    Determines output format:
    * 'p-value': returns a matrix of p-values from chi-square tests.
    * 'chi2stat': returns a matrix of chi-square statistics. If `verbose=True`,
      each entry is a tuple (chi2_statistic, degrees_of_freedom, p-value).

max_categories : int, default 40
    Maximum number of unique values allowed for `object` and `int` data types to be included
    in the chi-square calculations. Columns with more than `max_categories` unique values are excluded.

verbose : bool, default False
    If True and `output="chi2stat"`, each entry in the matrix contains (chi2_statistic, degrees_of_freedom, p-value).

Returns
-------
DataFrame
    Symmetric chi-square matrix with pairwise comparisons between columns.

Raises
------
ValueError
    If the DataFrame contains no columns meeting the criteria for chi-square analysis.

Notes
-----
Only categorical data and `int`/`object` data types with fewer than `max_categories` unique
values are included. Identical columns return p-value=1.0 and chi2stat=0.0 for optimization.

Examples
--------
>>> import pandas as pd
>>> df = pd.DataFrame({
...     "A": ["dog", "dog", "cat", "dog"],
...     "B": ["apple", "orange", "apple", "orange"],
...     "C": [1, 2, 1, 2]
... })
>>> df.chi2(output="p-value")
          A         B         C
A  1.000000  0.300000  0.200000
B  0.300000  1.000000  0.150000
C  0.200000  0.150000  1.000000

See Also
--------
pandas.DataFrame.corr : Compute pairwise correlation of columns.
pandas.DataFrame.corrwith : Compute pairwise correlation with another DataFrame or Series.
pandas.Series.chi2 : Compute chi-square association with another Series.

Potential Code doc/source/reference/api/pandas.Series.chi2.rst

.. _pandas.Series.chi2:

pandas.Series.chi2
==================

Series.chi2(other, output='p-value', max_categories=40, verbose=False) -> float or tuple

Compute chi-square association between this Series and another Series, excluding NA/null values.

This method calculates the chi-square association between two Series, comparing categorical values or those with a limited number of unique values (default: 40). The output can be the p-value, the chi-square statistic, or additional details if `verbose=True`.

Parameters
----------
other : Series
    The other Series with which to compute the chi-square statistic.
output : {'p-value', 'chi2stat'}, default 'p-value'
    Determines output format:
    * 'p-value': returns the p-value from the chi-square test.
    * 'chi2stat': returns the chi-square statistic.
max_categories : int, default 40
    Maximum number of unique values allowed for `object` and `int` data types to be included
    in the chi-square calculations. Series with more than `max_categories` unique values are excluded.
verbose : bool, default False
    If True and `output="chi2stat"`, returns a tuple with (chi2_statistic, degrees_of_freedom, p-value).

Returns
-------
float or tuple
    Chi-square test result. If `output="p-value"`, returns the p-value. 
    If `output="chi2stat"`, returns the chi-square statistic. If `verbose=True` 
    with `output="chi2stat"`, returns a tuple (chi2_statistic, degrees_of_freedom, p-value).

Raises
------
ValueError
    If the Series have incompatible lengths, unsupported data types, or excessive unique values.

Notes
-----
Only categorical data and `int`/`object` data types with fewer than `max_categories` unique
values are used. Identical Series return p-value=1.0 and chi2stat=0.0 for optimization.

Examples
--------
>>> import pandas as pd
>>> s1 = pd.Series(["dog", "dog", "cat", "dog"])
>>> s2 = pd.Series(["apple", "orange", "apple", "orange"])
>>> s1.chi2(s2, output="p-value")
0.300000

See Also
--------
pandas.Series.corr : Compute correlation with another Series.
pandas.DataFrame.chi2 : Compute pairwise chi-square association between columns of a DataFrame.

Potential Code pandas/tests/frame/methods/test_chi2.py

import numpy as np
import pytest
import pandas as pd
from pandas import DataFrame
import pandas._testing as tm

class TestDataFrameChi2:
    def test_chi2_basic(self):
        # Test basic functionality with categorical data
        df = DataFrame({
            "A": ["dog", "dog", "cat", "dog"],
            "B": ["apple", "orange", "apple", "orange"],
            "C": [1, 2, 1, 2]
        })
        result = df.chi2()
        assert result.shape == (3, 3)
        assert result.index.equals(df.columns)
        assert result.columns.equals(df.columns)

    def test_chi2_output_p_value(self):
        # Test output="p-value"
        df = DataFrame({
            "A": ["yes", "no", "yes", "yes"],
            "B": ["high", "low", "medium", "medium"],
            "C": [1, 2, 1, 3]
        })
        result = df.chi2(output="p-value")
        assert result.shape == (3, 3)
        assert result.loc["A", "B"] >= 0  # p-value range check

    def test_chi2_output_chi2stat(self):
        # Test output="chi2stat"
        df = DataFrame({
            "A": ["up", "down", "up", "down"],
            "B": ["high", "medium", "medium", "low"],
            "C": [1, 2, 2, 1]
        })
        result = df.chi2(output="chi2stat")
        assert result.shape == (3, 3)
        assert isinstance(result.loc["A", "B"], float)  # Check statistic is a float

    def test_chi2_max_categories(self):
        # Test max_categories threshold
        df = DataFrame({
            "A": ["cat" + str(i) for i in range(50)],  # Exceeds default max_categories of 40
            "B": ["type" + str(i) for i in range(50)]  # Also exceeds it, so no column qualifies
        })
        with pytest.raises(ValueError, match="No columns meet the criteria for chi-square analysis"):
            df.chi2()

    def test_chi2_na_handling(self):
        # Test handling of NaNs
        df = DataFrame({
            "A": ["yes", "no", np.nan, "yes"],
            "B": ["high", np.nan, "medium", "medium"],
            "C": [1, 2, 1, np.nan]
        })
        result = df.chi2(output="p-value")
        assert result.loc["A", "B"] >= 0  # p-value should be non-negative
        assert np.isnan(result.loc["A", "C"])  # Row with NaNs should yield NaN

    def test_chi2_identical_columns(self):
        # Test optimization for identical columns
        df = DataFrame({
            "A": ["dog", "dog", "cat", "dog"],
            "B": ["dog", "dog", "cat", "dog"],
            "C": [1, 2, 1, 2]
        })
        result = df.chi2(output="p-value")
        assert result.loc["A", "B"] == 1.0  # Identical columns should return p-value=1.0

    def test_chi2_non_categorical_data(self):
        # Float columns are filtered out by the dtype check rather than raising
        df = DataFrame({
            "A": [1.5, 2.5, 3.5, 4.5],  # Continuous numeric data
            "B": ["apple", "orange", "apple", "orange"],
            "C": ["yes", "no", "yes", "yes"]
        })
        result = df.chi2()
        assert list(result.columns) == ["B", "C"]

    def test_chi2_single_column(self):
        # Test single column DataFrame
        df = DataFrame({
            "A": ["dog", "dog", "cat", "dog"]
        })
        result = df.chi2()
        assert result.shape == (1, 1)
        assert result.loc["A", "A"] == 1.0  # Single column should return p-value=1.0

Potential Code pandas/tests/series/methods/test_chi2.py

import numpy as np
import pytest
import pandas as pd
from pandas import Series
import pandas._testing as tm

class TestSeriesChi2:
    def test_chi2_basic(self):
        # Basic functionality with categorical data
        s1 = Series(["dog", "dog", "cat", "dog"])
        s2 = Series(["apple", "orange", "apple", "orange"])
        result = s1.chi2(s2)
        assert isinstance(result, float)  # Expecting a single p-value

    def test_chi2_output_p_value(self):
        # Test output="p-value" explicitly
        s1 = Series(["yes", "no", "yes", "yes"])
        s2 = Series(["high", "low", "medium", "medium"])
        result = s1.chi2(s2, output="p-value")
        assert 0 <= result <= 1  # p-value should be within this range

    def test_chi2_output_chi2stat(self):
        # Test output="chi2stat"
        s1 = Series(["up", "down", "up", "down"])
        s2 = Series(["high", "medium", "medium", "low"])
        result = s1.chi2(s2, output="chi2stat")
        assert isinstance(result, float)  # Expecting chi-square statistic as a float

    def test_chi2_verbose_output(self):
        # Test verbose output for chi2stat
        s1 = Series(["yes", "no", "yes", "yes"])
        s2 = Series(["high", "low", "medium", "medium"])
        result = s1.chi2(s2, output="chi2stat", verbose=True)
        assert isinstance(result, tuple)  # Should return tuple in verbose mode
        assert len(result) == 3  # Tuple should contain (chi2_statistic, degrees_of_freedom, p-value)

    def test_chi2_max_categories(self):
        # Test max_categories threshold
        s1 = Series(["cat" + str(i) for i in range(50)])  # Exceeds default max_categories of 40
        s2 = Series(["type" + str(i % 3) for i in range(50)])
        with pytest.raises(ValueError, match="must have fewer than `max_categories` unique values"):
            s1.chi2(s2)

    def test_chi2_na_handling(self):
        # Test handling of NaNs
        s1 = Series(["yes", "no", np.nan, "yes"])
        s2 = Series(["high", np.nan, "medium", "medium"])
        result = s1.chi2(s2, output="p-value")
        assert 0 <= result <= 1 or np.isnan(result)  # Allow p-value or NaN

    def test_chi2_identical_series(self):
        # Test optimization for identical Series
        s1 = Series(["dog", "dog", "cat", "dog"])
        s2 = s1.copy()  # Identical Series
        result = s1.chi2(s2, output="p-value")
        assert result == 1.0  # Identical series should return p-value=1.0

    def test_chi2_non_categorical_data(self):
        # Test error handling for non-categorical data
        s1 = Series([1.5, 2.5, 3.5, 4.5])  # Continuous numeric data
        s2 = Series(["apple", "orange", "apple", "orange"])
        with pytest.raises(ValueError, match="must be of type 'int', 'object', or 'category'"):
            s1.chi2(s2)

    def test_chi2_mismatched_lengths(self):
        # Test error handling for mismatched Series lengths
        s1 = Series(["dog", "dog", "cat", "dog"])
        s2 = Series(["apple", "orange", "apple"])  # Mismatched length
        with pytest.raises(ValueError, match="Both Series must have the same length"):
            s1.chi2(s2)

Alternative Solutions

Currently, to perform chi-square tests on pairs of categorical columns in a DataFrame, users can rely on a combination of the following libraries and approaches:

Using Scipy’s chi2_contingency Function
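
As a minimal sketch of this approach, a contingency table from `pd.crosstab` can be passed to `scipy.stats.chi2_contingency`. The column names and values below are illustrative and not taken from the proposal.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Illustrative data: two categorical columns.
df = pd.DataFrame({
    "A": ["dog", "dog", "cat", "dog", "cat", "cat", "dog", "cat"],
    "B": ["apple", "orange", "apple", "orange",
          "apple", "apple", "orange", "orange"],
})

# Build the contingency table of observed counts.
table = pd.crosstab(df["A"], df["B"])

# chi2_contingency returns the statistic, the p-value, the degrees of
# freedom, and the table of expected counts.
stat, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={stat:.4f}, dof={dof}, p={p_value:.4f}")
```

For 2x2 tables SciPy applies Yates' continuity correction by default; passing `correction=False` gives the uncorrected Pearson statistic.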

Other Third-Party Libraries:

Built-in functionality would streamline categorical data analysis within Pandas, aligning with the goal of being a comprehensive tool for data manipulation and analysis.

Additional Context

Searched for related issues and found none; however, I may have missed them. Thanks to all in the world of Pandas for consideration, review, and efforts.

rhshadrach commented 3 weeks ago

Thanks for the request. There are many statistical tests, and for many tests, there are also many variations of that test. It seems to me it would not be maintainable for pandas to add statistical tests to its API, but rather should provide the functionality to allow the user or third party packages to implement their tests. As such, I'm negative on adding this.

However if there are operations that would make implementing statistical tests easier / more performant, I think it could be considered.

# Use the nanchi2 function from _libs.algos for efficient chi-square calculation
chi2_matrix = nanchi2(data, max_categories=max_categories, output=output)

Just to be sure, this function does not yet exist and would also need to be added. I did not see it in your implementation above.
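
To make concrete what such a function would have to compute for each pair of columns, here is a minimal pure-NumPy sketch. The name `chi2_from_series` is hypothetical, the p-value step assumes SciPy is available, and this is not an existing pandas function.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2 as chi2_dist


def chi2_from_series(x: pd.Series, y: pd.Series):
    """Hypothetical stand-in for the per-pair work of a `nanchi2` kernel.

    Assumes x and y share the same index.
    """
    mask = x.notna() & y.notna()          # pairwise NA exclusion
    xi, x_levels = pd.factorize(x[mask])  # integer codes per category
    yi, y_levels = pd.factorize(y[mask])
    r, c = len(x_levels), len(y_levels)

    # Observed counts: one bincount over the combined (row, col) codes.
    observed = np.bincount(xi * c + yi, minlength=r * c).reshape(r, c)
    observed = observed.astype(float)

    # Expected counts under independence: outer product of the margins.
    row_tot = observed.sum(axis=1)
    col_tot = observed.sum(axis=0)
    expected = np.outer(row_tot, col_tot) / observed.sum()

    # Pearson statistic (no continuity correction) and its p-value.
    stat = ((observed - expected) ** 2 / expected).sum()
    dof = (r - 1) * (c - 1)
    p_value = chi2_dist.sf(stat, dof)
    return stat, dof, p_value


x = pd.Series(["a", "a", "a", "b", "b", "b"])
y = pd.Series(["u", "u", "v", "u", "v", "v"])
stat, dof, p_value = chi2_from_series(x, y)
print(stat, dof, p_value)
```

A compiled version would mainly replace the per-pair Python overhead; the arithmetic itself is the table of observed counts, the outer product of its margins, and one sum.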

adamrossnelson commented 3 weeks ago

I hope to save this issue from closure. A .chi2() method seems like a natural extension to the already available .corr() method. The analyses available through the .corr() method are rudimentary and among the most fundamental statistical analyses across all of statistics and .corr() is heavily relied upon by many scientific and analytical professionals.

Before the .corr() method became an important method on the Series and DataFrame objects in Pandas we could have objected to its inclusion also. Multiple implementations... multiple variations... etc. Today though, it seems inconceivable that Pandas shouldn't include a .corr() method.

Similarly, chi2 analysis is also a widely utilized and fundamental statistical analysis. While Pandas excels at providing analytical options for continuous variables, it has room for growth with regard to categorical variables. Not including a .chi2() method seems like an oversight and/or a missed opportunity. Just as .corr() provides a first-pass look at relationships between continuous variables, .chi2() would offer an equivalent for categorical data. This consistency aligns with Pandas’ goal of providing a comprehensive exploratory data analysis toolkit.

I see your point. Pandas probably can't, and arguably shouldn't, strive to provide every conceivable statistical analysis. At present, only a few statistical methods go beyond basics such as .mean() and .std(); examples, again focused on continuous data, include .skew(), .kurt(), and .sem(). And they're invaluable.

We can also read in the Pandas documentation that the goal of the project is to "becom[e] the most powerful and flexible open source data analysis/manipulation tool available in any language." For these reasons I sincerely hope that there may be room for further discussion here. Even the cousins to .corr() such as .cov() and .corrwith() are for continuous data. Useful. But they do very little for folks who need or want a quick look at how or if categorical columns may be related.

Also - the code I proposed doesn't provide a full solution. It is a proposed starting point. So if this idea does proceed the code would need additional review.

As such, I hope there may be further discussion and review of this suggestion.

rhshadrach commented 3 weeks ago

A .chi2() method seems like a natural extension to the already available .corr() method.

At what point does this line of thinking end?

Similarly chi2 analysis is also a widely utilized and fundamental statistical analysis.

I think this is not the metric that should be utilized when determining whether a method should be included in pandas.

adamrossnelson commented 1 week ago

I've been thinking about these questions for the past few weeks. I'm not sure I have good answers.

I'm not a maintainer. So the decision to move forward with this proposal is not mine. In the spirit of added conversation and deliberation I would ask if there is any history on how, why, or when Pandas added the .corr() method? If a .chi2() method doesn't belong... how did the community decide the .corr() method does belong? All rhetorical I suppose.

My feelings won't be hurt if this idea gets shelved (for now, or even indefinitely). It also seems that the suggestion hasn't inspired comments from any others (either in support or against). Perhaps that lack of discussion means there is a lack of enthusiasm for the idea, which, on balance, weighs in favor of putting this on the "not right now" list.

Thanks to @rhshadrach for all the work in grooming this list of issues! Also to all the others who perform similar and contributing work on Pandas!