pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Inconsistent Results Between Pandas and Polars using cut (and qcut)? #18236

Open pinpss opened 1 month ago

pinpss commented 1 month ago


Reproducible example

I asked this question on StackOverflow but am unsure whether it is related to a bug.

I’m switching from Pandas to Polars to create quantile-based portfolios, aiming to categorize a numerical variable into equal-sized portfolios using quantile breakpoints. Therefore, I am using the cut function.

However, I’m seeing discrepancies in the bins generated by Pandas and Polars, resulting in inconsistent outcomes between the two implementations.

Function for quantile-based binning using Pandas

    import pandas as pd
    import polars as pl
    import numpy as np
    from sklearn.metrics import confusion_matrix

    # Set random seed for reproducibility
    np.random.seed(42)

    def nqtile(df, sorting_variable, nq):
        # Calculate quantile breakpoints
        breakpoints = np.quantile(
            df[sorting_variable].dropna(), 
            np.linspace(0, 1, nq + 1), 
            method="linear"
        )

        # Assign observations to bins
        nqports = pd.cut(
            df[sorting_variable],
            bins=nq,
            labels=range(1, breakpoints.size),
            include_lowest=True,
            right=False
        )

        return nqports
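One thing worth noting about the reproduction above: `nqtile` computes `breakpoints` but then passes the scalar `bins=nq` to `pd.cut`, so pandas builds `nq` equal-width bins over the data range and the quantile breakpoints are never used. If the intent is quantile-based bins, passing the breakpoints array changes the result. A minimal sketch of the difference (toy data, not from the seeded example):

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 3, 22, 24, 30, 53, 88, 88])
nq = 4

# Scalar bins: pandas slices [min, max] into nq equal-WIDTH intervals.
equal_width = pd.cut(s, bins=nq, labels=range(1, nq + 1))

# Array bins: pandas uses the supplied quantile breakpoints as edges,
# which yields equal-frequency bins instead.
breakpoints = np.quantile(s, np.linspace(0, 1, nq + 1), method="linear")
quantile_bins = pd.cut(s, bins=breakpoints, labels=range(1, nq + 1),
                       include_lowest=True)

print(equal_width.tolist())    # equal-width assignment
print(quantile_bins.tolist())  # equal-frequency assignment
```

With this toy data the two assignments already disagree on one value (22), which is the same kind of boundary disagreement the confusion matrix below shows.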

Function for quantile-based binning using Polars

    def nqtile_pl(df, sorting_variable, nq):
        # Calculate quantile breakpoints
        breakpoints = np.quantile(
            df[sorting_variable].drop_nulls(), 
            np.linspace(0, 1, nq + 1), 
            method="linear"
        )

        # Remove the first and last breakpoints for binning
        breakpoints_1 = breakpoints[1:-1]
        labels_list = [str(i) for i in range(1, nq + 1)]

        # Assign observations to bins
        nqports = pl.Series.cut(
            df[sorting_variable],
            breaks=breakpoints_1,
            labels=labels_list,
            left_closed=False
        )
        return nqports

Example

    data = {
        'id': np.random.choice(range(1, 20), size=1000, replace=True),
        'random_number': np.random.randint(1, 100, size=1000)
    }

    # Create a Pandas DataFrame
    df = pd.DataFrame(data)

    # Apply Pandas-based quantile binning
    df['port_Q_pd'] = nqtile(df, 'random_number', 5)
    df['port_Q_pd'] = df['port_Q_pd'].astype(int)

    # Convert to Polars DataFrame
    df_pl = pl.DataFrame(df)

    # Apply Polars-based quantile binning and add to the DataFrame
    df_pl = df_pl.with_columns(
        nqtile_pl(df_pl, 'random_number', 5).alias('port_Q_pl')
    )

    # Convert back to Pandas for comparison
    df_conf = df_pl.to_pandas()

    # Ensure both columns are integers for comparison
    df_conf['port_Q_pl'] = df_conf['port_Q_pl'].astype(int)
    df_conf['port_Q_pd'] = df_conf['port_Q_pd'].astype(int)

    # Calculate the confusion matrix to compare the results
    cm = confusion_matrix(df_conf['port_Q_pl'], df_conf['port_Q_pd'])
    print(cm)

    [[207   0   0   0   0]
     [  0 200   0   0   0]
     [  0   0 183  13   0]
     [  0   0   0 184  19]
     [  0   0   0   0 194]]
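Beyond the confusion matrix, it can also help to inspect the disagreeing rows directly. A minimal, self-contained sketch (the frame below is illustrative toy data standing in for `df_conf`, not the seeded example):

```python
import pandas as pd

# Toy stand-in for df_conf; values are illustrative only.
df_conf = pd.DataFrame({
    "random_number": [5, 40, 41, 90],
    "port_Q_pd":     [1, 2, 3, 5],
    "port_Q_pl":     [1, 2, 2, 5],
})

# Rows where the two binnings disagree (the off-diagonal cells
# of the confusion matrix).
mismatch = df_conf[df_conf["port_Q_pd"] != df_conf["port_Q_pl"]]
print(mismatch)
```

Looking at `random_number` in the mismatched rows typically shows values sitting near a bin edge, which points at differing breakpoints rather than differing assignment logic.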

Log output

No response

Issue description

Discrepancies in the bins generated by Pandas and Polars, resulting in inconsistent outcomes between the two implementations.

Expected behavior

Same bins for both packages.

Installed versions

```
--------Version info---------
Polars: 1.5.0
Index type: UInt32
Platform: macOS-14.2.1-arm64-arm-64bit
Python: 3.9.6 (default, Nov 10 2023, 13:38:27) [Clang 15.0.0 (clang-1500.1.0.2.5)]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:
gevent:
great_tables:
hvplot:
matplotlib: 3.9.1
nest_asyncio: 1.6.0
numpy: 1.26.4
openpyxl:
pandas: 2.2.2
pyarrow: 17.0.0
pydantic:
pyiceberg:
sqlalchemy: 2.0.31
torch:
xlsx2csv:
xlsxwriter:
```
cmdlineluser commented 1 month ago

I'm not too familiar with these functions, but this is an attempt to understand what the actual problem is with a minimal example.

pandas seems to use a specific formula when `bins` is a scalar:

import pandas as pd
import polars as pl
import numpy as np

random_number = [88, 24, 3, 22, 53, 2, 88, 30]
n_bins = 5

breakpoints = np.quantile(random_number, np.linspace(0, 1, n_bins + 1), method="linear")

breakpoints_pd = np.linspace(min(random_number), max(random_number), n_bins + 1, endpoint=True)
breakpoints_pd[-1] += (max(random_number) - min(random_number)) * 0.0001 # right=False

Which seems to produce different breakpoints:

breakpoints
# array([ 2. , 10.6, 23.6, 34.6, 74. , 88. ])

breakpoints_pd
# array([ 2.    , 19.2   , 36.4   , 53.6   , 70.8   , 88.0086])
pd.cut(pd.Series(random_number), n_bins)
# 0     (70.8, 88.0]
# 1     (19.2, 36.4]
# 2    (1.914, 19.2]
# 3     (19.2, 36.4]
# 4     (36.4, 53.6]
# 5    (1.914, 19.2]
# 6     (70.8, 88.0]
# 7     (19.2, 36.4]

pd.cut(pd.Series(random_number), n_bins, retbins=True)[-1]
# array([ 1.914, 19.2  , 36.4  , 53.6  , 70.8  , 88.   ])

Using `cut` with `breakpoints_pd` seems to produce similar results, apart from the NaN:

pd.cut(pd.Series(random_number), breakpoints_pd)
# 0    (70.8, 88.009]
# 1      (19.2, 36.4]
# 2       (2.0, 19.2]
# 3      (19.2, 36.4]
# 4      (36.4, 53.6]
# 5               NaN
# 6    (70.8, 88.009]
# 7      (19.2, 36.4]

pl.Series(random_number).cut(breakpoints_pd)
# shape: (8,)
# Series: '' [cat]
# [
#   "(70.8, 88.0086]"
#   "(19.2, 36.4]"
#   "(2, 19.2]"
#   "(19.2, 36.4]"
#   "(36.4, 53.599999999999994]"
#   "(-inf, 2]"
#   "(70.8, 88.0086]"
#   "(19.2, 36.4]"
# ]

But I'm not sure if this is expected or not?

mcrumiller commented 1 month ago

There has been a PR in the pipeline for a while that addresses this: #16942.

Edit: that may only be for hist, let me check.

xuJ14 commented 1 month ago

The behavior is discussed thoroughly in #10468.