Open pinpss opened 1 month ago
I'm not too familiar with these functions, but this is an attempt to understand what the actual problem is with a minimal example.
pandas seems to use a specific formula when bins is a scalar:
import pandas as pd
import polars as pl
import numpy as np
random_number = [88, 24, 3, 22, 53, 2, 88, 30]
n_bins = 5
breakpoints = np.quantile(random_number, np.linspace(0, 1, n_bins + 1), method="linear")
breakpoints_pd = np.linspace(min(random_number), max(random_number), n_bins + 1, endpoint=True)
breakpoints_pd[-1] += (max(random_number) - min(random_number)) * 0.0001  # widen the last edge, as pandas does for right=False
Which seems to produce different breakpoints:
breakpoints
# array([ 2. , 10.6, 23.6, 34.6, 74. , 88. ])
breakpoints_pd
# array([ 2. , 19.2 , 36.4 , 53.6 , 70.8 , 88.0086])
pd.cut(pd.Series(random_number), n_bins)
# 0 (70.8, 88.0]
# 1 (19.2, 36.4]
# 2 (1.914, 19.2]
# 3 (19.2, 36.4]
# 4 (36.4, 53.6]
# 5 (1.914, 19.2]
# 6 (70.8, 88.0]
# 7 (19.2, 36.4]
pd.cut(pd.Series(random_number), n_bins, retbins=True)[-1]
# array([ 1.914, 19.2 , 36.4 , 53.6 , 70.8 , 88. ])
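The 1.914 lower edge returned above matches what you get by taking equal-width edges and lowering the first one by 0.1% of the data range. Here is a minimal sketch, assuming that is the rule pandas applies for a scalar bins with the default right=True. The helper name equal_width_edges is hypothetical, and the 0.001 factor is inferred from the output rather than quoted from the pandas source:

```python
import numpy as np


def equal_width_edges(values, n_bins, right=True, adj_factor=0.001):
    """Hypothetical helper: equal-width edges with one outer edge widened."""
    lo, hi = float(min(values)), float(max(values))
    edges = np.linspace(lo, hi, n_bins + 1)
    adj = (hi - lo) * adj_factor  # assumed: 0.1% of the data range
    if right:
        edges[0] -= adj   # intervals are (a, b], so widen below the minimum
    else:
        edges[-1] += adj  # intervals are [a, b), so widen above the maximum
    return edges


random_number = [88, 24, 3, 22, 53, 2, 88, 30]
edges = equal_width_edges(random_number, 5)
print(edges)
```

With right=True the minimum (2) falls inside the first interval, which would explain why pd.cut(..., n_bins) produces no NaN, whereas the hand-built breakpoints_pd start exactly at 2.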
Using cut with breakpoints_pd seems to produce similar results, apart from the NaN:
pd.cut(pd.Series(random_number), breakpoints_pd)
# 0 (70.8, 88.009]
# 1 (19.2, 36.4]
# 2 (2.0, 19.2]
# 3 (19.2, 36.4]
# 4 (36.4, 53.6]
# 5 NaN
# 6 (70.8, 88.009]
# 7 (19.2, 36.4]
pl.Series(random_number).cut(breakpoints_pd)
# shape: (8,)
# Series: '' [cat]
# [
# "(70.8, 88.0086]"
# "(19.2, 36.4]"
# "(2, 19.2]"
# "(19.2, 36.4]"
# "(36.4, 53.599999999999994]"
# "(-inf, 2]"
# "(70.8, 88.0086]"
# "(19.2, 36.4]"
# ]
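The two outputs above differ only for the value 2, which sits exactly on the lowest edge. pandas treats the supplied edges as the complete set of bins and leaves anything outside them as NaN, while the polars output shows an extra (-inf, 2] bin, which suggests the breakpoints act as interior cut points there. Here is a minimal pandas-only sketch of two ways to avoid the NaN, assuming plain equal-width edges (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

values = pd.Series([88, 24, 3, 22, 53, 2, 88, 30])
edges = np.linspace(values.min(), values.max(), 6)

# Default: intervals are (a, b], so the minimum lies outside the first bin.
default = pd.cut(values, edges)
n_missing_default = int(default.isna().sum())

# Option 1: close the first interval on the left.
closed = pd.cut(values, edges, include_lowest=True)
n_missing_closed = int(closed.isna().sum())

# Option 2: keep only the interior edges and make the outer bins open-ended.
open_edges = np.concatenate(([-np.inf], edges[1:-1], [np.inf]))
open_ended = pd.cut(values, open_edges)
open_codes = open_ended.cat.codes.tolist()
```

Under the same assumption, passing only the interior breakpoints (edges[1:-1]) to pl.Series.cut should give comparable open-ended bins, though that is not run here.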
But I'm not sure if this is expected or not?
There has been a PR in the pipeline for a while that addresses this: #16942.
Edit: that may only be for hist, let me check.
The behavior is discussed thoroughly in #10468.
Reproducible example
I asked this question on StackOverflow here but am unsure if it’s related to a bug.
I’m switching from Pandas to Polars to create quantile-based portfolios, aiming to categorize a numerical variable into equal-sized portfolios using quantile breakpoints. Therefore, I am using the cut function.
However, I’m seeing discrepancies in the bins generated by Pandas and Polars, resulting in inconsistent outcomes between the two implementations.
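For reference, here is a minimal sketch of what quantile-based binning with cut could look like in pandas. It is not the function from this report. The name assign_portfolios is hypothetical, and it assumes linearly interpolated quantiles as breakpoints with a left-closed first interval:

```python
import numpy as np
import pandas as pd


def assign_portfolios(values, n_portfolios):
    """Hypothetical helper: 0-based portfolio index from quantile breakpoints."""
    series = pd.Series(values)
    probs = np.linspace(0, 1, n_portfolios + 1)
    breakpoints = np.quantile(series, probs, method="linear")
    # include_lowest=True keeps the minimum inside the first portfolio.
    return pd.cut(series, breakpoints, labels=False, include_lowest=True)


portfolios = assign_portfolios([88, 24, 3, 22, 53, 2, 88, 30], 5)
print(portfolios.tolist())
```

Whether a polars version matches this depends on how the outermost breakpoints and the interval closure are passed, which is where the two libraries appear to differ.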
Function for quantile-based binning using Pandas
Function for quantile-based binning using Polars
Example
Issue description
Discrepancies in the bins generated by Pandas and Polars, resulting in inconsistent outcomes between the two implementations.
Expected behavior
Same bins for both packages.
Installed versions