pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: pd.cut closed intervals #51534

Open Gabriel-Kissin opened 1 year ago

Gabriel-Kissin commented 1 year ago

Feature Type

Problem Description

pd.cut and pd.qcut create intervals and partition the dataset according to those intervals. The intervals are generally open on the left and closed on the right, which works fine for continuous data.

However, with discrete data this can sometimes give slightly ridiculous intervals. Integers are a case in point: if we have numbers from 0 to 99, pd.cut makes intervals like (-0.099, 14.1429], (14.1429, 28.2857].

Here is an MRE:

import pandas as pd
import numpy as np

# Integers 0..99, cut into 7 equal-width bins
interval_testing = pd.DataFrame({'data': np.arange(0, 100)})

interval_testing['interval'] = pd.cut(interval_testing['data'], bins=7, precision=4)
# interval_testing['interval'] = pd.qcut(interval_testing['data'], q=7, precision=5)

# Smallest value, largest value and number of rows in each bin
interval_testing.groupby('interval').aggregate(['min', 'max', 'count'])

which outputs the following:
image

It would be great if there was an option to specify that data is discrete, so that intervals would be like (14, 28], (28, 42] etc. (Not just integers: for example, for data measured to one decimal place, it would give (14.3, 28.7], (28.7, 42.4] or similar.) The level of discreteness can be inferred from the data. [EDIT: at the moment you can achieve something similar using the precision= parameter, but this doesn't automatically infer the precision from the data. The main point of this suggestion is the next point, that intervals should be closed.]

A further improvement would then be a parameter to control that the intervals should be fully closed. So [15, 28], [29, 42] etc.

For example if the data is "number of times something happened", intervals like those described would be more intuitive.

It isn't particularly hard to work around this, but it might be a useful feature to add.

Feature Description

Some rough code:

import pandas as pd
import numpy as np

def integer_qcut(x, q):
    # qcut is used only to obtain the quantile edges; its binned result is discarded
    _, bins = pd.qcut(x, q, duplicates='drop', retbins=True)
    bins = np.floor(bins).astype(int)
    bins_left = bins[:-1]
    # Lower every right edge except the last by 1 so adjacent closed intervals don't overlap
    bins_right = bins[1:].copy()
    bins_right[:-1] -= 1
    bins = pd.IntervalIndex.from_arrays(left=bins_left, right=bins_right, closed='both')
    return pd.cut(x=x, bins=bins)

interval_testing = pd.DataFrame({'data': np.arange(0, 100)})
interval_testing['interval'] = integer_qcut(interval_testing['data'], q=7)

interval_testing.groupby('interval').aggregate(['min', 'max', 'count'])

gives

image

Not perfect because the final group is larger than the others (as a result of using np.floor), but illustrates what I'm getting at.
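
One possible way to avoid the oversized final group, offered only as a sketch and not as a proposed pandas algorithm: split the sorted distinct values into near-equal chunks with np.array_split and build one closed interval per chunk. The name integer_cut_even is hypothetical, and this evens out the number of distinct values per bin rather than the number of observations, so it is not a true quantile cut.

```python
import numpy as np
import pandas as pd

def integer_cut_even(x, n_bins):
    """Hypothetical helper: closed bins covering near-equal numbers of distinct values."""
    unique_values = np.unique(x)  # sorted distinct values
    chunks = np.array_split(unique_values, n_bins)
    chunks = [c for c in chunks if len(c) > 0]  # drop empty chunks if n_bins > distinct values
    bins = pd.IntervalIndex.from_arrays(
        left=[c[0] for c in chunks],
        right=[c[-1] for c in chunks],
        closed="both",
    )
    return pd.cut(x, bins=bins)

data = pd.Series(np.arange(0, 100))
binned = integer_cut_even(data, n_bins=7)
counts = np.bincount(binned.cat.codes).tolist()
print(counts)
```

With the integers 0 to 99 and 7 bins, the two extra values go to the first two bins instead of all landing in the last one.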

Alternative Solutions

potentially multiple ways of solving this...

Additional Context

No response

rhshadrach commented 1 year ago

Thanks for the request.

> The level of discrete-ness can be inferred from the data.

Can you give some details here?

> A further improvement would then be a parameter to control that the intervals should be fully closed. So [15, 28], [29, 42] etc.

I think this is independent of discrete bins; I would recommend making this a separate issue. But I don't understand what cannot be accomplished with half-open bins that can be accomplished with closed bins here, though that discussion is probably best for a separate issue.
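
To make the comparison concrete, here is a small sketch (not part of the original comment) checking that, on integer data, a half-open interval and the corresponding closed interval select the same values. The variable names are illustrative only.

```python
import pandas as pd

half_open = pd.Interval(14, 28, closed="right")  # (14, 28]
closed = pd.Interval(15, 28, closed="both")      # [15, 28]

integers = range(0, 100)
in_half_open = [i for i in integers if i in half_open]
in_closed = [i for i in integers if i in closed]

# Both intervals contain exactly the integers 15..28
same_members = in_half_open == in_closed
print(same_members)
```

The two only differ for non-integer values such as 14.5, which falls in (14, 28] but not in [15, 28], and in how the interval reads when used as a label.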

> Not perfect because the final group is larger than the others (as a result of using np.floor), but illustrates what I'm getting at.

Is there a proposed algorithm?

Gabriel-Kissin commented 1 year ago

> > The level of discrete-ness can be inferred from the data.

> Can you give some details here?

I was imagining an option where the data is treated as continuous by default, but the user can either specify a certain level of discreteness or set it to 'infer'. If set to 'infer', it could do something along these lines:

discrete_data = np.linspace(0,10,200)
smallest_diff = np.diff(np.sort(discrete_data)).min()
inferred_precision = np.floor(np.log10(smallest_diff))
level_discreteness = 10 ** inferred_precision
print(smallest_diff, inferred_precision, level_discreteness, sep='\t')
# outputs 0.05025125628140614   -2.0    0.01

Looking over my post again, I see I forgot to mention that the precision parameter of pd.cut can be used to make bins at the specified precision. So using the above data, if we want it in 7 bins:

pd.cut(discrete_data, bins=7, )
# [(-0.01, 1.429], (-0.01, 1.429], (-0.01, 1.429], (-0.01, 1.429], (-0.01, 1.429], ..., (8.571, 10.0], (8.571, 10.0], (8.571, 10.0], (8.571, 10.0], (8.571, 10.0]]
pd.cut(discrete_data, bins=7, precision=-int(inferred_precision)) 
# [(-0.01, 1.43], (-0.01, 1.43], (-0.01, 1.43], (-0.01, 1.43], (-0.01, 1.43], ..., (8.57, 10.0], (8.57, 10.0], (8.57, 10.0], (8.57, 10.0], (8.57, 10.0]]

The second call produces more appropriate bins.
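
The two steps above could be wrapped into one helper, roughly as follows. This is only a sketch: infer_cut_precision is a hypothetical name, not a pandas function. It assumes at least two distinct values, uses np.unique so that repeated values don't produce a zero gap, and clamps the result at 0, which is a choice made for this example.

```python
import numpy as np
import pandas as pd

def infer_cut_precision(values):
    """Hypothetical helper: guess a `precision` for pd.cut from the smallest gap."""
    unique_sorted = np.unique(values)  # sorted, duplicates removed
    smallest_diff = np.diff(unique_sorted).min()
    inferred_exponent = int(np.floor(np.log10(smallest_diff)))
    # precision counts decimal places, so flip the sign; clamp at 0 (assumption)
    return max(-inferred_exponent, 0)

discrete_data = np.linspace(0, 10, 200)
precision = infer_cut_precision(discrete_data)
binned = pd.cut(discrete_data, bins=7, precision=precision)
print(precision, binned.categories)
```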

Gabriel-Kissin commented 1 year ago

> > A further improvement would then be a parameter to control that the intervals should be fully closed. So [15, 28], [29, 42] etc.

> I think this is independent of discrete bins; I would recommend making this a separate issue. But I don't understand what cannot be accomplished with half-open bins that can be accomplished with closed bins here, though that discussion is probably best for a separate issue.

Discrete bins can be accomplished through the precision parameter (although that has its own issues, https://github.com/pandas-dev/pandas/issues/51532). Apologies for not mentioning this last time. The point of this suggested improvement is to allow closed bins, as per the post title.

The advantage of closed bins, to me, is that they just look neater. Particularly with a feature representing a count, e.g. how many times each person works out per week, binning it into [0, 2], [3, 5] etc. just looks better than what precision=0 currently does, i.e. (-0.0, 2.0], (2.0, 5.0] etc. (How the intervals look can be important if you're plotting the data in such a way that the interval is used as a label.)
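
For reference, bins like these can already be built by hand by passing a closed IntervalIndex to pd.cut, as in the rough code in the starting post. A minimal sketch, with made-up workout counts and hand-picked edges:

```python
import pandas as pd

# Made-up data: workouts per week for eight people
workouts = pd.Series([0, 1, 2, 3, 4, 5, 6, 7])

# Fully closed, non-overlapping integer intervals: [0, 2], [3, 5], [6, 7]
bins = pd.IntervalIndex.from_arrays(left=[0, 3, 6], right=[2, 5, 7], closed="both")

binned = pd.cut(workouts, bins=bins)
labels = [str(interval) for interval in binned]
print(labels)
```

The edges have to be chosen manually here, which is the part the requested parameter would automate.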

rhshadrach commented 1 year ago

> Discrete bins can be accomplished through the precision parameter (although that has its own issues, https://github.com/pandas-dev/pandas/issues/51532). Apologies for not mentioning this last time. The point of this suggested improvement is to allow closed bins, as per the post title.

Then perhaps

> It would be great if there was an option to specify that data is discrete, so that intervals would be like (14, 28], (28, 42] etc. (Not just integers - for example data which is measured to one dp, it would give (14.3, 28.7], (28.7, 42.4] or similar). The level of discrete-ness can be inferred from the data.

should be removed?

Gabriel-Kissin commented 1 year ago

Yes, you're right. I've edited the starting post, apologies for lack of clarity there.