Gabriel-Kissin opened this issue 1 year ago
Thanks for the request.
> The level of discrete-ness can be inferred from the data.
Can you give some details here?
> A further improvement would then be a parameter to control that the intervals should be fully closed. So [15, 28], [29, 42] etc.

I think this is independent of discrete bins, so I would recommend making it a separate issue. I also don't understand what can be accomplished with closed bins here that cannot be accomplished with half-open bins, but that discussion is probably best left for that separate issue.
> Not perfect because the final group is larger than the others (as a result of using np.floor), but illustrates what I'm getting at.
Is there a proposed algorithm?
> The level of discrete-ness can be inferred from the data.

> Can you give some details here?
I was imagining an option where, by default, data is treated as continuous, but either a specific level of discreteness can be supplied, or it can be set to 'infer'. If set to 'infer', it could do something along these lines:
```python
import numpy as np

discrete_data = np.linspace(0, 10, 200)
smallest_diff = np.diff(np.sort(discrete_data)).min()
inferred_precision = np.floor(np.log10(smallest_diff))
level_discreteness = 10 ** inferred_precision
print(smallest_diff, inferred_precision, level_discreteness, sep='\t')
# outputs 0.05025125628140614	-2.0	0.01
```
Looking over my post again, I see I forgot to mention that the `precision` parameter of `pd.cut` can be used to make bins at the specified precision. So using the above data, if we want it in 7 bins:
```python
pd.cut(discrete_data, bins=7)
# [(-0.01, 1.429], (-0.01, 1.429], (-0.01, 1.429], (-0.01, 1.429], (-0.01, 1.429], ..., (8.571, 10.0], (8.571, 10.0], (8.571, 10.0], (8.571, 10.0], (8.571, 10.0]]

pd.cut(discrete_data, bins=7, precision=-int(inferred_precision))
# [(-0.01, 1.43], (-0.01, 1.43], (-0.01, 1.43], (-0.01, 1.43], (-0.01, 1.43], ..., (8.57, 10.0], (8.57, 10.0], (8.57, 10.0], (8.57, 10.0], (8.57, 10.0]]
```

The second call has more appropriate bins.
> A further improvement would then be a parameter to control that the intervals should be fully closed. So [15, 28], [29, 42] etc.

> I think this is independent of discrete bins; I would recommend making this a separate issue. But I don't understand what cannot be accomplished with half-open bins that can be accomplished with closed bins here, though that discussion is probably best for a separate issue.
Discrete bins can be accomplished through the `precision` parameter (although that has its own issues, https://github.com/pandas-dev/pandas/issues/51532). Apologies for not mentioning this last time. The point of this suggested improvement is to allow closed bins, as per the post title.

The advantage of closed bins is that, to me, it just looks neater. Particularly with a feature representing a count, e.g. how many times a week each person works out, binning it into [0, 2], [3, 5] etc. just looks better than what `precision=0` currently does, i.e. (-0.0, 2.0], (2.0, 5.0] etc. (How the intervals look can be important if you're plotting the data in such a way that the interval is used as a label.)
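To make this concrete, here is a tiny illustration with made-up workout-count data (the data and the explicit bin edges here are my own, for illustration only): first what `pd.cut` produces today with `precision=0`, then the closed-interval look, emulated by passing explicit edges and string labels:

```python
import numpy as np
import pandas as pd

# Made-up count data: how many times a week each person works out
workouts = np.array([0, 0, 1, 2, 2, 3, 4, 5, 5, 5])

# Current behaviour: precision=0 still yields half-open float intervals
print(pd.cut(workouts, bins=2, precision=0).categories)

# The requested look, emulated today with explicit edges and string labels
closed = pd.cut(workouts, bins=[-1, 2, 5], labels=["[0, 2]", "[3, 5]"])
print(pd.Series(closed).value_counts(sort=False))
```

The workaround works, but the edges and labels have to be built by hand, which is what the feature request would automate.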
> Discrete bins can be accomplished through the precision parameter (although that has its own issues, https://github.com/pandas-dev/pandas/issues/51532). Apologies for not mentioning this last time. The point of this suggested improvement is to allow closed bins, as per the post title.
Then perhaps

> It would be great if there was an option to specify that data is discrete, so that intervals would be like (14, 28], (28, 42] etc. (Not just integers - for example data which is measured to one dp, it would give (14.3, 28.7], (28.7, 42.4] or similar). The level of discrete-ness can be inferred from the data.

should be removed?
Yes, you're right. I've edited the starting post, apologies for lack of clarity there.
Feature Type
- [x] Adding new functionality to pandas
- [ ] Changing existing functionality in pandas
- [ ] Removing existing functionality in pandas
Problem Description
pd.cut and pd.qcut create intervals and partition the dataset according to those intervals. The intervals are generally open on the left and closed on the right, which is fine for continuous data.
However, with discrete data this can sometimes give slightly ridiculous intervals. Integers are a case in point: if we have numbers from 0 to 99, pd.cut makes intervals like (-0.099, 14.1429], (14.1429, 28.2857].
Here is a MRE:
which outputs the following:
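The MRE itself did not come through in this copy of the issue; assuming the integers 0 to 99 cut into 7 bins, as described above, it would have been along these lines:

```python
import numpy as np
import pandas as pd

# 100 integers from 0 to 99, cut into 7 equal-width bins
binned = pd.cut(np.arange(100), bins=7)
print(binned.categories)
```

The first edge lands below zero because pandas widens the lowest bin slightly (by 0.1% of the range) so that the minimum value is included, giving intervals like (-0.099, 14.143]; the exact number of decimals shown depends on the `precision` argument.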
It would be great if there was an option to specify that data is discrete, so that intervals would be like (14, 28], (28, 42] etc. (Not just integers: for data measured to one decimal place, for example, it would give (14.3, 28.7], (28.7, 42.4] or similar.) The level of discrete-ness can be inferred from the data. [EDIT: at the moment you can achieve something similar using the `precision=` parameter, but this doesn't automatically infer from the data. The main point of this suggestion is the next point, that intervals should be closed.]

A further improvement would then be a parameter to control that the intervals should be fully closed. So [15, 28], [29, 42] etc.
For example if the data is "number of times something happened", intervals like those described would be more intuitive.
It isn't particularly hard to work around this, but it might be a useful feature to add.
Feature Description
Some rough code:
gives
Not perfect because the final group is larger than the others (as a result of using np.floor), but illustrates what I'm getting at.
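The rough code itself is missing from this copy; a sketch in the same spirit (closed integer labels, with `np.floor` of the bin width, so the final group absorbs the remainder) might look like the following, using the 0..99 integer data from the MRE:

```python
import numpy as np
import pandas as pd

data = np.arange(100)   # discrete data, 0..99
n_bins = 7

# Bin width rounded down with np.floor, so the last bin absorbs the remainder
width = int(np.floor((data.max() - data.min() + 1) / n_bins))   # 14 here

# Assign each value a bin index, clipping overflow into the final bin
idx = np.clip((data - data.min()) // width, 0, n_bins - 1)

# Fully closed integer labels: [0, 13], [14, 27], ..., [84, 99]
labels = [
    f"[{data.min() + i * width}, "
    f"{data.max() if i == n_bins - 1 else data.min() + (i + 1) * width - 1}]"
    for i in range(n_bins)
]
binned = pd.Series(pd.Categorical.from_codes(idx, labels, ordered=True))
print(binned.value_counts(sort=False))  # 14 in each bin except 16 in the last
```

As noted above, the last group is wider than the others precisely because `np.floor` rounds the width down.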
Alternative Solutions
There are potentially multiple ways of solving this...
Additional Context
No response