pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Redesign `cut` and `qcut` #10468

Open lorentzenchr opened 10 months ago

lorentzenchr commented 10 months ago

Problem description

The current (v0.18.13) function signatures are (similar in the expression module):

def cut(
    self,
    breaks: list[float],
    labels: list[str] | None = None,
    break_point_label: str = "break_point",
    category_label: str = "category",
    *,
    series: bool = True,
    left_closed: bool = False,
    include_breaks: bool = False,
) -> DataFrame | Series:

def qcut(
    self,
    quantiles: list[float] | int,
    *,
    labels: list[str] | None = None,
    break_point_label: str = "break_point",
    category_label: str = "category",
    series: bool = True,
    left_closed: bool = False,
    allow_duplicates: bool = False,
    include_breaks: bool = False,
) -> DataFrame | Series:

Propositions:

stinodego commented 10 months ago

Thanks for making an issue. I'm picking this up.

magarick commented 10 months ago
  • Align keyword-only arguments, e.g. labels is keyword-only in qcut but allowed as positional in cut.

This was a mistake on my part. Aligning is good.

I don't think that deals with the core issue: using one parameter for two things without ambiguity. probs or some variant, as it was before, isn't right, and adding levels doesn't help. Restating my suggestion from the other thread here for completeness: I think two functions, qcut and qcut_uniform, are best for clarity if we're not sticking with the "traditional" q (which does kind of feel like a cop-out). I also don't like having two parameters where one of them could be None, because then you have to remember an order, or two argument names, and could accidentally try to specify both.

@lorentzenchr Regarding the other thread, I'm sorry if I offended you but I'm genuinely confused about what came off as harsh language. I enjoy salty language but I didn't think I was anywhere close to it here.

orlp commented 10 months ago

Nothing around quantiles has anything to do with probabilities, so that's definitely a -1 from me on naming anything probabilities.

orlp commented 10 months ago

Here's what I'd suggest

def cut(
    self,
    breaks: list[float],
    labels: list[str] | None = None,
    *,
    left_closed: bool = True,
    include_intervals: bool = False,
) -> Series | DataFrame:

def cut_quantiles(
    self,
    breaks: list[float] | int,
    labels: list[str] | None = None,
    *,
    left_closed: bool = True,
    include_intervals: bool = False,
) -> Series | DataFrame:

Changes from current API:

Along with the following change in qcut:

stinodego commented 10 months ago

I agree with most of your proposal, @orlp . I think we should just merge a breaking redesign of cut/qcut in 0.19.0 and highlight it in the release notes. The current design is leading to too many problems.

series arguments removed - always returns a Series if include_intervals is false, and a DataFrame if include_intervals is true.

The expression variants cannot return a DataFrame, they will return a Struct column. The Series methods could also just return a Struct column - although returning a DataFrame might be more ergonomic in some instances. Either way is fine.
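
For illustration, this is roughly how the struct form looks with the current expression API (a hedged sketch; the "brk" field name is the one used later in this thread, and the exact struct layout is an assumption):

import polars as pl

df = pl.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})
# With include_breaks=True the expression yields a struct column,
# whose fields remain individually addressable:
df.select(pl.col("x").qcut(2, include_breaks=True).struct.field("brk"))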

left_closed made True by default, at least in programming this is the default.

I don't know what the reason is for the right-closed default. pandas also has right-closed as a default. If we set left-closed as a default, the parameter should be named right_closed and default to False.

Removal of allow_duplicates and change of semantics. Instead of first choosing the quantile values based on the breakpoints and then filtering (leading to problems), we instead semantically sort the array and split the array according to the breakpoints before assigning labels/intervals.

What would be the performance impact on this? If it's expensive, this might not be the best idea.

lorentzenchr commented 10 months ago

Nothing around quantiles has anything to do with probabilities

That's just not true. Ask any statistician or refer to the literature. Quantiles are, in a way, the inverse of a probability distribution. I would be very interested in any definition of quantiles without the notions of probability, sample, and statistics.

orlp commented 10 months ago

Removal of allow_duplicates and change of semantics. Instead of first choosing the quantile values based on the breakpoints and then filtering (leading to problems), we instead semantically sort the array and split the array according to the breakpoints before assigning labels/intervals.

What would be the performance impact on this? If it's expensive, this might not be the best idea.

If properly implemented, nothing. To find the quantiles themselves you already need to do the sort / quickselect, which also provides you with the partition.

orlp commented 10 months ago

That's just not true. Ask any a statistician or refer to literature. Quantiles are, in a way, the inverse of a probability distribution. I would be very interested in any definition of quantiles without the notions of probability, sample and statistics.

None of that has anything to do with the computation of the quantiles of a particular dataset. Probability, distributions, statistics only come into play when interpreting the meaning of those quantiles, assuming your dataset corresponds to a sample of an underlying distribution. We're discussing the computational method here, not any statistical interpretation of it (which may or may not be relevant for the user).

I would be very interested in any definition of quantiles without the notions of probability, sample and statistics.

Quite simple, for the qth quantile of n elements let r = q(n-1). Then the result is the element with rank r if r is integer, and if it is not let lo be the element with rank floor(r) and hi be the element with rank ceil(r). Then the result is lo + (hi - lo) * frac(r), that is, the linear interpolation of the elements with ranks floor(r) and ceil(r).

Note that the linear interpolation is not necessary if you simply wish to partition the series. All elements with rank r' < r go in one bin, those with rank r' >= r go in the next bin, regardless of whether r = q(n-1) is integer.
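
A minimal sketch of this definition (pure Python, illustrative only; neither function is part of any proposed API):

import math

def quantile(xs: list[float], q: float) -> float:
    # For the qth quantile of n elements, let r = q * (n - 1).
    xs = sorted(xs)
    r = q * (len(xs) - 1)
    lo, hi = xs[math.floor(r)], xs[math.ceil(r)]
    # If r is an integer, lo == hi and the interpolation is a no-op.
    return lo + (hi - lo) * (r - math.floor(r))

def partition(xs: list[float], q: float) -> tuple[list[float], list[float]]:
    # No interpolation: ranks below r go in the first bin, the rest in the second.
    xs = sorted(xs)
    r = q * (len(xs) - 1)
    return xs[:math.ceil(r)], xs[math.ceil(r):]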

lorentzenchr commented 10 months ago

We're discussing the computational method here, not any statistical interpretation of it (which may or may not be relevant for the user).

First, this issue is not about computational methods. Second, I am a user of quantiles and of qcut. Let's say I want 3 equally spaced (in probability space) quantiles, i.e. the 25%-, 50%- and 75%-quantiles. 0.25, 0.5 and 0.75 are exactly the probabilities, or the frequencies in the data/sample, of another value being smaller than or equal to the corresponding quantile. Therefore this argument needs a name, and the R language in particular makes it clear that it is the probability level of the quantile to be calculated. In other words, one calculates empirical quantiles of a given sample/data.

Quite simple ... Then the result is the element with rank r ...

So you are using the order statistic of a given sample/data set. That's statistics with a notion of probability.

orlp commented 10 months ago

In other words, one calculates empirical quantiles of a given sample/data. [...] So you are using the order statistic of a given sample/data set. That's statistics with a notion of probability.

No, again, you are applying a statistical interpretation to it. I quote from Wikipedia (emphasis mine):

In statistics, the kth order statistic of a statistical sample is equal to its kth-smallest value.

We are talking about computational methods on Series here. Not statistical methods on StatisticalSample. Not all Series are statistical samples.

If you view the input as a statistical sample with an underlying distribution, then sure, yes, the computational method can be interpreted to compute a statistical quantity. But your reasoning is circular, you derive "this is a statistical/probabilistic process" starting from the assumption... that this is a probabilistic sample of an underlying distribution. Which is just not necessarily true.

As a counterexample, cut_quantiles(breaks=3) can be used on a Series containing the date-range of [2022-01-01, 2023-01-01) (so 365 values) to get a classification of dates in that year to quarters. This is a perfectly valid application of cut_quantiles, but there is no sample nor an underlying distribution of a probabilistic process.
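
A rough rendering of that example against the current API, with day ordinals standing in for dates since the proposed cut_quantiles does not exist yet (a sketch, not the proposed method):

import polars as pl

# 365 day ordinals standing in for the dates of 2022.
days = pl.Series("day", range(365))
# Four equal-frequency bins, i.e. quarters.
quarters = days.qcut([0.25, 0.5, 0.75], labels=["Q1", "Q2", "Q3", "Q4"])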

Either way, I think breaks is a perfectly fine name for the parameter specifying at which points we want to partition the data after sorting in all cases (whether you give that a probabilistic interpretation or not).

stinodego commented 10 months ago

Let's go with breaks. @orlp could you set up a PR for this?

lorentzenchr commented 10 months ago

To be honest, I really do not like this discussion.

There is the issue of having a separate parameter to say: "Hello qcut, I just want 5 equally spaced quantiles." This deserves its own parameter; something starting with n_ seems appropriate to me to express "n, the number of breaks", maybe just n_breaks.

Then there is the point of specifying which quantiles exactly. For the naming, I would at least look at what other libraries and languages do:

The paper Sample Quantiles in Statistical Packages is THE reference on how to calculate quantiles.

So it seems evident to me that there are probabilities involved. As the logic of qcut is to first compute quantiles and then perform a digitization step, the original meaning of quantiles still holds.

BTW, I really feel offended by someone repeatedly saying quantiles have nothing to do with probabilities. For context: for a statistician, a set of data points, e.g. a polars.Series (no matter its type), is just called a sample.

More disclosure: I teach at the department of mathematics at a university.

orlp commented 10 months ago

Then there is the point of specifying which quantiles exactly. For the naming, I would at least look at what other libraries and languages do:

None of the linked functions perform a quantile partition. Pandas' qcut just says:

Number of quantiles. 10 for deciles, 4 for quartiles, etc. Alternately array of quantiles, e.g. [0, .25, .5, .75, 1.] for quartiles.

The paper Sample Quantiles in Statistical Packages is THE reference on how to calculate quantiles.

We aren't computing quantiles here. I have no objection to using this nomenclature on the quantile function which does not really have a non-statistical interpretation for a continuous input variable.

Perhaps we should drop the quantile nomenclature here altogether to avoid the confusion? cut_ranked? cut_sorted? Any other suggestions? Or maybe cut and qcut should be renamed to partition and partition_sorted?

As the logic of qcut is to first compute quantiles and then doing a digitization step, the original meaning for quantiles still holds.

But this is wrong (in the proposed redesign, it does match the current problematic implementation), and precisely what caused a variety of issues around duplicates. Consider the array [0, 1, 1, 1, 1, 1, 1, 2]. If you apply your method with qcut([0.5]) you would get partition [[0, 1, 1, 1, 1, 1, 1], [2]] or [[0], [1, 1, 1, 1, 1, 1, 2]] depending on which side you put equal elements. The answer we really want for this method is [[0, 1, 1, 1], [1, 1, 1, 2]], which can not be achieved by first computing quantiles and then performing digitization.
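
The two semantics, spelled out as a sketch (pure Python; the median of this array is hardcoded to 1):

data = [0, 1, 1, 1, 1, 1, 1, 2]
xs = sorted(data)
median = 1  # the 50% quantile of this array

# Current semantics: compute the quantile value, then digitize against it.
digitized = [[x for x in xs if x < median], [x for x in xs if x >= median]]
# -> [[0], [1, 1, 1, 1, 1, 1, 2]]

# Proposed semantics: sort, then split the positions at rank 0.5 * len(xs).
k = len(xs) // 2
by_rank = [xs[:k], xs[k:]]
# -> [[0, 1, 1, 1], [1, 1, 1, 2]]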

BTW, I really feel offended by someone repeatedly saying quantiles have nothing to do with probabilities.

I should clarify, I meant not necessarily having something to do with probability. Obviously in a statistical context there are probabilities involved. And of course it's nothing personal.

orlp commented 10 months ago

@lorentzenchr Perhaps you find my argument more convincing if you consider that both 'cut' and 'qcut' are perfectly well-defined functions over series of strings.

orlp commented 10 months ago

Another bikeshed suggestion: bin and bin_sorted?

lorentzenchr commented 10 months ago

A cut_rank or cut_partition would be fine and avoid some confusion. But to me, it is a different/additional function doing something different.

Modern gradient boosting libs like XGBoost, LightGBM and scikit-learn use histograms and perform exactly the operation of qcut with quantiles as bin boundaries: 1. Compute quantiles (and save them) 2. Bin the data according to those thresholds. The proposed logic of cut_rank would not work here, because a single value like 1 needs to always be in the same bin.
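
A sketch of that two-step scheme with numpy (illustrative data; np.digitize plays the role of the binning step):

import numpy as np

x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 10.0])

# 1. Compute the quantile thresholds once and save them.
thresholds = np.quantile(x_train, [0.25, 0.5, 0.75])

# 2. Bin any data against the saved thresholds; a given value always
#    lands in the same bin, regardless of the batch it arrives in.
train_bins = np.digitize(x_train, thresholds)
new_bins = np.digitize(np.array([1.0, 5.0]), thresholds)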

lorentzenchr commented 10 months ago

@lorentzenchr Perhaps you find my argument more convincing if you consider that both 'cut' and 'qcut' are perfectly well-defined functions over series of strings.

As long as strings have an order/are sortable (which they are), they fully qualify as a sample on which to calculate quantiles with specified probability levels.

orlp commented 10 months ago

A cut_rank or cut_partition would be fine and avoid some confusion. But to me, it is a different/additional function doing something different.

Modern gradient boosting libs like XGBoost, LightGBM and scikit-learn use histograms and perform exactly the operation of qcut with quantiles as bin boundaries: 1. Compute quantiles (and save them) 2. Bin the data according to those thresholds. The proposed logic of cut_rank would not work here, because a single value like 1 needs to always be in the same bin.

I think that's a fair point, I will give it some more thought.

As long as strings have an order/are sortable (which they are), they fully qualify as a sample on which to calculate quantiles with specified probability levels.

That doesn't make sense to me. What would be the quantile with probability level 0.5 for the sample ["bar", "baz", "foo", "moo"]?

orlp commented 10 months ago

Alright, proposed redesign, take two. We remove cut and qcut (which I honestly feel are pretty bad names anyway) in favor of three binning functions:

def bin_intervals(
    self,
    intervals: list[Any] | int,
    labels: list[str] | None = None,
    *,
    include_intervals: bool = False,
    right_closed: bool = False,
) -> Series | DataFrame:

def bin_quantiles(
    self,
    quantiles: list[float] | int,
    labels: list[str] | None = None,
    *,
    include_intervals: bool = False,
    right_closed: bool = False,
    duplicates: str = "first",
) -> Series | DataFrame:

def bin_ranks(
    self,
    ranks: list[float] | int,
    labels: list[str] | None = None,
    *,
    include_intervals: bool = False,
) -> Series | DataFrame:

Changes from current API:

Along with the following change to cut:

The following change to qcut:

And a new function:

How does that sound @lorentzenchr ?

stevenlis commented 10 months ago

Respectfully, this lengthy discussion/argument appears very unproductive.

Everything in the cut function works as expected. However, the OP pointed out some issues with the qcut function, and @stinodego also mentioned some legitimate concerns about its current behavior.

But I don't understand why quantile_levels is considered clearer than quantiles. Why do we need to change left_closed to right_closed?

cut is typically used with user-defined breakpoints, which can be completely arbitrary. For example, if you want to cut an income variable into high, middle, or low, you can define those groups as you wish. However, qcut takes into consideration the distribution of the variable, so it functions differently. These are two distinct functions, and I don't understand why their parameters must be identical.

I'm unsure how an argument regarding the connection between quantiles and probability led to the need for changing two existing function names, their parameter names, and their default values, which breaks everything.

@orlp Everything in cut works as expected. cut is a well-established name in statistical software. Once you've used it, I don't think you'll be confused about its purpose based on its name. I hope you don't have to change anything, but of course a Series method that "always returns a Series" makes sense. You mentioned the need for equal-sized portions. However, this task should be handled by qcut (like in pandas). If polars' qcut currently cannot achieve this, it should be incorporated somehow. Otherwise, I think you can just let breaks take both an int and a list, but I seriously don't think we need to change its name back to intervals again.

orlp commented 10 months ago

I'm unsure how an argument regarding the connection between quantities and probability led to the need for changing two existing function names, their parameter names, and their default values, which breaks everything.

It didn't. It's an accumulation of https://github.com/pola-rs/polars/issues/10525, https://github.com/pola-rs/polars/issues/10483, https://github.com/pola-rs/polars/issues/10524, https://github.com/pola-rs/polars/issues/9854, this issue, etc.

Everything in cut works as expected.

Some of the above issues disagree with that.

cut is a well-established name in statistical software.

I know it is found in pandas; do you have other examples? Either way, I still think it's a bad name, and newcomers will search for 'binning' much sooner than 'cutting'. Hell, even the docstring of pd.cut:

Bin values into discrete intervals.

immediately abandons the cut nomenclature and goes straight to the more understandable concept of binning.

These are two distinct functions, and I don't understand why their parameters must be identical.

Not identical, but definitely related, and should be consistent.

You mentioned the need for equal-sized portions. However, this task should be handled by qcut (like in pandas).

Absolutely not. Binning into equal-sized quantiles is completely different than binning based on equal-sized intervals for skewed distributions. And both are different (as @lorentzenchr pointed out with regards to assigning consistent labels to identical values) from equal-sized binning based on rank. Hence, three binning functions.

Example: if we let l = [0, 0, 0, 0, 0, 1, 1.3, 1.6, 2, 3], we find that

l.bin_intervals(4) bins to [[0, 0, 0, 0, 0], [1, 1.3, 1.6], [2], [3]]
l.bin_quantiles(4) bins to [[0, 0, 0, 0, 0], [], [1, 1.3], [1.6, 2, 3]]
l.bin_ranks(4) bins to [[0, 0, 0], [0, 0, 1], [1.3, 1.6, 2], [3]]

mcrumiller commented 10 months ago

FYI Matlab uses discretize which I think is a pretty good name.

stevenlis commented 10 months ago

@orlp I don't see much relevance between the linked issues and my point. I didn't say you shouldn't fix anything in cut. Inconsistency doesn't necessarily mean they are both bad; it simply means they are different. I just don't think you need to do a complete overhaul on it. I happen to know pandas, Stata, and R, all of which have a cut function. However, if you are not fond of it, that's your preference. It seems like I misunderstood what you are trying to achieve with "allow intervals to be passed an integer" in cut. I have nothing constructive to add. Despite the disagreement, I can definitely see that your idea is not baseless but well-thought-out.

magarick commented 10 months ago

This is, um, "colorful". I'll leave the micro-bike-shedding and just state some high level facts and observations.

  1. cut is used in R and Julia (maybe others too) in addition to pandas. It was also used in Polars before I rewrote the functions to be faster and work as expressions. The point of language is to understand and be understood. For better or worse, this is a common idiom.
  2. Mentioning bin in the documentation however, is pandas only. I don't remember if that phrasing was there before me or if I added it, but either way I think it comes from the assumption that many pypolars users are familiar with pandas.
  3. The name qcut also appears unique to pandas as far as I can tell. But for better or worse, pandas is popular and a lot of people know what it means.
  4. I couldn't find anything (at least not easily) on the history of the name "cut" but I would guess it has to do with "bin" and "discretize" having multiple other meanings/connotations and trying to find a word to refer to a specific kind of binning/discretization. And someone may want to add these other methods one day.
  5. There is a function to divide data into equally sized groups based on rank, ignoring quantiles/repeated values. In both R and SQL it's called ntile, and it's been requested here before. Kind of a weird name, but I get what they were going for, and it's what a lot of people will know it as (a rough sketch follows this list).
  6. There are yet more variants of discretization that will introduce their own issues.
  7. Skew isn't related to whether qcut and ntile will be different. It's repeated values that are the enemy.
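
A rough sketch of ntile semantics as found in R/SQL (assumed behavior: near-equal group sizes by rank, ties broken by position):

import numpy as np

def ntile(values, k):
    # Rank-based split into k near-equal groups, ignoring tied values.
    order = np.argsort(values, kind="stable")
    groups = np.empty(len(values), dtype=int)
    for g, chunk in enumerate(np.array_split(order, k)):
        groups[chunk] = g + 1  # 1-based, like SQL's NTILE
    return groups

ntile([0, 0, 0, 0, 0, 1, 1.3, 1.6, 2, 3], 4)
# -> array([1, 1, 1, 2, 2, 2, 3, 3, 4, 4]); group sizes 3, 3, 2, 2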

I think the type of debate going on here is losing sight of the big picture. People will get used to, and be comfortable with, all but the most wildly confusing and unreasonable interfaces. What they're mostly asking for is more features and more flexibility. I've had a PR outstanding for quite a while to improve the labels that seems to be languishing for no good reason. And I've been stalled in adding other things people want here because of it. These are commonly used functions so I know it'll attract a lot of attention, but I think it's more important to make sure people using them can do what they need to rather than everyone being upset about the interface not conforming exactly to their personal tastes.

xuJ14 commented 9 months ago

First,

Renamed allow_duplicates to duplicates with three possible strategies: "first" bins all values equal to multiple quantiles into the first such quantile, "last" is similar but bins them into the last valid quantile, and "raise" raises an exception if there are duplicate quantiles. The default is "first".

I think 'drop' should be added, since sometimes we do not want multiple values for the breakpoint. In pandas, the relevant parameter is:

duplicates : {default 'raise', 'drop'}, optional. If bin edges are not unique, raise ValueError or drop non-uniques.

For polars version 0.18.15, if we set allow_duplicates=False and quantiles=2 and all the values are the constant 1, we get (-inf, 1] for this quantile, which does not make any sense since the edges are not unique. The result should be null, as it is in pandas.

Second, I think the range of the total quantile should be $[\min, \max]$, not $(-\infty, \infty)$. That is to say, the 0th quantile and the largest quantile should be the min and max. (Or, for a right_closed interval, it should be $(\min - \delta, \max]$.)

lorentzenchr commented 9 months ago

What would be the quantile with probability level 0.5 for the sample ["bar", "baz", "foo", "moo"]?

We have "bar" < "baz" < "foo" < "moo", therefore the 50%-quantile is just "baz" (compare with np.quantile(np.arange(4), 0.5, method="inverted_cdf") which equals 1).

But I don't understand why quantile_levels is considered clearer than quantiles.

Because one really specifies the probability level of the quantiles, not the quantiles themselves. Let's say I want qcut with just one break at the 50%-quantile; I would specify x.qcut([0.5]). This then computes the 50%-quantile of x. Long story short: np.quantile(x, 0.5) is a quantile, 0.5 is the corresponding quantile/probability level. Those are two different quantities, hence two different names.

My intention was not to change everything, just align the parameters and behaviour of cut and qcut as much as it makes sense and also fix some naming issues (quantiles and probability/quantile level).

If I were to start from scratch, I would sympathize with https://github.com/pola-rs/polars/issues/10468#issuecomment-1682348427 because I like the explicit bin in the name. But in the end, this is maybe just a documentation issue. BTW, numpy calls it numpy.digitize.

orlp commented 9 months ago

@xuJ14

I think 'drop' should be added, since sometimes we do not want multiple values for the breakpoint.

That's what first / last do, they just allow more flexibility than drop (which is unclear about which breakpoint and thus which label is kept).

For polars version 0.18.15, if we set allow_duplicates=False and quantiles=2, and all the values are constant, 1. We get (-inf, 1] for this quantile, which does not make any sense since edges is not unique. The result should be Null, and it sure does in pandas.

I have no intention of silently introducing nulls here without a very good reason. Can you give some example code that shows exactly what you'd want and why?

I think the range of the total quantile should be [minimum, maximum].

The problem is that this doesn't really work with half-open intervals. Adding arbitrary deltas does not seem great to me. But I'll consider it.

@lorentzenchr

We have "bar" < "baz" < "foo" < "moo", therefore the 50%-quantile is just "baz" (compare with np.quantile(np.arange(4), 0.5, method="inverted_cdf") which equals 1).

That's fine by me, although it does introduce an arbitrary downwards bias by flooring rather than rounding, assuming you are referring to Method 1 from H&F as numpy describes it. Because the 65% quantile in that definition would still be "baz", right?

My intention was not to change everything, just align the parameters and behaviour of cut and qcut as much as it makes sense and also fix some naming issues (quantiles and probability/quantile level).

I understand, but we've been receiving a lot of different issues surrounding the cut and qcut functions for quite a while now; this was more the straw that broke the camel's back and triggered a redesign.

If I were to start from scratch, I would sympathize with https://github.com/pola-rs/polars/issues/10468#issuecomment-1682348427 because I like the explicit bin in the name. But in the end, this is maybe just a documentation issue. BTW, numpy calls it numpy.digitize.

That's good to hear. As a compromise I will make sure to mention quantile level / probability level in the documentation, but for consistency with the other methods (intervals, quantiles, ranks) and brevity I will call the parameter itself quantiles.

Also after talking with @stinodego another advantage of calling the new functions bin_intervals, bin_quantiles, and bin_ranks is that we can deprecate but keep around the old functions for a while so people have a transition period.

xuJ14 commented 9 months ago

@orlp Thanks.

I think 'drop' should be added, since sometimes we do not want multiple values for the breakpoint.

That's what first / last do, they just allow more flexibility than drop (which is unclear about which breakpoint and thus which label is kept).

For polars version 0.18.15, if we set allow_duplicates=False and quantiles=2, and all the values are constant, 1. We get (-inf, 1] for this quantile, which does not make any sense since edges is not unique. The result should be Null, and it sure does in pandas.

I have no intention of silently introducing nulls here without a very good reason. Can you give some example code that shows exactly what you'd want and why?

These two parts are about the same topic, and 'drop' is not the same as first or last. Please follow this code, and you will see the meaning of 'drop' here.

Pandas:

import pandas as pd

df = pd.DataFrame({
    'value': [1] * 10
})
pd.qcut(df['value'], 2, labels=False, duplicates='drop')

Polars:

import polars as pl

pl.from_pandas(df).select(pl.col('value').qcut(2, allow_duplicates=False))

'drop' is quite useful when we do not want to keep points for which it is hard to choose which quantile bin they should be in.

What I hope is just that the polars code above can produce the same result as pandas does. Thank you.

As for the delta, I think we can add a parameter: precision.

orlp commented 9 months ago

Thinking about it some more, the first/last idea was kinda nonsense. There is a unique answer, and it's based on whether right_closed is true. If false (the default) you would get [-inf, 1), [1, inf), which gives the unique answer that the value should be placed in [1, inf). If you add more intervals, e.g. bin_quantiles(4) on [1, 1, ..., 1, 1], you would get bins [-inf, 1), [1, 1), [1, 1), [1, inf). The [1, 1) intervals are empty, so all values naturally get placed in [1, inf).

right_closed = True would conversely place all elements in (-inf, 1], since the others ((1, 1] and (1, inf]) do not contain 1.
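
The placement is mechanical; a sketch with np.searchsorted (illustrative only, not the actual implementation):

import numpy as np

values = np.array([1, 1, 1, 1, 1])
breaks = np.array([1, 1, 1])  # duplicated breaks from bin_quantiles(4) on all 1s

# Left-closed bins (-inf,1), [1,1), [1,1), [1,inf): every 1 sorts past
# all three breaks, so everything lands in the last bin.
np.searchsorted(breaks, values, side="right")  # -> [3, 3, 3, 3, 3]

# Right-closed bins (-inf,1], (1,1], (1,1], (1,inf]: every 1 sorts before
# all breaks, so everything lands in the first bin.
np.searchsorted(breaks, values, side="left")   # -> [0, 0, 0, 0, 0]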

@xuJ14 Based on this logic, it being "hard to choose which quantile they should be in" is never a thing. It's always determined exactly by the nature of the half-open intervals. So perhaps the duplicates parameter should be renamed to empty_bin with different behaviors (e.g. ignore, warn and raise), with the default probably being warn when one or more of the bins is empty.

What I hope is just that the previous polars code can have the same result as pandas do.

Why? Pandas here just starts returning NaNs. I can't imagine a scenario where a data point is so common that multiple quantile bins collapse into one... you just want to throw all of it out? Pandas' behavior here just seems awful to me. Look how it does different things for each of these cases:

import pandas as pd

pd.qcut(pd.Series([1, 1, 1, 1, 1]), 2, duplicates='drop') # 1
pd.qcut(pd.Series([0, 1, 1, 1, 1]), 2, duplicates='drop') # 2
pd.qcut(pd.Series([1, 1, 1, 1, 2]), 2, duplicates='drop') # 3
pd.qcut(pd.Series([0, 1, 1, 1, 2]), 2, duplicates='drop') # 4

In every single case above we split on the median, which is unambiguously 1. In my opinion in every case you should get exactly two bins, [-inf, 1) and [1, inf) (or right-closed intervals if that option is set), which makes the behavior consistent in every single case. What pandas does instead is

  1. Generates no bins at all, returning NaNs.
  2. Generates a single bin (-0.001, 1.0] containing everything. Why the arbitrary introduction of -0.001?
  3. Generates a single bin (0.999, 2.0] containing everything. Again, arbitrary 0.999.
  4. Generates two bins (-0.001, 1.0] and (1.0, 2.0]. Note that now 2 is in a different bin than 1, but they were grouped in the same bin in case 3! Why are elements suddenly grouped differently... for the same splitpoint, depending on irrelevant other data points?!

After evaluating the nonsense above, I have no intention to replicate what pandas does here.

As for the delta, I think we can add a parameter: precision.

I really don't like this idea, the more I think about it. It's just arbitrary and leads to bad edge cases.

xuJ14 commented 9 months ago

I see the problem. You are saying that the 0th quantile is -inf and the 100th is inf. But in most applications (Excel, R, Stata, etc.), the 0th is the minimum and the 100th is the maximum. Now let's go back to [1, 1, 1, ..., 1]: we have duplicate breaks, right? Moreover, going by the concept of qcut(2), how could you evenly separate the 1's into two groups if the domain contains only {1}?

Now if we consider the minimum to be the 0th quantile and the maximum to be the 100th quantile, all of your pandas examples seem more reasonable, right? And your opinion that all the cases' results should be [-inf, 1) and [1, inf) does not seem right to me.

How about just making this qcut function more general and letting the user decide what to do? Thanks for your thought-provoking idea.

BTW, Sample Quantiles in Statistical Packages is a great reference.

orlp commented 9 months ago

@xuJ14 No, I'm not saying the 0th quantile is -inf or 100th is inf. bin_quantiles(2) would be equivalent to bin_quantiles([0.5]), and bin_quantiles(4) would be equivalent to bin_quantiles([0.25, 0.5, 0.75]), not bin_quantiles([0, 0.25, 0.5, 0.75, 1]). That is, bin_quantiles is only defined using break points. It is implied that we are splitting the entire distribution, thus -inf and inf are endpoints of respectively the least and greatest bin.

We could make it so that for right_closed=False all except the last interval are half-open, and for right_closed=True all except the first interval are half-open. Then we would not need infinities and would still remain consistent. I kind of like that. So for left-closed we would split the interval [min, max] into

bin_quantiles(1) => [min, max]
bin_quantiles(2) => [min, a), [a, max]
bin_quantiles(3) => [min, a), [a, b), [b, max]
...

where a, b, etc are the specified quantiles. For right-closed we would split the interval [min, max] as such:

bin_quantiles(1) => [min, max]
bin_quantiles(2) => [min, a], (a, max]
bin_quantiles(3) => [min, a], (a, b], (b, max]
...
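
For concreteness, a sketch of how such interval labels could be built from min, max, and the breaks (hypothetical helper, left-closed case):

def interval_labels(vmin, vmax, breaks):
    # [min, a), [a, b), ..., [z, max] -- only the last interval is closed.
    edges = [vmin, *breaks, vmax]
    labels = [f"[{lo}, {hi})" for lo, hi in zip(edges, edges[1:])]
    labels[-1] = labels[-1][:-1] + "]"
    return labels

interval_labels(0, 10, [3, 7])  # ['[0, 3)', '[3, 7)', '[7, 10]']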

Now if we consider the minimum to be 0 quantile and maximum to be 100 quantile, all of your pandas example seems more reasonable right?

No, I don't think so.

lorentzenchr commented 9 months ago

Maybe a further remark on the usage of qcut in ML pipelines. There are usually 2 steps:

  1. Training
    • Calculate quantiles
    • Save them as thresholds/bin boundaries
    • Bin the data according to the thresholds/bin boundaries
  2. Prediction
    • Bin the data according to the learned thresholds/bin boundaries

This means that ML pipelines can never use qcut as it does not return the (internally) computed quantiles themselves.
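
For concreteness, a sketch of what such a pipeline has to do by hand with today's API (illustrative names; the thresholds are computed one level at a time, since quantile takes a single level):

import polars as pl

train = pl.Series("x", [0.0, 1.0, 1.0, 2.0, 3.0, 10.0])

# Training: compute and persist the quantile thresholds explicitly.
levels = [0.25, 0.5, 0.75]
thresholds = [train.quantile(q) for q in levels]

# Prediction: bin new data against the learned thresholds via cut.
new = pl.Series("x", [0.5, 1.0, 5.0])
binned = new.cut(breaks=thresholds)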

orlp commented 9 months ago

@lorentzenchr In my rewrite I'm writing an efficient quantiles implementation that computes all the required quantiles at once. Perhaps we could expose that in the public API at some later point (either as its own function or by allowing a list of quantile levels for pl.quantile) after which I believe your workflow could be covered efficiently using that function followed by bin_intervals.

markxwang commented 2 months ago

Hi @orlp, just to check if you are still working on this?

The whole (scorecard) credit risk modelling/insurance community is waiting anxiously for this feature 😅.

As discussed above, we would need a way to calculate the list of quantiles (breakpoints) for all columns during model training stage. This is currently not available with pl.quantile, which takes just one quantile.

Instead, I'm using the following expression to generate these:

pl.col("A")
.qcut(q, allow_duplicates=True,include_breaks=True)
.struct.field("brk")
.unique()
.sort()
.implode()
.list.shift()
.list.drop_nulls()
.alias("A")

which works, but is definitely not optimal... In the second stage, the list of breakpoints gets passed to the cut function, which works fine.

I would very much like to see this redesign happen, and would like to contribute as well, but I'm pretty much a Rust newbie (just started this week!)... Can you kindly give some advice on the best way to participate?

orlp commented 2 months ago

@markxwang I am still planning on tackling this but I've been working on other higher-priority stuff at the moment.

mishavanbeek commented 1 month ago

Hi, I have been working a lot with cut and qcut and noticed some issues (https://github.com/pola-rs/polars/issues/15781 and https://github.com/pola-rs/polars/issues/15670). Both seem to stem from the use of categorical output and the combination with grouping.

With the redesign discussed here (and any change to these functions being breaking in any case), I would like to propose the following.

Case 1: labels=None, include_breaks=False. The integer index of the bucket is returned (instead of a string that's slow to process).
Case 2: labels=[list_of_labels], include_breaks=False. The integers are replaced with the elements of the list as a categorical variable (across all groups). This is basically the same as running .replace with a dict of labels on the output from case 1.
Case 3: labels=None, include_breaks=True. The function returns a struct with an int for the bucket, a float for the lower bound of the bucket, and a float for the upper bound. Having both bounds present is important for downstream processing (for example when doing linear interpolation on the buckets as part of a huberization).
Case 4: labels=[list_of_labels], include_breaks=True. Same as case 3 but with the ints in the struct replaced by a categorical variable.

In all cases a null value should give nulls across the row, and a group or dataframe with only nulls should return nulls.

This approach should cover all existing functionality, in a slightly different way. It is easy to produce the current string-based output, should this be necessary for backward compatibility. Most importantly, it does away with the category concatenation issues and issues with null values that cause bugs. Whenever categories are used this is always on a fixed set that is known in advance (the labels). The approach is fully consistent between qcut and cut.

To illustrate more concretely:

import polars as pl

df = pl.DataFrame({
    "group": [1, 1, 2],
    "value": [1.0, 3.0, 2.0],
})

Without breaks:

df.with_columns(pl.col("value").qcut([0.5], include_breaks=False))
shape: (3, 2)
┌───────┬───────┐
│ group ┆ value │
│ ---   ┆ ---   │
│ i64   ┆ i64   │
╞═══════╪═══════╡
│ 1     ┆ 1     │
│ 1     ┆ 2     │
│ 2     ┆ 1     │
└───────┴───────┘

And with breaks:

df.with_columns(pl.col("value").qcut([0.5], include_breaks=True))
shape: (3, 2)
┌───────┬──────────────┐
│ group ┆ value        │
│ ---   ┆ ---          │
│ i64   ┆ struct[3]    │
╞═══════╪══════════════╡
│ 1     ┆ {1,-inf,2.0} │
│ 1     ┆ {2,2.0,inf}  │
│ 2     ┆ {1,-inf,2.0} │
└───────┴──────────────┘

The labeled cases should be self-explanatory.

I don't know any Rust, but from looking at the codebase it seems that all the output is there to make this work.