Open sfc-gh-joshi opened 2 months ago
take
I found an additional example with groupby(sort=False)
:
>>> data = {
... "by": ["male", "male", "female", "male", "female", "male"],
... "value1": ["low", "medium", "high", "low", "high", "low"],
... "value2": ["US", "FR", "US", "FR", "FR", "FR"],
... }
>>> pd.DataFrame(data).groupby(by=["by"], sort=False).value_counts()
by value1 value2
male low FR 2
US 1
medium FR 1
female high US 1
FR 1
Name: count, dtype: int64
The female high US 1
row has moved below male medium FR 1
despite appearing earlier in the input frame.
Thanks for the report. DataFrameGroupBy.value_counts was designed to mirror DataFrame.value_counts. In particular, for the sort argument as documented here:
Sort by frequencies when True. Sort by DataFrame column values when False.
The documentation for the groupby case should make this clear. I've reworked this issue as a documentation issue.
take
Sort by frequencies when True. Sort by DataFrame column values when False.
What's the expected behavior for when both the groupby's sort argument and the value_counts sort argument are False? Right now it seems to not sort by the column values (as in the example from my original post):
>>> df = pd.DataFrame({"X": ["B", "A", "A", "B"], "Y": [4, 1, 3, -1]})
>>> df.groupby("X", sort=False).value_counts(sort=False)
X Y
B 4 1
A 1 1
3 1
B -1 1
Name: count, dtype: int64
There seems to be an undesirable disagreement between Series.value_counts
and DataFrame.value_counts
. For Series.value_counts
:
Sort by frequencies when True. Preserve the order of the data when False.
Whereas DataFrame is:
Sort by frequencies when True. Sort by DataFrame column values when False.
It seems to me these should behave the same. I propose to change DataFrame
behavior with sort=False
should preserve the order of the input data. This is because:
sort=False
behavior would be readily achievable by following the call with .sort_index()
.sort=False
for other methods (e.g. df.groupby(..., sort=False)
).Series.value_counts
sees more activity than DataFrame.value_counts
.Currently, it appears to me the Series
behavior is not readily achievable in groupby. This is because we may need to sort by the groupings but not sort by the other columns in the DataFrame. However, I expect it should not be difficult to accomplish (but am not certain).
If we are to agree that sort=False
should preserve the order of the input in all cases, then it appears to me .groupby(..., sort=False).value_counts(sort=False)
behavior is currently correct, but .groupby(..., sort=True).value_counts(sort=False)
is not.
@pandas-dev/pandas-core - I'd like thoughts on the proposal in the comment immediately above.
Assuming the proposal is acceptable, I think:
DataFrame.value_counts
(especially since sort=False
is not the default);.sort_index()
)warrants making this change as a bugfix and without deprecation.
My initial reaction is that I agree
warrants making this change as a bugfix and without deprecation.
sure, so long as it's in 3.0 (I wouldnt' be in favour of backporting to 2.3)
I found a separate, possibly-related issue: for groupby(sort=False).value_counts(sort=True, normalize=True)
, normalization is applied after the frequency column is sorted. This isn't a concern for plain DataFrame
/Series.value_counts
since those normalize against the size of the whole dataset, but the distinction is actually meaningful for GroupBy
since it normalizes against the size of each group.
>>> import pandas as pd
>>> data = {"by": ["a", "a", "a", "b", "b", "b", "b", "b", "b"], "values": [1, 1, 0, 1, 1, 1, 0, 0, 0]}
>>> df = pd.DataFrame(data)
>>> df.groupby(by="by", sort=False).value_counts(sort=True, normalize=True)
by values
b 1 0.500000
0 0.500000
a 1 0.666667
0 0.333333
Name: proportion, dtype: float64
>>> df.groupby(by="by", sort=False).value_counts(sort=True, normalize=False)
by values
b 1 3
0 3
a 1 2
0 1
Name: count, dtype: int64
Is this a documentation issue, or a behavioral one? Either way, should I file this as a separate bug, or is this within scope for your proposal @rhshadrach?
@sfc-gh-joshi - the documentation for sort=True
states Sort by frequencies
. Isn't that the current behavior? In any case, this appears separate to me, can you open a new issue.
the documentation for sort=True states Sort by frequencies.
That's true, it just feels counterintuitive that there's a scenario where sort=True
where the column in question is not sorted in the expected fashion. I've filed #59725 for further discussion.
@pandas-dev/pandas-core - I'd like thoughts on the proposal in the comment immediately above.
Assuming the proposal is acceptable, I think:
- the inconsistency of the API;
- the expected infrequent use of
DataFrame.value_counts
(especially sincesort=False
is not the default);- the impact on users code if not changed (would have to be sensitive to row-order); and
- the ease of changing back to previous behavior (adding
.sort_index()
)warrants making this change as a bugfix and without deprecation.
+1
Agreed that for the above mentioned reasons (https://github.com/pandas-dev/pandas/issues/59307#issuecomment-2302969460), it sounds good to do this as a breaking change without deprecation, and then best done in 3.0.
Pandas version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
(possibly related to #55951)
When a GroupBy.value_counts operation is performed, the order of rows within the non-grouping columns does not respect the order of elements in the original frame.
In this example, with value_counts(sort=False) I would expect the row
B 4 1
to appear aboveB -1 1
, as that would correspond go their order in the original frame.When the groupby has sort=False set, this order is respected.
Expected Behavior
Calling groupby(sort=True).value_counts(sort=False) should preserve the order of members within groups, or documentation should be changed to reflect this.
Installed Versions