pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.42k stars 17.85k forks source link

BUG: GroupBy.value_counts doesn't preserve original order for non-grouping rows #59307

Open sfc-gh-joshi opened 2 months ago

sfc-gh-joshi commented 2 months ago

Pandas version checks

Reproducible Example

import pandas as pd
df = pd.DataFrame({"X": ["B", "A", "A", "B"], "Y": [4, 1, 3, -1]})
df.groupby("X", sort=False).value_counts(sort=False)  # works as expected
df.groupby("X", sort=True).value_counts(sort=False)  # column "Y" does not match original order

Issue Description

(possibly related to #55951)

When a GroupBy.value_counts operation is performed, the order of rows within the non-grouping columns does not respect the order of elements in the original frame.

In this example, with value_counts(sort=False) I would expect the row B 4 1 to appear above B -1 1, as that would correspond go their order in the original frame.

>>> df = pd.DataFrame({"X": ["B", "A", "A", "B"], "Y": [4, 1, 3, -1]})
>>> df.groupby("X", sort=False).value_counts(sort=False)
X  Y 
A   1    1
    3    1
B  -1    1
    4    1
Name: count, dtype: int64

When the groupby has sort=False set, this order is respected.

>>> df.groupby("X", sort=False).value_counts(sort=False)
X  Y 
B   4    1
A   1    1
    3    1
B  -1    1
Name: count, dtype: int64

Expected Behavior

Calling groupby(sort=True).value_counts(sort=False) should preserve the order of members within groups, or documentation should be changed to reflect this.

Installed Versions

INSTALLED VERSIONS ------------------ commit : 67a58cddc2f8c0e30cb0123589947d9b3f073720 python : 3.11.7 python-bits : 64 OS : Darwin OS-release : 23.5.0 Version : Darwin Kernel Version 23.5.0: Wed May 1 20:14:38 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6020 machine : arm64 processor : arm byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8 pandas : 3.0.0.dev0+1239.g67a58cddc2 numpy : 2.1.0.dev0+git20240720.d489c83 pytz : 2024.1 dateutil : 2.9.0.post0 pip : 23.3.1 Cython : None sphinx : None IPython : None adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : None blosc : None bottleneck : None fastparquet : None fsspec : None html5lib : None hypothesis : None gcsfs : None jinja2 : None lxml.etree : None matplotlib : None numba : None numexpr : None odfpy : None openpyxl : None psycopg2 : None pymysql : None pyarrow : None pyreadstat : None pytest : None python-calamine : None pyxlsb : None s3fs : None scipy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None xlsxwriter : None zstandard : None tzdata : 2023.4 qtpy : None pyqt5 : None
sgysherry commented 2 months ago

take

sfc-gh-joshi commented 2 months ago

I found an additional example with groupby(sort=False):

>>> data = {
...     "by": ["male", "male", "female", "male", "female", "male"],
...     "value1": ["low", "medium", "high", "low", "high", "low"],
...     "value2": ["US", "FR", "US", "FR", "FR", "FR"],
... }
>>> pd.DataFrame(data).groupby(by=["by"], sort=False).value_counts()
by      value1  value2
male    low     FR        2
                US        1
        medium  FR        1
female  high    US        1
                FR        1
Name: count, dtype: int64

The female high US 1 row has moved below male medium FR 1 despite appearing earlier in the input frame.

rhshadrach commented 2 months ago

Thanks for the report. DataFrameGroupBy.value_counts was designed to mirror DataFrame.value_counts. In particular, for the sort argument as documented here:

Sort by frequencies when True. Sort by DataFrame column values when False.

The documentation for the groupby case should make this clear. I've reworked this issue as a documentation issue.

KevsterAmp commented 2 months ago

take

sfc-gh-joshi commented 1 month ago

Sort by frequencies when True. Sort by DataFrame column values when False.

What's the expected behavior for when both the groupby's sort argument and the value_counts sort argument are False? Right now it seems to not sort by the column values (as in the example from my original post):

>>> df = pd.DataFrame({"X": ["B", "A", "A", "B"], "Y": [4, 1, 3, -1]})
>>> df.groupby("X", sort=False).value_counts(sort=False)
X  Y 
B   4    1
A   1    1
    3    1
B  -1    1
Name: count, dtype: int64
rhshadrach commented 1 month ago

There seems to be an undesirable disagreement between Series.value_counts and DataFrame.value_counts. For Series.value_counts:

Sort by frequencies when True. Preserve the order of the data when False.

Whereas DataFrame is:

Sort by frequencies when True. Sort by DataFrame column values when False.

It seems to me these should behave the same. I propose to change DataFrame behavior with sort=False should preserve the order of the input data. This is because:

Currently, it appears to me the Series behavior is not readily achievable in groupby. This is because we may need to sort by the groupings but not sort by the other columns in the DataFrame. However, I expect it should not be difficult to accomplish (but am not certain).

If we are to agree that sort=False should preserve the order of the input in all cases, then it appears to me .groupby(..., sort=False).value_counts(sort=False) behavior is currently correct, but .groupby(..., sort=True).value_counts(sort=False) is not.

rhshadrach commented 1 month ago

@pandas-dev/pandas-core - I'd like thoughts on the proposal in the comment immediately above.

Assuming the proposal is acceptable, I think:

warrants making this change as a bugfix and without deprecation.

MarcoGorelli commented 1 month ago

My initial reaction is that I agree

warrants making this change as a bugfix and without deprecation.

sure, so long as it's in 3.0 (I wouldnt' be in favour of backporting to 2.3)

sfc-gh-joshi commented 1 month ago

I found a separate, possibly-related issue: for groupby(sort=False).value_counts(sort=True, normalize=True), normalization is applied after the frequency column is sorted. This isn't a concern for plain DataFrame/Series.value_counts since those normalize against the size of the whole dataset, but the distinction is actually meaningful for GroupBy since it normalizes against the size of each group.

>>> import pandas as pd
>>> data = {"by": ["a", "a", "a", "b", "b", "b", "b", "b", "b"], "values": [1, 1, 0, 1, 1, 1, 0, 0, 0]}
>>> df = pd.DataFrame(data)
>>> df.groupby(by="by", sort=False).value_counts(sort=True, normalize=True)
by  values
b   1         0.500000
    0         0.500000
a   1         0.666667
    0         0.333333
Name: proportion, dtype: float64
>>> df.groupby(by="by", sort=False).value_counts(sort=True, normalize=False)
by  values
b   1         3
    0         3
a   1         2
    0         1
Name: count, dtype: int64

Is this a documentation issue, or a behavioral one? Either way, should I file this as a separate bug, or is this within scope for your proposal @rhshadrach?

rhshadrach commented 1 month ago

@sfc-gh-joshi - the documentation for sort=True states Sort by frequencies. Isn't that the current behavior? In any case, this appears separate to me, can you open a new issue.

sfc-gh-joshi commented 4 weeks ago

the documentation for sort=True states Sort by frequencies.

That's true, it just feels counterintuitive that there's a scenario where sort=True where the column in question is not sorted in the expected fashion. I've filed #59725 for further discussion.

Dr-Irv commented 3 weeks ago

@pandas-dev/pandas-core - I'd like thoughts on the proposal in the comment immediately above.

Assuming the proposal is acceptable, I think:

  • the inconsistency of the API;
  • the expected infrequent use of DataFrame.value_counts (especially since sort=False is not the default);
  • the impact on users code if not changed (would have to be sensitive to row-order); and
  • the ease of changing back to previous behavior (adding .sort_index())

warrants making this change as a bugfix and without deprecation.

+1

jorisvandenbossche commented 3 weeks ago

Agreed that for the above mentioned reasons (https://github.com/pandas-dev/pandas/issues/59307#issuecomment-2302969460), it sounds good to do this as a breaking change without deprecation, and then best done in 3.0.