pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ValueError in series.py when using transform on groupby object #26969

Open PeterJPRoche opened 5 years ago

PeterJPRoche commented 5 years ago

I am seeing a ValueError when running the following example code:

import pandas as pd
import numpy as np

if __name__ == '__main__':

    print(pd.__version__)
    print(np.__version__)

    df = pd.DataFrame({'ID': [1,1,2,3,3], 'VALUE': [110,110,200,300,301]})
    dfg = df.groupby('ID')['VALUE']
    print(dfg.get_group(1))

    df['MODE_VALUE'] = df.groupby('ID')['VALUE'].transform(lambda x: x.mode())
    print(df['MODE_VALUE'])

The output is:

0.24.2
1.15.4
0    110
1    110

Traceback (most recent call last):
...
File "..../models/testing/test_env.py", line 13, in <module>
    df['MODE_VALUE'] = df.groupby('ID')['VALUE'].transform(lambda x: x.mode())
  File "..../lib/python3.6/site-packages/pandas/core/groupby/generic.py", line 941, in transform
    s = klass(res, indexer)
  File "..../lib/python3.6/site-packages/pandas/core/series.py", line 255, in __init__
    .format(val=len(data), ind=len(index)))
ValueError: Length of passed values is 1, index implies 2

However, the index and data being checked in series.py (line 255) are:

Name: VALUE, dtype: int64
[110]
Int64Index([0, 1], dtype='int64')

There only seems to be one value [110] in the data array, not two [110,110]. I would have expected the transform to operate on a series with the two values.

The code works fine (as I expected) in my previous version of pandas, 0.22.0. I noticed the issue only after updating to pandas 0.24.2. The output under 0.22.0 is:

0.22.0
1.15.4
0    110
1    110
Name: VALUE, dtype: int64
0    110
1    110
2    200
3    300
4    301
Name: MODE_VALUE, dtype: int64

I am not sure if this is expected behaviour or a bug, so I am posting here to get some help! There are features in 0.24.2 that I would like to use (e.g. DataFrame.to_numpy()), but I am stuck on this breaking change!

Thanks

jbrockmendel commented 5 years ago

Thanks for the report; this is definitely a bug. It looks like Series.mode() is returning a single-entry Series (groupby.generic L1011) while SeriesGroupBy.transform expects either a scalar or an array of the same length as the calling Series.
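The mismatch described above can be sketched as follows. This is only a possible user-side workaround, not a fix in pandas itself, and it assumes that taking the first (smallest) mode of each group is acceptable. Note that for ID 3 it yields 300 for both rows, which differs from the 0.22.0 output shown in the report.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 3, 3], 'VALUE': [110, 110, 200, 300, 301]})

# Series.mode() returns a Series even when there is only one mode,
# so for ID 1 the result has length 1 while the group has length 2.
group_1 = df.loc[df['ID'] == 1, 'VALUE']
group_1_mode = group_1.mode()
print(len(group_1), len(group_1_mode))

# Assumption: the first (smallest) mode is an acceptable summary.
# Reducing to a scalar lets transform broadcast it over each group.
df['MODE_VALUE'] = df.groupby('ID')['VALUE'].transform(lambda x: x.mode().iloc[0])
print(df['MODE_VALUE'].tolist())
```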

WillAyd commented 5 years ago

I'm not as sure about this. mode is not like other reduction operations because it doesn't necessarily reduce down to a scalar. In your example both 300 and 301 would be the mode for ID 3.

What are you expecting to have happen here?
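For reference, a small sketch of the point about mode not always reducing to a scalar, using the ID 3 values from the report:

```python
import pandas as pd

# ID 3 from the report: both values occur once, so both are modes.
tied = pd.Series([300, 301], name='VALUE')
tied_modes = tied.mode()
print(tied_modes.tolist())

# ID 1 from the report: a single mode, still returned as a Series.
single = pd.Series([110, 110], name='VALUE')
single_modes = single.mode()
print(single_modes.tolist())
```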

PeterJPRoche commented 5 years ago

Hi, sorry, just coming back to this now.

In the case of a series that has two or more modes, the distribution is multimodal; it does not have a single mode. It is up to the user to decide how they wish to handle that in their application/problem at hand. I guess it would be neat if it could return the array of modes when there is more than one.

With regard to the issue above, I would just have expected the same behavior/answers as what 0.22.0 gave. Why the change of behavior from 0.22.0 to 0.24.2? Incidentally, I happened to notice this same issue is present in 0.25.0.

TomAugspurger commented 5 years ago

I don't think we would want to return an array of modes when there are multiple. That would mean you sometimes return an object dtype and sometimes a numeric dtype, depending on the data.
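A rough illustration of the dtype concern, assuming a hypothetical helper (mode_or_modes is not a pandas function) that returns a scalar for a single mode and a list otherwise:

```python
import pandas as pd

def mode_or_modes(values):
    """Hypothetical helper: scalar if there is one mode, list if several."""
    modes = values.mode()
    return modes.iloc[0] if len(modes) == 1 else modes.tolist()

# Data from the report: ID 3 has two modes (300 and 301).
df = pd.DataFrame({'ID': [1, 1, 2, 3, 3], 'VALUE': [110, 110, 200, 300, 301]})
mixed = pd.Series([mode_or_modes(g) for _, g in df.groupby('ID')['VALUE']])
print(mixed.dtype)

# Same shape of data, but every group now has exactly one mode.
df_single = pd.DataFrame({'ID': [1, 1, 2, 3, 3], 'VALUE': [110, 110, 200, 300, 300]})
numeric = pd.Series([mode_or_modes(g) for _, g in df_single.groupby('ID')['VALUE']])
print(numeric.dtype)
```

The result dtype changes between the two runs purely because of the values in the data.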

PeterJPRoche commented 5 years ago

I would agree with @TomAugspurger about the return types - a mixed bag of return types could be problematic...

I think there are two issues here:

  1. Why was the ValueError raised? That was my original issue. Is it a bug, or is that the intended behavior when doing a groupby -> transform -> mode? Is it understood why it changed between pandas versions?
  2. What gets returned when the mode is multi-modal? I think this is a separate but related issue that needs to be addressed. In my code snippet above, the multi-modal case of ID=3 originally returned the individual values that were associated with that ID.

TomAugspurger commented 5 years ago

I'm not sure. Are you interested in debugging those?

We'll still need a way to handle multi-modal groups. I'm not sure what's best here. We don't want data-dependent behavior (i.e. only raising when there happens to be multiple modes). But I'm not sure we want a different default way of handling multiple modes.