pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Groupby apply using a custom function displays incorrect arguments (DF) #13913

Closed bhagerman00 closed 7 years ago

bhagerman00 commented 7 years ago

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

def func3(df):
    # Print the group exactly as groupby.apply passes it in
    print(df)
    # Note: this adds a column to the passed-in group in place
    df['Mean'] = np.mean(df['Data'].values)
    out = df[['Mean', 'Group']].iloc[0, :].copy(deep=True)
    return out

var1 = [1, 2, 3]
grp_var = ['Group1', 'Group2', 'Group2']
df_test = pd.DataFrame.from_dict({'Data': var1, 'Group': grp_var})

# Groupby test
df_grp = df_test.groupby(['Group'])
df_out = df_grp.apply(func3)

Surprising Print Output

   Data   Group
0     1  Group1
   Data   Group
0     1  Group1
   Data   Group  Mean
1     2  Group2   1.0
2     3  Group2
   Data   Group
0     1  Group1
   Data   Group
1     2  Group2
2     3  Group2

In [31]: df_out
Out[31]:
        Mean   Group
Group
Group1   1.0  Group1
Group2   2.5  Group2

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.0
statsmodels: 0.6.1
xarray: 0.7.2
IPython: 4.2.0
sphinx: None
patsy: 0.4.1
dateutil: 2.5.2
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: None
numexpr: 2.6.1
matplotlib: 1.5.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

Summary

The final output is correct, but the intermediate calls appear to receive incorrect data frames. There are two groups to iterate over, but five frames are printed. One of them contains a new column called 'Mean' and is printed for Group2, yet the mean value it shows is Group1's. This seems to occur when a pandas Series object is returned, and it makes it difficult to interactively evaluate or debug groupby calls that use custom functions. The printed DFs should match the DF for the group being processed. Is this known behavior?

jreback commented 7 years ago

see the warning in the doc note: http://pandas.pydata.org/pandas-docs/stable/groupby.html#flexible-apply

This is an implementation detail: the function will be called multiple times to assess whether it mutates its input and what shape its output has.

Further, what you are doing is quite inefficient. You need to do less in custom apply functions, or better, simply don't use them.

ankur-gupta commented 7 years ago

I am concerned about the print output:

   Data   Group  Mean
1     2  Group2   1.0
2     3  Group2

This dataframe doesn't actually exist in the GroupBy object list but it is still being passed to the custom function.

I encounter datasets in which the use of custom apply functions is very helpful and straightforward. If custom apply functions are not recommended, I would be happy to switch to a better alternative. What would you recommend as an alternative?

jreback commented 7 years ago

You are violating the guarantee not to change the internal structure. We do allow this somewhat, but you are doing very odd (and completely non-performant) things.

simply

In [4]: df_test.groupby('Group').mean()
Out[4]: 
        Data
Group       
Group1   1.0
Group2   2.5

bhagerman00 commented 7 years ago

Appreciate the replies. I had seen this warning previously,

In the current implementation apply calls func twice on the first group

But if what I'm encountering is this known behavior, wouldn't there be only 3 print statements rather than the resulting 5? Something like the following:

   Data   Group
0     1  Group1
   Data   Group
0     1  Group1
   Data   Group
1     2  Group2
2     3  Group2

In the above case it's clear that the DFs being passed as arguments are accurate reflections of the groups. The frame that @ankur-gupta pointed out shouldn't exist or be passed to the function, and it makes the call difficult to evaluate or debug. I understand that an explicit .mean() call would be more appropriate for this case; I included the custom function as a basic example of the behavior.
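
A minimal sketch of one way to inspect groups without the function modifying what it receives, assuming the goal is only to record each group and compute its mean. The names `func_no_mutation` and `seen_columns` are made up for illustration, and selecting `[['Data']]` before `apply` is a choice made here so the sketch does not depend on how a given pandas version treats the grouping column. How many times the function is called remains an implementation detail, so nothing below relies on the call count.

```python
import pandas as pd

df_test = pd.DataFrame({'Data': [1, 2, 3],
                        'Group': ['Group1', 'Group2', 'Group2']})

seen_columns = []  # hypothetical log of the columns each call received

def func_no_mutation(group):
    # Record what was passed in, instead of relying on print order
    seen_columns.append(list(group.columns))
    # Work on a private copy so the object owned by groupby is untouched
    local = group.copy()
    local['Mean'] = local['Data'].mean()
    # Return a one-element Series; apply stacks these into one row per group
    return local[['Mean']].iloc[0]

df_out = df_test.groupby('Group')[['Data']].apply(func_no_mutation)
print(df_out)
print(seen_columns)
```

Because the column is added to a copy, no call should ever see a frame that already carries 'Mean'.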

jreback commented 7 years ago

Anyone is welcome to debug and see whether something odd is actually going on. This user function violates some guarantees (meaning you are modifying the state of the internal functions in a non-detectable way).

I get that you think you actually want to do this. If you can find a better way, great. Please post and I'll reopen.

ankur-gupta commented 7 years ago

For me, adding a new column is an extremely common use case. The content of the new column has a very complicated relationship with the content of the smaller "grouped" dataframe. For example, if there are more than 50% missing values in existing columns of the "grouped" dataframe, then I need to add a new column which has a default value, but if there are fewer missing values then I need to perform some complicated statistical operation such as interpolation. This is just an example, but there are more complicated tasks that I need to perform. So, I find that custom functions are very straightforward to use and debug, even if they're not the most efficient.
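
As a rough illustration of the missing-value case described above, here is a sketch that builds the new column with `transform` rather than by adding a column inside the function. The sample data, the name `fill_group`, the `DEFAULT_VALUE` constant, and the choice of linear interpolation are all assumptions made for the example, not something taken from a real dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Data': [1.0, np.nan, 3.0, np.nan, np.nan, 5.0],
})

DEFAULT_VALUE = 0.0  # hypothetical default for mostly-missing groups

def fill_group(values):
    # values is the 'Data' column of a single group (a Series)
    if values.isna().mean() > 0.5:
        # More than half missing: use the default for the whole group
        return pd.Series(DEFAULT_VALUE, index=values.index)
    # Otherwise fill the gaps by linear interpolation
    return values.interpolate(limit_direction='both')

# transform returns a result aligned to the original rows
df['Filled'] = df.groupby('Group')['Data'].transform(fill_group)
print(df)
```

The function only reads its input and returns a new Series, so the grouped frame itself is never changed.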

But I am happy to use an alternative to custom functions that performs the same complicated tasks.

jreback commented 7 years ago

Your 'way' of doing things is not idiomatic pandas at all. Please see the documentation here.

ankur-gupta commented 7 years ago

Thanks. I understand that what I am doing is not idiomatic in pandas. Coming from an R (plyr::ddply) background, I attempted to use apply in the exact same manner, which is not correct.

Thinking more about my use cases, I can simply add an empty new column to the whole dataframe and then simply fill in the values within the custom function. This way the custom function won't violate the structure.
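
For the sample data from this issue, a sketch of that idea could look like the following: the column is added at the level of the whole dataframe and the per-group values come from `transform`. Here `transform('mean')` merely stands in for whatever custom logic is needed, and `df_with_mean` is a name chosen for the example.

```python
import pandas as pd

df_test = pd.DataFrame({'Data': [1, 2, 3],
                        'Group': ['Group1', 'Group2', 'Group2']})

# Per-group mean broadcast back to every row; no group is modified
group_mean = df_test.groupby('Group')['Data'].transform('mean')

# assign returns a new frame, leaving df_test as it was
df_with_mean = df_test.assign(Mean=group_mean)
print(df_with_mean)
```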