Closed bhagerman00 closed 7 years ago
see the warning in the doc note: http://pandas.pydata.org/pandas-docs/stable/groupby.html#flexible-apply
This is an implementation detail: the function will be called multiple times to assess whether it mutates its input and to determine its output shape.
Further, what you are doing is quite inefficient. You need to do less in custom apply functions, or better, simply don't use them.
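A minimal sketch for checking how often the function is actually invoked, assuming a frame shaped like the one in this issue (`df_test` is reconstructed from the printed output; `record_and_mean` and `seen` are hypothetical names). The function only reads its input, so every recorded call reflects real rows of the original frame. The exact number of calls depends on the pandas version, so it is printed rather than assumed.

```python
import pandas as pd

# Assumed reconstruction of the frame from the printed output in this thread.
df_test = pd.DataFrame(
    {"Data": [1, 2, 3], "Group": ["Group1", "Group2", "Group2"]}
)

seen = []  # row labels of every frame passed to the function


def record_and_mean(group):
    # Read-only: record which rows arrived, then return a scalar.
    seen.append(list(group.index))
    return group["Data"].mean()


# Selecting [["Data"]] keeps the grouping column out of the frames passed in.
result = df_test.groupby("Group")[["Data"]].apply(record_and_mean)

print(len(seen), "calls for", df_test["Group"].nunique(), "groups")
print(seen)
print(result)
```

Because nothing is mutated inside the function, any extra call only repeats rows that exist in the original frame.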
I am concerned about the print output:
Data Group Mean
1 2 Group2 1.0
2 3 Group2
This dataframe doesn't actually exist in the GroupBy object list but it is still being passed to the custom function.
I encounter datasets in which the use of custom apply functions is very helpful and straightforward. If custom apply functions are not recommended, I would be happy to switch to a better alternative. What would you recommend as an alternative?
You are violating the guarantee not to change the internal structure. We do allow this somewhat, but you are doing very odd (and completely non-performant) things.
simply
In [4]: df_test.groupby('Group').mean()
Out[4]:
Data
Group
Group1 1.0
Group2 2.5
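If the goal is a per-row `Mean` column rather than one row per group, one option is `transform`, which broadcasts the group result back onto the original index. This is a sketch under the assumption that `df_test` looks like the frame printed in this thread; it is not taken from the original code sample.

```python
import pandas as pd

# Assumed reconstruction of the frame from the printed output in this thread.
df_test = pd.DataFrame(
    {"Data": [1, 2, 3], "Group": ["Group1", "Group2", "Group2"]}
)

# transform("mean") returns one value per original row, aligned on the index,
# so it can be assigned directly as a new column without a custom function.
df_test["Mean"] = df_test.groupby("Group")["Data"].transform("mean")

print(df_test)
```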
Appreciate the replies. I had seen this warning previously:
"In the current implementation apply calls func twice on the first group"
But if this is the known behavior I'm encountering, wouldn't there be only 3 print statements, not the resulting 5? Something like the following:
Data Group
0 1 Group1
Data Group
0 1 Group1
Data Group
1 2 Group2
2 3 Group2
In the above case it's clear that the DataFrames being passed as arguments are accurate reflections of the groups. The print statement that @ankur-gupta pointed out shouldn't exist or be passed through the function. This makes it difficult to evaluate or debug. I understand that there is an explicit .mean() call that would be more appropriate for this case; I included the custom function as a basic example of the behavior.
Anyone is welcome to debug this and see whether something odd is actually going on. This user function violates some guarantees (meaning you are modifying the state of the internal functions in a non-detectable way).
I get that you think you actually want to do this. If you can find a better way, great. Please post it and I'll reopen.
For me, adding a new column is an extremely common use case. The content of the new column has a very complicated relationship with the content of the smaller "grouped" dataframe. For example, if there are more than 50% missing values in existing columns of the "grouped" dataframe, then I need to add a new column which has a default value, but if there are fewer missing values then I need to perform some complicated statistical operation such as interpolation. This is just an example, but there are more complicated tasks that I need to perform. So, I find that custom functions are very straightforward to use and debug, even if they're not the most efficient.
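The rule described above (a default value when more than half of a group is missing, interpolation otherwise) can be sketched without modifying the group inside the function. Everything here is illustrative: the column names `Value` and `Filled`, the helper `fill_group`, the constant `DEFAULT_VALUE`, and the sample data are assumptions, and linear interpolation stands in for whatever statistical operation is really needed.

```python
import pandas as pd

DEFAULT_VALUE = 0.0  # hypothetical default for mostly-missing groups

# Hypothetical sample data: group A has 1 of 3 values missing, group B has 2 of 3.
df = pd.DataFrame(
    {
        "Group": ["A", "A", "A", "B", "B", "B"],
        "Value": [1.0, None, 3.0, None, None, 5.0],
    }
)


def fill_group(values):
    # Return a new Series of the same length; the input is left untouched.
    if values.isna().mean() > 0.5:
        return pd.Series(DEFAULT_VALUE, index=values.index)
    return values.interpolate()


# transform aligns the per-group results back onto the original rows.
df["Filled"] = df.groupby("Group")["Value"].transform(fill_group)

print(df)
```

Returning a fresh Series and assigning the column outside the function keeps the grouped frames unchanged, which is the guarantee discussed in this thread.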
But I am happy to use an alternative to custom functions that performs the same complicated tasks.
Your 'way' of doing things is not idiomatic pandas at all. Please see the documentation here
Thanks. I understand that what I am doing is not idiomatic in pandas. Coming from an R (plyr::ddply) background, I attempted to use apply in the exact same manner, which is not correct.
Thinking more about my use cases, I can simply add an empty new column to the whole dataframe and then simply fill in the values within the custom function. This way the custom function won't violate the structure.
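One possible reading of this approach, sketched under assumptions: the column is created on the whole frame first, and the values are then written from an explicit loop over the groups rather than from inside `apply`, so the frames handed out by the groupby are never modified. `df_test` is reconstructed from the output printed in this thread, and the mean is only a placeholder for the real per-group computation.

```python
import pandas as pd

# Assumed reconstruction of the frame from the printed output in this thread.
df_test = pd.DataFrame(
    {"Data": [1, 2, 3], "Group": ["Group1", "Group2", "Group2"]}
)

# Create the new column up front so the frame's structure is fixed.
df_test["Mean"] = float("nan")

# Iterate over the groups and write results back by row label.
for name, group in df_test.groupby("Group"):
    df_test.loc[group.index, "Mean"] = group["Data"].mean()

print(df_test)
```

An explicit loop is slower than built-in aggregations, but each `group` printed inside it corresponds exactly to one group, which makes interactive debugging easier.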
Code Sample, a copy-pastable example if possible
Surprising Print Output
output of pd.show_versions()
Summary
The final output is correct, but the intermediate argument calls appear to be referencing incorrect data frames. There are two groups to iterate over, but 5 statements are printed, one of which contains a new column called 'Mean' and is printed under Group2, but the mean value is for Group1. This seems to occur when a pandas series object is returned, and makes it difficult to interactively evaluate/debug groupby calls using custom arguments. The printed DF's should match the DF for the given group being processed. Is this known behavior?