Grouped operations - Githubissues

sashahafner commented 9 months ago

The Pandas module has this functionality, but it is strange. Here is an example that I spent a lot of time on.

dat['ech4'] = dat.groupby(['reactor']).\
        apply(lambda x: fintegrate(x['day'], x['qch4'])).\
        reset_index(['reactor'], drop = True)

So fintegrate is a function that expects two arguments; otherwise the apply bit could be replaced with a method. The oddest bit is that the grouped operation doesn't return a data frame. Why??? Instead it returns a Series here, and it is the indices in that output that cause problems when trying to add it back to a data frame. It is like the developers thought users would want to display the results in the console and not save them. Strange.

The backslash bit came from a Stack Overflow answer that gave no explanation. It is Python's line-continuation character, which allows splitting the chain across lines at the dot operator.

When a simple built-in method can be used, I think the code is simpler.

means = pd.DataFrame(tot.groupby(['gas', 'temp'])[['ech4', 'logemis']].mean())

Here I have applied mean to two columns in a data frame grouped by two other columns.

There is also an aggregate method (agg for short) in Pandas for this stuff.
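A small sketch of what that could look like, using pandas' named aggregation; the data and the output column names (ech4_mean, logemis_mean) are made up for illustration. With as_index=False the grouping columns come back as ordinary columns, so neither reset_index nor a pd.DataFrame() wrapper is needed.

```python
import pandas as pd

# Made-up data with the column names used above
tot = pd.DataFrame({
    'gas':     ['air', 'air', 'n2', 'n2'],
    'temp':    [10, 10, 20, 20],
    'ech4':    [1.0, 3.0, 2.0, 6.0],
    'logemis': [0.0, 2.0, 1.0, 3.0],
})

# Named aggregation: new_column=(source_column, function)
means = tot.groupby(['gas', 'temp'], as_index=False).agg(
    ech4_mean=('ech4', 'mean'),
    logemis_mean=('logemis', 'mean'),
)
```

The result is a plain data frame with single-level column names chosen in the call itself.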

sashahafner commented 9 months ago

This answer was quite helpful https://stackoverflow.com/questions/34099684/how-to-use-groupby-transform-across-multiple-columns/74555697#74555697

Other answers that suggest splitting the operation are missing the point.

sashahafner commented 8 months ago

Note that I have some examples worked out in https://github.com/AU-BCE-EE/OAC-course-private

sashahafner commented 8 months ago

A few tips:

reset_index() at the end can help in getting reasonable columns.
Still, the output isn't quite a simple data frame; it seems the columns have two levels of names (a MultiIndex).

airw['rem_eff'] = 100 * (1 - airw['mass_tot']['Out'] / airw['mass_tot']['In'])
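How airw was built isn't shown here, so the following is only a guess at a similar situation: a grouped sum followed by unstack, which gives columns with two levels of names. The data and the names raw, filter and pos are made up. The sketch repeats the calculation above and then flattens the column names into single strings.

```python
import pandas as pd

# Made-up long-format data: one row per filter and sampling position
raw = pd.DataFrame({
    'filter':   ['f1', 'f1', 'f2', 'f2'],
    'pos':      ['In', 'Out', 'In', 'Out'],
    'mass_tot': [10.0, 2.5, 8.0, 4.0],
})

# Grouped sum, then move 'pos' from the row index into the columns
airw = raw.groupby(['filter', 'pos'])[['mass_tot']].sum().unstack('pos')
n_levels = airw.columns.nlevels  # number of levels of column names

# Same calculation as above; airw[('mass_tot', 'Out')] selects the same column
airw['rem_eff'] = 100 * (1 - airw['mass_tot']['Out'] / airw['mass_tot']['In'])

# Flatten the two-level names to single strings, then make the index a column
airw.columns = ['_'.join(c).rstrip('_') for c in airw.columns]
airw = airw.reset_index()
```

After flattening, the columns can be used like those of any ordinary data frame, e.g. airw['mass_tot_Out'].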

sashahafner commented 8 months ago

print(airw.keys())

keys() helps a little, but only a little

sashahafner commented 8 months ago

One issue is controlling the name of a new column. Here is an example where the new column is named 0 and it isn't clear to me how it can be simply set.

tot = pd.DataFrame(dat.groupby(['reactor', 'gas', 'temp']).apply(lambda x: mintegrate(x.day, x.qch4, value = 'total'))).reset_index()
tot.rename({0:'ech4'}, axis = 'columns', inplace = True)
tot
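One possible way to set the name directly: when the function returns a single value per group, apply gives back a Series, and the reset_index method of a Series takes a name argument for the values column. In this sketch total_integral is a hypothetical stand-in for mintegrate(..., value = 'total'), and the data are made up.

```python
import numpy as np
import pandas as pd

def total_integral(t, y):
    """Hypothetical stand-in for mintegrate(..., value='total'): trapezoid integral."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(np.diff(t) * (y[1:] + y[:-1]) / 2))

# Made-up data
dat = pd.DataFrame({
    'reactor': [1, 1, 1, 2, 2, 2],
    'gas':     ['air', 'air', 'air', 'n2', 'n2', 'n2'],
    'temp':    [20, 20, 20, 30, 30, 30],
    'day':     [0.0, 1.0, 2.0, 0.0, 1.0, 2.0],
    'qch4':    [1.0, 1.0, 3.0, 2.0, 2.0, 2.0],
})

# One value per group -> Series; name= sets the column name on reset_index
tot = (
    dat.groupby(['reactor', 'gas', 'temp'])[['day', 'qch4']]
       .apply(lambda g: total_integral(g['day'], g['qch4']))
       .reset_index(name='ech4')
)
```

This replaces both the pd.DataFrame() wrapper and the separate rename step.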

sashahafner commented 6 months ago

I see there is an assign method/function that adds columns to data frames and is quite helpful for grouped operations. See this example from Anna's work:

dat = dat.groupby('tank').apply(lambda x: x.assign(emis = si.cumulative_trapezoid(x['flux'], x['time'], initial = 0)))

And some info online:

See this solution https://stackoverflow.com/questions/73309294/how-to-apply-scipy-integrate-cumulative-trapezoid-to-grouped-pandas-dataframe-wi

Presumably assign is the proper way to add columns to data frames, and this would work with mintegrate or any other function that returns an appropriate array as well.

sashahafner / pystupid

Grouped operations #18