Closed: alex314159 closed this issue 3 years ago
I think this formulation is totally fine and if people want more perf then they can move to group-by-column-agg.
Let's just talk through the interface a bit. This assumes that you want one result column for each listed input column and that you will hand `colfn` a column at a time. Is it likely that you will want a result column that is the result of running `colfn` on more than one source column? I guess then you could create a new column that was the tuple before the aggregation -- that is also true of the high-performance agg pathway, but in the high-performance pathway you can have a result column that depends on multiple input columns.
The tension I see is how to specify the grouping fn which in your case is completely general and based on the lightweight map structures. Then how to specify the result columns and their dependencies along with how to handle missing values. Clojure people like to work in row-major (sequence of maps) abstractions but for instance to deal with missing values as a set of indexes you have to have a column-major formulation so passing one or more columns to the agg fn makes some sense.
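To make the row-major vs. column-major tension concrete, here is a minimal plain-Clojure sketch (this is not the tech.ml.dataset API; the column representation here is invented for illustration) in which a column carries its missing indexes alongside its data, so an aggregation fn can skip them cheaply:

```clojure
;; Hypothetical column-major representation: data plus a set of missing indexes.
(def pnl-column {:data [1.0 2.0 0.0 4.0] :missing #{2}})

(defn sum-valid
  "Sum a column, skipping entries whose index is in :missing."
  [{:keys [data missing]}]
  (->> (map-indexed vector data)
       (remove (fn [[i _]] (missing i)))
       (map second)
       (reduce +)))

;; In a row-major (seq-of-maps) view the missing entry would just be a nil
;; buried in one map, and every aggregation fn has to re-discover it.
(sum-valid pnl-column) ;; => 7.0
```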
@genmeblog - What does tablecloth have in this regard?
Thanks for the feedback Chris, you raise an important point re having a colfn on more than one source column, and the missing values.
I personally don't have a big problem with, as you say, creating a new column with `ds/column-map` before the aggregation when you want it to depend on several original columns.
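That pre-aggregation step can be sketched in plain Clojure over a seq of maps (this is not `ds/column-map` itself; the rows and column names are made up):

```clojure
;; Derive a :contribution column from two source columns before grouping,
;; so the aggregation fn only ever needs to see one column.
(def rows [{:weight 2 :return 5}
           {:weight 8 :return 1}])

(def with-contribution
  (map #(assoc % :contribution (* (:weight %) (:return %))) rows))

(mapv :contribution with-contribution) ;; => [10 8]
```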
Missing values are a bit more tricky - again, personally, I'm OK with what's essentially a reduce operation failing on missing data, but I can also see the argument for missing data to be ignored. Probably the strategy should be an optional argument. Looking at what `ds/descriptive-stats` does, it's quite useful to see `:n-valid` for instance, and it would probably be something I'd like to see in the output of `group-agg` if missing values were to be ignored.
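As a plain-Clojure sketch of that idea (hypothetical data, nils standing in for missing values; not the `ds/descriptive-stats` implementation), a per-group aggregate can report `:n-valid` alongside the result when missing values are skipped:

```clojure
;; Aggregate a column per group, ignoring nils, and report :n-valid.
(def rows [{:issuer "A" :pnl 1.0}
           {:issuer "A" :pnl nil}
           {:issuer "B" :pnl 2.0}
           {:issuer "B" :pnl 3.0}])

(defn agg-with-n-valid [rows key-fn col]
  (into {}
        (for [[k grp] (group-by key-fn rows)
              :let [valid (remove nil? (map col grp))]]
          [k {:sum (reduce + 0.0 valid) :n-valid (count valid)}])))

(agg-with-n-valid rows :issuer :pnl)
;; => {"A" {:sum 1.0, :n-valid 1}, "B" {:sum 5.0, :n-valid 2}}
```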
In the end it's about balance between being generic and ease of use. Looking at pandas, `df.groupby(['Index','Year'])['PnL'].sum()` isn't generic but is very handy day to day!
Totally agreed. We can come up with a similarly short transform. That syntax is interesting; in that case `sum` returns a dataset, but I imagine in general `sum` returns a scalar.
Totally general, I think, would mean that `colfn` is actually `dsfn` - it takes a dataset and returns a scalar. Then you can name the result column however you like, and the arguments would be something like a map of `result-column-name -> dsfn`. The common case for that is something like what you typed, where you just want `mean` or `sum` across a few columns and there is a 1->1 mapping of result column to input column.
The python syntax really matches tablecloth's grouped dataset concept.
Just looked at tablecloth and they have a very similar `process-group-data` function - duh!
You're right about the generic case, I think the following works:
```clojure
(defn group-agg [dataset grp-fn aggmap] ; aggmap is a map {result-column-name -> dsfn}
  (ds/->dataset
   (into [] (for [[id grp] (ds/group-by dataset grp-fn)]
              (into {:group-id id} (for [[c dsfn] aggmap] [c (dsfn grp)]))))))

(defn group-agg-by-column [dataset grp-fn columns col-fn]
  (group-agg dataset grp-fn (into {} (for [c columns] {c #(col-fn (% c))}))))

;; Examples:
(group-agg x #(get % "Issuer") {:rows ds/row-count :cols ds/column-count})
(group-agg-by-column x #(get % "Issuer") ["Average-Index-Weight" "Index-Contribution"] dfn/sum)
```
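The semantics of `group-agg` can be checked on plain seq-of-maps data, without tech.ml.dataset (the rows and column names below are invented for illustration; `group-agg*` is a stand-in, not the dataset-backed version above):

```clojure
;; Plain-Clojure version of the same idea: group, then apply each
;; result-column fn to the whole group.
(defn group-agg* [rows grp-fn aggmap]
  (vec (for [[id grp] (group-by grp-fn rows)]
         (into {:group-id id}
               (for [[c f] aggmap] [c (f grp)])))))

(def rows [{:issuer "A" :pnl 1.0} {:issuer "A" :pnl 2.0} {:issuer "B" :pnl 4.0}])

(group-agg* rows :issuer {:rows      count
                          :total-pnl #(reduce + (map :pnl %))})
;; => [{:group-id "A", :rows 2, :total-pnl 3.0}
;;     {:group-id "B", :rows 1, :total-pnl 4.0}]
```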
and the data exploration workflow becomes quite natural:

```clojure
(-> raw-data
    (clean-data)
    (add-features)
    (group-agg-by-column #(get % "Issuer") ["Average-Index-Weight" "Index-Contribution"] dfn/sum))
```
There is the middle pathway where you want a different function for each column, but in that case you likely also want a different name in the output (to indicate that one aggregate is sum and the other stdev for instance). It would be quite verbose anyway so maybe that stays in the first, generic function.
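For that middle pathway the generic map form would look something like this (a plain-Clojure sketch: `sum-col` and the column names are invented, and plain reductions stand in for `dfn/sum` etc.; `grp` is one group as a seq of row maps):

```clojure
;; Different aggregate per column, each under its own output name.
(defn sum-col [col grp] (reduce + (map col grp)))

(def aggmap
  {:pnl-sum    (partial sum-col :pnl)
   :weight-max (fn [grp] (apply max (map :weight grp)))})

(def grp [{:pnl 1.0 :weight 0.2} {:pnl 2.0 :weight 0.8}])

(into {} (for [[name f] aggmap] [name (f grp)]))
;; => {:pnl-sum 3.0, :weight-max 0.8}
```

Verbose, as noted, so it arguably belongs in the generic `group-agg` rather than in sugar.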
You can safely mix tablecloth and tech.ml.dataset with (almost) no harm. The group-by -> aggregate path is based on what is available in the R/Python cases. https://scicloj.github.io/tablecloth/index.html#Aggregate
`aggregate-columns` appears to be very close:
```clojure
(-> x
    (api/group-by ["Issuer"])
    (api/aggregate-columns ["Index-Contribution" "Average-Index-Weight"] dfn/sum))
```
@alex314159 - What do you think? It isn't as dense as the python or your example api but it seems to me to be close.
Yes, I think that syntax is quite clean too - I should have spent more time looking at tablecloth's API! Thanks a lot.
Hello,

A common data exploration operation following `group-by` is to get aggregates by group. Example dataset:

The only function I found that does something similar is:

This is likely very efficient but not dataset-first (because it can use several datasets), and the grouping and functions are limited (just one column, and functions in `ds-reduce`). Would something along these lines be useful? I realize this is probably a terrible implementation as it goes through basic maps. `columns` and `colfn` could actually be a map of `{col-output-name [col-input-name colfn]}`, although this adds complication. Happy to try a PR myself if you agree and can recommend where it should sit. Thanks,
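That `{col-output-name [col-input-name colfn]}` shape could be sketched like this in plain Clojure (data, names, and `apply-spec` are invented for illustration; this is not the tech.ml.dataset API):

```clojure
;; Each output column names its input column and the fn to apply to it.
(def spec {:total-pnl   [:pnl    #(reduce + %)]
           :mean-weight [:weight #(/ (reduce + %) (count %))]})

(defn apply-spec [spec grp]               ; grp is one group as a seq of row maps
  (into {}
        (for [[out [in f]] spec]
          [out (f (map in grp))])))

(def grp [{:pnl 1.0 :weight 0.5} {:pnl 3.0 :weight 1.5}])

(apply-spec spec grp)
;; => {:total-pnl 4.0, :mean-weight 1.0}
```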