techascent / tech.ml.dataset

A Clojure high performance data processing system
Eclipse Public License 1.0

Aggregate following group-by #209

Closed. alex314159 closed this issue 3 years ago.

alex314159 commented 3 years ago

Hello,

A common data exploration operation following group-by is to get aggregates by group.

Example dataset:

(ds/select-columns x ["Issuer" "Index-Contribution" "Average-Index-Weight"])
=> _unnamed [3199 3]:

| Issuer | Index-Contribution | Average-Index-Weight |
|--------|--------------------|----------------------|
| AACTEC |         0.00038931 |           0.00943500 |
| ABCIHL |         0.00023518 |           0.00320986 |
| ABENGO |         0.01471959 |           0.10804269 |
| ABMMIJ |         0.00416985 |           0.04762189 |
| ABQKQD |         0.00838545 |           0.09995164 |
| ABQKQD |         0.00429965 |           0.09992776 |
| ABQKQD |         0.00053091 |           0.03451082 |
| ABQKQD |         0.00048035 |           0.00737886 |
| ACCESS |         0.00028782 |           0.04715352 |

The only function I found that does something similar is:

(ds-reduce/group-by-column-agg
  "Issuer"
  {"Issuer" (ds-reduce/first-value "Issuer") "Index-Contribution" (ds-reduce/sum "Index-Contribution") "Average-Index-Weight" (ds-reduce/sum "Average-Index-Weight")}
  [x])

This is likely very efficient, but it isn't dataset-first (it takes a sequence of datasets), and the grouping and aggregation options are limited (grouping by a single column only, with the functions in ds-reduce). Would something along these lines be useful? I realize this is probably a poor implementation since it goes through plain maps:

(defn group-agg
  "Group `dataset` by `grp-fn`, then apply `colfn` to each of `columns`
  within each group, yielding one result row per group."
  [dataset grp-fn columns colfn]
  (ds/->dataset
    (into [] (for [[id grp] (ds/group-by dataset grp-fn)]
               (into {:group-id id} (for [c columns] [c (colfn (grp c))]))))))

;example call
(group-agg x #(get % "Issuer") ["Index-Contribution" "Average-Index-Weight"] dfn/sum)

columns and colfn could actually be combined into a map of {col-output-name [col-input-name colfn]}, although this adds complication.

Happy to try a PR myself if you agree and can recommend where it should sit. Thanks,

cnuernber commented 3 years ago

I think this formulation is totally fine and if people want more perf then they can move to group-by-column-agg.

Let's just talk through the interface a bit. This assumes you want one result column for each listed input column, and that colfn is handed one column at a time. Is it likely you'll want a result column computed from more than one source column? I guess you could then create a new column containing the tuple before the aggregation. That trick also works with the high-performance agg pathway, but there a result column can additionally depend on multiple input columns directly.

The tension I see is how to specify the grouping fn, which in your case is completely general and based on the lightweight map structures, and then how to specify the result columns, their dependencies, and the handling of missing values. Clojure people like to work in row-major (sequence-of-maps) abstractions, but to deal with missing values as a set of indexes, for instance, you need a column-major formulation, so passing one or more columns to the agg fn makes some sense.

@genmeblog - What does tablecloth have in this regard?

alex314159 commented 3 years ago

Thanks for the feedback Chris - you raise important points regarding a colfn over more than one source column, and regarding missing values.

I personally don't have a big problem with, as you say, creating a new column with ds/column-map before the aggregation when you want it to depend on several original columns.
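For the record, that pre-aggregation step can be sketched like this (a sketch only: "Contribution-Per-Weight" is a hypothetical derived-column name, and `dfn//` is assumed to be the elementwise division in tech.v3.datatype.functional):

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype.functional :as dfn])

;; derive one column from two inputs up front, then aggregate it
;; like any other column via the group-agg sketch above
(let [derived (ds/add-or-update-column
                x "Contribution-Per-Weight"
                (dfn// (x "Index-Contribution")
                       (x "Average-Index-Weight")))]
  (group-agg derived #(get % "Issuer")
             ["Contribution-Per-Weight"] dfn/sum))
```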

Missing values are a bit trickier. Again, personally I'm OK with what's essentially a reduce operation failing on missing data, but I can also see the argument for ignoring missing data. The strategy should probably be an optional argument. Looking at what ds/descriptive-stats does, seeing :n-valid for instance is quite useful, and it's something I'd like to see in the output of group-agg if missing values were to be ignored.
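One lightweight way to get the "ignore missing" behaviour without changing the interface (a sketch, assuming tech.v3.dataset/drop-missing is available) is to drop incomplete rows before grouping:

```clojure
(require '[tech.v3.dataset :as ds])

;; drop rows containing missing values first, then aggregate as usual;
;; an :n-valid style count could be recovered by comparing row counts
;; before and after the drop
(-> x
    (ds/drop-missing)
    (group-agg #(get % "Issuer")
               ["Index-Contribution" "Average-Index-Weight"]
               dfn/sum))
```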

In the end it's about the balance between genericity and ease of use. Looking at pandas, df.groupby(['Index','Year'])['PnL'].sum() isn't generic, but day to day it's very handy!

cnuernber commented 3 years ago

Totally agreed. We can come up with a similarly short transform. That syntax is interesting: in that case sum returns a dataset, while in general sum returns a scalar.

cnuernber commented 3 years ago

Totally general I think would mean that colfn is actually dsfn - it takes a dataset and returns a scalar. Then you can name the result column however you like and the arguments would be something like a map of result-column-name->dsfn. The common case for that is something like what you typed where you just want mean or sum across a few columns and there is a 1->1 mapping of result-column to input column.

The python syntax really matches tablecloth's grouped dataset concept.

alex314159 commented 3 years ago

Just looked at tablecloth and they have a very similar process-grouped-data function - duh!

You're right about the generic case, I think the following works:

(defn group-agg
  "Group `dataset` by `grp-fn`, then build one result row per group by
  applying each dsfn in `aggmap` ({result-column-name -> dsfn}) to the
  group's dataset."
  [dataset grp-fn aggmap]
  (ds/->dataset
    (into [] (for [[id grp] (ds/group-by dataset grp-fn)]
               (into {:group-id id} (for [[c dsfn] aggmap] [c (dsfn grp)]))))))

(defn group-agg-by-column
  "Convenience wrapper: apply the same `col-fn` to each of `columns`."
  [dataset grp-fn columns col-fn]
  (group-agg dataset grp-fn (into {} (for [c columns] [c #(col-fn (% c))]))))

;Examples:
(group-agg x #(get % "Issuer") {:rows ds/row-count :cols ds/column-count})
(group-agg-by-column x #(get % "Issuer") ["Average-Index-Weight" "Index-Contribution"] dfn/sum)

;and the data exploration workflow becomes quite natural
(-> raw-data
    (clean-data)
    (add-features)
    (group-agg-by-column #(get % "Issuer")
                         ["Average-Index-Weight" "Index-Contribution"]
                         dfn/sum))
There is the middle pathway where you want a different function for each column, but in that case you likely also want a different name in the output (to indicate that one aggregate is a sum and another a standard deviation, for instance). That would be quite verbose anyway, so maybe it stays in the first, generic function.
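That middle pathway still goes through the generic form, at the cost of verbosity (a sketch: "IC-sum"/"AIW-sd" are hypothetical output names, and dfn/standard-deviation is assumed from tech.v3.datatype.functional):

```clojure
;; a different aggregate per column, with result names indicating
;; which function produced which output
(group-agg x #(get % "Issuer")
           {"IC-sum" #(dfn/sum (% "Index-Contribution"))
            "AIW-sd" #(dfn/standard-deviation (% "Average-Index-Weight"))})
```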

genmeblog commented 3 years ago

You can safely mix tablecloth and tech.ml.dataset with (almost) no harm. The group-by -> aggregate path is based on what is available in R/Python: https://scicloj.github.io/tablecloth/index.html#Aggregate

cnuernber commented 3 years ago

aggregate-columns appears to be very close:

(-> x
    (api/group-by ["Issuer"])
    (api/aggregate-columns ["Index-Contribution" "Average-Index-Weight"] dfn/sum))

@alex314159 - What do you think? It isn't as dense as the python or your example api but it seems to me to be close.

alex314159 commented 3 years ago

Yes, I think that syntax is quite clean too; I should have spent more time looking at tablecloth's API! Thanks a lot.