scicloj / tablecloth

Dataset manipulation library built on the top of tech.ml.dataset
https://scicloj.github.io/tablecloth
MIT License

Lift column ops to the dataset level #107

Closed ezmiller closed 1 year ago

ezmiller commented 1 year ago

Goal

This PR builds on the new column API by lifting most of the functions in tablecloth.column.api so that they can now be called on the dataset.

There are two signatures that have been lifted here:

  1. (dataset target-column columns-selector) => dataset
  2. (dataset columns-selector) => scalar

For the first signature, you can do this kind of thing:

(dscol/* ds :c [:a :b])
;; => _unnamed [4 3]:

| :a | :b | :c |
|---:|---:|---:|
|  1 |  5 |  5 |
|  2 |  6 | 12 |
|  3 |  7 | 21 |
|  4 |  8 | 32 |
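
To make the first signature concrete, here is a minimal sketch in plain Clojure. It is not tablecloth's implementation: the "dataset" is modeled as a map from column names to vectors, and lift-to-dataset and col-* are hypothetical names introduced only for this illustration.

```clojure
;; Toy model (an assumption for illustration): a dataset is a map from
;; column name to a vector of values. Real tablecloth datasets differ.

(defn lift-to-dataset
  "Hypothetical helper: wraps a column-level fn so that it has the
  signature (dataset target-column columns-selector) => dataset."
  [col-fn]
  (fn [ds target-col cols-selector]
    ;; Look up each selected column, apply the column-level fn to them,
    ;; and store the result under the target column name.
    (assoc ds target-col (apply col-fn (map ds cols-selector)))))

(defn col-*
  "Hypothetical column-level multiplication: elementwise over vectors."
  [& cols]
  (apply mapv * cols))

(def ds-* (lift-to-dataset col-*))

(def ds {:a [1 2 3 4] :b [5 6 7 8]})

(ds-* ds :c [:a :b])
;; => {:a [1 2 3 4], :b [5 6 7 8], :c [5 12 21 32]}
```

The lifted function only adds the selection and assignment steps; the arithmetic itself stays in the column-level function.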

For the second signature, the story is a bit more complicated. There are a number of operator functions in tablecloth.column.api.operators that return scalars, for example reduce-* or sum. We decided to lift these because they may be useful within aggregation expressions.

However, we realized that lifting them in that way doesn't help make the aggregation expressions more terse. For example, as we've lifted the functions in this PR, we can do something like this:

(-> my-dataset
    (tc/group-by [:x])
    (tc/aggregate (fn [ds] (tcc/mean (ds :y)))))

This expression could be more terse if we didn't need to supply the anonymous function to tc/aggregate. Following @genmeblog's suggestion, we are therefore considering re-lifting these fns in another PR so that they can act as aggregator functions themselves. We won't, however, do this here; I wanted to provide this context to make it clear that the second signature is still a WIP.
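
As a rough sketch of what "functioning as an aggregator" could mean, the plain-Clojure example below turns a scalar-returning column fn into a function of a column name that returns a function of a dataset. This is only one possible shape, not the design of the follow-up PR; lift-as-aggregator, mean, and mean-of are hypothetical names, and the dataset is again modeled as a map of vectors.

```clojure
;; Toy model (an assumption for illustration): a dataset, or one group
;; of a grouped dataset, is a map from column name to a vector.

(defn mean
  "Hypothetical column-level mean over a vector of numbers."
  [xs]
  (/ (reduce + xs) (double (count xs))))

(defn lift-as-aggregator
  "Hypothetical helper: given a scalar-returning column fn, returns a fn
  of a column name that yields a (dataset) => scalar aggregator."
  [col-fn]
  (fn [col-name]
    (fn [ds] (col-fn (ds col-name)))))

(def mean-of (lift-as-aggregator mean))

;; One group, as an aggregation step might see it.
(def group {:x [1 1] :y [2 4]})

((mean-of :y) group)
;; => 3.0
```

With a helper of this shape, the anonymous function in the aggregate call could in principle be replaced by something like (mean-of :y).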

Other Changes in this PR

This PR also improves the utility functions that we use to do the lifting. We add a new utility file, src/tablecloth/utils/codegen.clj, that holds the key functions we use for lifting and code generation. The key fn there is do-lift, which now takes a lift plan that looks like:

{:target-ns 'tablecloth.api.operators
 :source-ns 'tablecloth.column.api.operators
 :lift-fn-lookup {['+ '- '*] {:lift-fn lift-fn}}
 :deps ['tablecloth.api.lift_operators]
 :exclusions '[* + -]}

A key part of this lift plan is the lift-fn-lookup, where we specify which lift functions to use for which fn symbols. Each key is a vector of function symbols, and each value is a map containing a :lift-fn key and, where needed, :optional-args that will be passed to that fn:

{ ['even?
   'finite?
   'infinite?
   'mathematical-integer?
   'nan?
   'neg?
   'not
   'odd?
   'pos?
   'round
   'zero?] {:lift-fn lift-op
             :optional-args {:new-args {'[x] {'arg 'x}
                                        '[x options] {'arg 'x}}}}}

This change came from a suggestion by @daslu, who pointed out that doing it this way is a bit more transparent. Previously we packed the :optional-args inside of functions, so they were less available for the kind of analysis @daslu did to look at how we lift fns here.
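
Because the lookup is plain data, it can be inspected without calling any lift function. The sketch below shows one way to flatten a lookup of this shape into a per-symbol map. expand-lookup is a hypothetical helper, not part of codegen.clj, and the lift functions are represented by quoted symbols as stand-ins.

```clojure
;; A small lookup in the same shape as the lift plan's :lift-fn-lookup.
;; 'lift-op and 'lift-fn are quoted stand-ins for the real lift functions.
(def lift-fn-lookup
  {['even? 'odd?] {:lift-fn 'lift-op
                   :optional-args {:new-args {'[x] {'arg 'x}}}}
   ['+ '- '*]     {:lift-fn 'lift-fn}})

(defn expand-lookup
  "Hypothetical helper: turns {[sym ...] plan} into {sym plan}."
  [lookup]
  (into {}
        (for [[syms plan] lookup
              sym syms]
          [sym plan])))

(def per-symbol (expand-lookup lift-fn-lookup))

(:lift-fn (per-symbol '*))
;; => lift-fn

(get-in per-symbol ['odd? :optional-args :new-args])
;; => {[x] {arg x}}
```

Keeping :optional-args as data next to :lift-fn is what makes a query like the last one possible.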