techascent / tech.ml.dataset

A Clojure high performance data processing system
Eclipse Public License 1.0

positional arguments in the style of R data.table? #299

Closed aaelony closed 2 years ago

aaelony commented 2 years ago

This is less of an "issue" per se and more of an enhancement request or request for comment (RFC)...

The R data.table library has a really nice, expressive syntax (see Basics 1b) in the form of

DT[i, j, by]

##   R:                 i                 j        by
## SQL:  where | order by   select | update  group by

Since tech.ml.dataset already implements this functionality via functions, it would be really cool to adopt the positional syntax of R's data.table via a Clojure function call (something like in Nathan Marz's Specter, where the position of the arguments affects the nature of what is output).

The idea here is that there could be a data.table function something akin to:

(data.table i j by)

that would be functionally powerful and expressive, yet a bit less verbose.

Any thoughts?

harold commented 2 years ago

This sounds interesting; I find these kinds of abbreviated syntaxes fascinating.

Do you have an example of some tech.ml.dataset code that you think would be improved by a new interface like this one?

It would be inspiring to see a concrete case that looks a lot better using a new (imaginary) interface. Personally, I've written a fair amount of t.m.d code that does SQLish things, and have found Clojure's arrow macros delightful, but I'm certainly biased.

aaelony commented 2 years ago

Certainly, Clojure thread-first and thread-last arrow macros are great! There are others but I don't find myself using them often.

Open to collaboration and suggestions on this positional arguments idea.

Lots of possible ways here, but one approach for a datatable named dt with columns :col-a :col-b :col-c :col-d :col-e :col-f, to compute the number of resulting rows, the sum of :col-b and the median of :col-c when :col-a is equal to 5, might be:


(let [i  ":col-a == 5"
      j  "N = .N, sum_of_b = sum(:col-b), median_of_c = median(:col-c)"
      by [:col-e :col-f]]
  (data.table dt i j by))

I'm not tied to how the i, j, and by are shown in the example above, but that's the general idea.
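For comparison, here is a minimal sketch of the same three positions written as plain Clojure data and functions instead of strings. It works on a seq of maps and uses only clojure.core. The name `dt-query`, the toy rows, and the argument shapes are hypothetical and are not part of tech.ml.dataset. The median is left out because clojure.core has no median function.

```clojure
;; Hypothetical helper, plain Clojure only; not part of tech.ml.dataset.
(defn dt-query
  "rows: seq of maps. i: row predicate (or nil to keep every row).
  j: map of output key -> function of the rows in one group.
  by: vector of grouping keys."
  [rows i j by]
  (let [kept (if i (filter i rows) rows)]
    (for [[group-key group-rows] (group-by #(select-keys % by) kept)]
      ;; Start from the grouping columns and add one entry per aggregate.
      (into group-key
            (map (fn [[out-key agg]] [out-key (agg group-rows)]))
            j))))

;; Made-up toy data for illustration.
(def rows
  [{:col-a 5 :col-b 1 :col-e "x" :col-f "p"}
   {:col-a 5 :col-b 2 :col-e "x" :col-f "p"}
   {:col-a 5 :col-b 4 :col-e "y" :col-f "q"}
   {:col-a 7 :col-b 8 :col-e "y" :col-f "q"}])

(def result
  (dt-query rows
            #(= 5 (:col-a %))                       ; i: filter
            {:N        count                        ; j: aggregates
             :sum-of-b #(reduce + (map :col-b %))}
            [:col-e :col-f]))                       ; by: grouping keys
```

Each element of `result` is one map per group, holding the grouping columns plus the computed aggregates.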

The data.table R package syntax is highly expressive and useful, so I would take insight from that for more examples. Kind of like einsum notation for matrices that you might find in numpy.

harold commented 2 years ago

A few thoughts:

I'm not sure about your example. It seems i is filtering by a column value, and j is reducing some scalar values, but what is by doing there? Is it sorting, or grouping, or something else? The scalar values from j are all computed commutatively, so sorting wouldn't matter. Under a grouping (nested grouping?) the shape of the return value seems unclear.

Something like this could be implemented on top of t.m.d relatively easily, so you can try these sorts of things out on your own to find where the wins are. I would recommend you try it; I suspect it would be very edifying to learn the necessary extents of the t.m.d API. When there is some advantage (most likely in economy of expression), integrating it into the library might make sense.

Moreover, I am personally skeptical of embedding the expression language in strings (with commas?!). There are already functions to accomplish these things (row-count and descriptive-stats), as well as the other expected operations (filter, sort, group-by). The use of such string sub-languages throws away Clojure's homoiconicity, and that's not something to do lightly, imho.

Do you have a more real-world example of t.m.d code you think could be improved by the use of an R data.table like interface?

aaelony commented 2 years ago

by in data.table syntax groups "by" a column or set of columns when specified.

From the documentation:

a) Grouping using by
– How can we get the number of trips corresponding to each origin airport?
ans <- flights[, .(.N), by = .(origin)]
ans
#    origin     N
# 1:    JFK 81483
# 2:    LGA 84433
# 3:    EWR 87400

## or equivalently using a character vector in 'by'
# ans <- flights[, .(.N), by = "origin"]

The idea is that functionality is already present in the TMD library, but could be "mapped" to this kind of positional syntax.
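To make the mapping concrete, here is a rough plain-Clojure counterpart of the `flights[, .(.N), by = .(origin)]` call above, assuming the rows are available as a seq of maps. The toy rows are made up for illustration, and the sketch uses only clojure.core rather than any TMD function.

```clojure
;; Made-up toy rows; the real flights data is not reproduced here.
(def flights
  [{:origin "JFK"} {:origin "LGA"} {:origin "JFK"} {:origin "EWR"}])

;; Count rows per origin: empty i, j = .N, by = origin.
(def trips-per-origin
  (frequencies (map :origin flights)))
```

Here `trips-per-origin` is a map from each origin to its row count.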

I'll think about better examples and come up with something when I get a chance as a separate project. Okay to close this ticket.

Thank you

cnuernber commented 2 years ago

Another angle: if there is a sizeable system or body of code that uses this sublanguage and notation that we could take advantage of, that would help a lot.

Your exact example would look something like:

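;; Assumes dt is the dataset from the example above.
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.reductions :as ds-reduce])

(ds-reduce/group-by-column-agg
 [:col-e :col-f]
 {:N           (ds-reduce/row-count)
  :sum-of-b    (ds-reduce/sum :col-b)
  :median-of-c (ds-reduce/prob-median :col-c)}
 [(ds/filter-column dt :col-a 5)])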

You would need to understand some level of translation between those mappings as you noted.

aaelony commented 2 years ago

Thanks for that. I concur 100% with the above comments.

As a bit of background, data.table is quite common in the R community. Typically though, folks either prefer the tidyverse or the data.table manner of syntax.

Aside from succinctness and expressivity, data.table in R is often far more performant, so much so actually, that the tidyverse itself offers data.table as a backend for dplyr, if desired. In my estimation, experienced R programmers tend to prefer data.table because as a project grows in complexity and scope, verbosity is not always desirable.

It is useful in aggregation as in the example above, but, aggregation aside, the same positional syntax works equally for (conditional) variable assignment and variable creation. It is quite clever and succinct and the thought is "how to bring that" to an already capable TMD ecosystem if possible.

a few examples in the wild:

NB: it can be hard to detect in R code that doesn't explicitly use the data.table:: prefix in a library. A more comprehensive approach might be to grep the DESCRIPTION files for data.table in R repos.

That said, when I have a good set of examples I'll update this thread anew.