techascent / tech.ml.dataset

A Clojure high performance data processing system
Eclipse Public License 1.0

filter-by #56

Closed genmeblog closed 4 years ago

genmeblog commented 4 years ago

There is a use case where we want to filter based on some column calculations. These calculations can't be done by a simple map. For example, we may need to filter by a (moving) average or by an R-style rank. The flow looks like this:

  1. Take one or more columns
  2. Produce a new temporary column (calculate a moving average, for example)
  3. Filter by this new column, or gather indices and select rows
  4. Drop the temporary column
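
The four steps above can be sketched in plain Clojure. This is only an illustration under stated assumptions: the dataset is modelled as a map of column name to vector, and `moving-average` and `select-rows` are hypothetical helpers written for this example, not functions from tech.ml.dataset.

```clojure
;; Hypothetical sketch: a "dataset" here is just a map of column name -> vector.
;; None of these helpers come from tech.ml.dataset.

(defn moving-average
  "Trailing moving average over a window of n values.
  Positions that do not yet have a full window are nil."
  [n xs]
  (let [xs (vec xs)]
    (mapv (fn [i]
            (when (>= (inc i) n)
              (/ (reduce + (subvec xs (- (inc i) n) (inc i))) n)))
          (range (count xs)))))

(defn select-rows
  "Keep only the rows at the given indices, in every column."
  [ds idxs]
  (into {} (map (fn [[k col]] [k (mapv col idxs)])) ds))

(def ds {:day   [1 2 3 4 5]
         :price [10 20 30 20 10]})

(def result
  (let [;; steps 1-2: take a column and derive a temporary column from it
        with-tmp (assoc ds :tmp-avg (moving-average 2 (:price ds)))
        ;; step 3: gather the indices where the temporary column passes the test
        idxs     (keep-indexed (fn [i v] (when (and v (> v 20)) i))
                               (:tmp-avg with-tmp))]
    ;; step 4: select those rows, then drop the temporary column
    (dissoc (select-rows with-tmp idxs) :tmp-avg)))

;; result => {:day [3 4], :price [30 20]}
```

The temporary column exists only to drive the row selection, which is why a filter that accepts precomputed data directly would remove steps 2 and 4.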

Here is a concrete case with rank: https://github.com/genmeblog/techtest/blob/master/src/techtest/datatable_dplyr.clj#L554

One solution I can think of is a variant of filter (filter-by, maybe?) that takes a sequence and selects only the rows whose corresponding element satisfies the predicate.

cnuernber commented 4 years ago

Something like (filter-by data dataset), where each truthy element of data marks an index to keep. You could then construct data by more efficient means than iterating through the dataset as a sequence of maps.
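
A minimal sketch of that idea, assuming the dataset is a plain map of column name to vector. The `filter-by` below only mirrors the argument order suggested above and is not the tech.ml.dataset or tablecloth implementation; `ranks` is a hypothetical helper standing in for an R-style rank.

```clojure
;; Hypothetical sketch, not the tech.ml.dataset or tablecloth API.
;; A "dataset" is a plain map of column name -> vector.

(defn filter-by
  "Keep the rows of ds whose position in data holds a truthy value."
  [data ds]
  (let [idxs (vec (keep-indexed (fn [i v] (when v i)) data))]
    (into {} (map (fn [[k col]] [k (mapv col idxs)])) ds)))

(defn ranks
  "1-based ascending rank of each value; ties are broken by position."
  [xs]
  (let [xs    (vec xs)
        ;; row indices ordered by their value (sort-by is stable)
        order (sort-by xs (range (count xs)))]
    (reduce (fn [acc [r i]] (assoc acc i (inc r)))
            (vec (repeat (count xs) nil))
            (map-indexed vector order))))

(def ds {:name  ["a" "b" "c" "d"]
         :score [40 10 30 20]})

;; data is computed once from the column, with no per-row maps involved
(def keep? (map #(<= % 2) (ranks (:score ds))))

(def lowest-two (filter-by keep? ds))

;; lowest-two => {:name ["b" "d"], :score [10 20]}
```

Because `data` is computed from whole columns, no temporary column has to be added to the dataset and dropped afterwards.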

That alone (having data calculated efficiently) would further close the gap between R and Clojure in the filter example discussed previously on Zulip.

cnuernber commented 4 years ago

Is this addressed by your new API?

genmeblog commented 4 years ago

Yes!

cnuernber commented 4 years ago

Addressed by: https://github.com/scicloj/tablecloth