scicloj / tablecloth

Dataset manipulation library built on the top of tech.ml.dataset
https://scicloj.github.io/tablecloth
MIT License

Simplify access to row values in adjacent columns #13

Closed ashimapanjwani closed 3 years ago

ashimapanjwani commented 3 years ago

Notice the difference in complexity between the following R and Clojure code, both of which achieve similar results.

tibble(
  x = 1:5, 
  y = 1, 
  z = x + y
)

(require '[tech.v3.datatype.functional :refer [+]])

(def ds (tablecloth/dataset [[:x (range 1 6)] [:y 1]]))
(-> ds 
    (tablecloth/add-or-replace-column :z (+ (ds :x) 
                                            (ds :y))))
genmeblog commented 3 years ago

Good point! However, I don't see an easy solution here. Maybe a macro?

genmeblog commented 3 years ago

Maybe something like this?

(with-columns ds [:x :y]
  (-> ds
      (add-or-replace-column :z (+ x y))))
genmeblog commented 3 years ago
(with-columns-> ds [:x :y]
    (add-or-replace-column :z (+ x y)))
genmeblog commented 3 years ago

Or maybe something different: we could create a tibble macro that works this way. Something like this:

(tibble [x (range 5)
         y 1
         z (+ x y)])

It acts like let, creating the columns one after another, and finally packs them into a dataset. What do you think?

genmeblog commented 3 years ago

What about this?

(let-dataset [x (range 1 6)
              y 1
              z (tech.v3.datatype.functional/+ x y)])
;; => _unnamed [5 3]:
;;    | :x | :y | :z |
;;    |----|----|----|
;;    |  1 |  1 |  2 |
;;    |  2 |  1 |  3 |
;;    |  3 |  1 |  4 |
;;    |  4 |  1 |  5 |
;;    |  5 |  1 |  6 |

(let-dataset [abc (range 10)
              def (range -10 0)
              zzz (tech.v3.datatype.functional/* abc def)]
             {:dataset-name "from macro"})

;; => from macro [10 3]:
;;    | :abc | :def | :zzz |
;;    |------|------|------|
;;    |    0 |  -10 |    0 |
;;    |    1 |   -9 |   -9 |
;;    |    2 |   -8 |  -16 |
;;    |    3 |   -7 |  -21 |
;;    |    4 |   -6 |  -24 |
;;    |    5 |   -5 |  -25 |
;;    |    6 |   -4 |  -24 |
;;    |    7 |   -3 |  -21 |
;;    |    8 |   -2 |  -16 |
;;    |    9 |   -1 |   -9 |
daslu commented 3 years ago

Nice!

Sorry, I hadn't thought about it earlier, but maybe plain let forms are just as good?

(let [abc (range 10)
      def (range -10 0)
      zzz (tech.v3.datatype.functional/* abc def)]
  (dataset {:dataset-name "from macro"}))
genmeblog commented 3 years ago

Well... you have to somehow feed your columns to a dataset. You probably meant:

(let [abc (range 10)
      def (range -10 0)
      zzz (tech.v3.datatype.functional/* abc def)]
  (dataset {:abc abc :def def :zzz zzz} {:dataset-name "from macro"}))
daslu commented 3 years ago

Oh, missed that. Thanks! :)

Oh, now I see why it is actually less verbose this way. That makes sense.

genmeblog commented 3 years ago

The macro itself is just:

(defmacro let-dataset
  ([bindings] `(let-dataset ~bindings nil))
  ([bindings options]
   (let [cols (take-nth 2 bindings)
         col-defs (mapv vector (map keyword cols) cols)]
     `(let [~@bindings]
        (dataset ~col-defs ~options)))))
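
To see what the macro does without loading tablecloth, here is a minimal sketch. The dataset function below is a hypothetical stand-in (it only collects the column pairs into a map), so the shape of the result is an assumption of this sketch and not what the library returns; the macro body is the one quoted above.

```clojure
;; Hypothetical stand-in for tablecloth's `dataset`, used only so this
;; sketch runs in a bare REPL: it collects [name value] pairs into a map.
(defn dataset
  [col-defs options]
  {:name    (:dataset-name options "_unnamed")
   :columns (into {} col-defs)})

;; Same macro as quoted in the thread.
(defmacro let-dataset
  ([bindings] `(let-dataset ~bindings nil))
  ([bindings options]
   (let [cols     (take-nth 2 bindings)                  ; binding symbols: x y z
         col-defs (mapv vector (map keyword cols) cols)] ; [[:x x] [:y y] [:z z]]
     `(let [~@bindings]
        (dataset ~col-defs ~options)))))

;; Later bindings can refer to earlier ones, exactly as in a plain let.
(def result
  (let-dataset [x (range 1 6)
                y 1
                z (map #(+ % y) x)]
    {:dataset-name "sketch"}))
```

The call expands to an ordinary let over the given bindings whose body passes [[:x x] [:y y] [:z z]] and the options to dataset, which is why each column can be computed from the ones bound before it.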
behrica commented 3 years ago

The "conciseness" of R here comes at a very high price...

Being able to say "z = x + y" becomes costly the moment you want to "program with dplyr" (that is, write your own functions): https://dplyr.tidyverse.org/articles/programming.html

In Clojure, by contrast, the step from the code above to a function that receives the names x, y, z as parameters is very small, while in R it is very big.

I really enjoy working with Clojure + tablecloth because no such magic is needed. Of course, an (optional!) macro for more concise code is a standard pattern in Clojure. But it would also make the step to "parameterized" variable names a big one.
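
As a rough illustration of how small that step is, here is a sketch in plain Clojure. It models a dataset as a map from column name to vector, which is an assumption made only to keep the example free of dependencies; add-sum-column is a hypothetical helper, not a tablecloth function.

```clojure
;; A dataset is modelled here as a plain map of column-name -> vector.
;; This is a stand-in for illustration, not tablecloth's representation.
(defn add-sum-column
  "Returns `ds` with a new column `target` holding the element-wise sum
  of columns `a` and `b`. The column names are ordinary arguments."
  [ds target a b]
  (assoc ds target (mapv + (ds a) (ds b))))

(def ds {:x [1 2 3 4 5]
         :y [1 1 1 1 1]})

(def with-z (add-sum-column ds :z :x :y))
;; => {:x [1 2 3 4 5], :y [1 1 1 1 1], :z [2 3 4 5 6]}
```

Because column names are just keywords, passing them around needs no quoting or unquoting machinery.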

behrica commented 3 years ago

I think this is very much related to the discussion on "concise" vector arithmetic in Clojure. Macros don't compose, as we say.

But they can result in less code.

behrica commented 3 years ago

The original request is similar to asking Clojure to allow this:

(def m {:a 1
        :b 2
        :c (+ :a :b)})

The idiomatic way to solve this in Clojure is to use let:

(def m (let [a 1 b 2 c (+ a b)]
         {:a a :b b :c c}))
behrica commented 3 years ago

And we should not forget that R is vectorized from the ground up; basically, it has only vectors. Any scalar value is in reality a vector of size 1.

So comparing Clojure with R on the conciseness of vector arithmetic is an unfair comparison.
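
For contrast, a small sketch of what this looks like in core Clojure, where + is defined on numbers only. The variable names are illustrative; the thread's examples use tech.v3.datatype.functional/+ to get element-wise behaviour on columns instead.

```clojure
;; clojure.core/+ works on numbers, so element-wise addition of two
;; sequences is spelled out with mapv. Calling (+ x y) directly on two
;; vectors throws a ClassCastException.
(def x [1 2 3 4 5])
(def y [10 20 30 40 50])

(def elementwise (mapv + x y))
;; => [11 22 33 44 55]

;; A scalar has to be broadcast by hand, e.g. by closing over it.
(def plus-one (mapv #(+ % 1) x))
;; => [2 3 4 5 6]
```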

daslu commented 3 years ago

@behrica I think you're right, that adding a macro introduces additional complexity and less composability.

Another option would be to change the semantics of add-or-replace-column, so that it works sequentially, adding one column after another.

With such semantics,

(require '[tech.v3.datatype.functional :refer [+]])

(tablecloth/dataset [
 [:x (range 1 6)]
 [:y 1]
 [:z #(+ (:x %) (:y %))]])

would simply add the column :x, then :y, then :z (relying on :x and :y already existing, which is what makes that function work).

That seems to address the challenge presented by @ashimapanjwani .

What do you think?

genmeblog commented 3 years ago

For this case (and also for the case from this thread https://clojurians.zulipchat.com/#narrow/stream/151763-beginners/topic/handling.20successive.20alterations), I would stay at the let level and then pack everything into a dataset at the end.

@daslu's example above creates ambiguity about how a dataset is created from various sources. In fact, add-or-replace-columns (which will be renamed to add-columns in the next version, see #16) already does that; only one small fix is needed. Replace reduce-kv with plain reduce here: https://github.com/scicloj/tablecloth/blob/master/src/tablecloth/api/columns.clj#L134 to make the following work.

(tablecloth/add-columns (tablecloth/dataset) [
 [:x (range 1 6)]
 [:y 1]
 [:z #(+ (:x %) (:y %))]])
genmeblog commented 3 years ago
(-> (dataset)
    (add-columns [[:x (range 1 6)]
                  [:y 1]
                  [:z #(tech.v3.datatype.functional/+ (:x %) (:y %))]]))

;; => _unnamed [5 3]:
;;    | :x | :y | :z |
;;    |----|----|----|
;;    |  1 |  1 |  2 |
;;    |  2 |  1 |  3 |
;;    |  3 |  1 |  4 |
;;    |  4 |  1 |  5 |
;;    |  5 |  1 |  6 |
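
The sequential behaviour shown above can be sketched in plain Clojure as a reduce over ordered column specs. This is only an illustration under stated assumptions: the dataset is modelled as a plain map, and add-columns-sketch is a hypothetical name, not the code in tablecloth's columns.clj.

```clojure
;; Stand-in dataset: a plain map of column-name -> value(s).
(defn add-columns-sketch
  "Adds columns one at a time, in the order given. When a column value is
  a function, it is called with the dataset built so far."
  [ds col-specs]
  (reduce (fn [acc [col-name v]]
            (assoc acc col-name (if (fn? v) (v acc) v)))
          ds
          col-specs))

(def result
  (add-columns-sketch
   {}
   [[:x (vec (range 1 6))]
    [:y 1]
    ;; :z is computed from :x and :y, which already exist at this point.
    [:z #(mapv (fn [x] (+ x (:y %))) (:x %))]]))
;; => {:x [1 2 3 4 5], :y 1, :z [2 3 4 5 6]}
```

Reducing over a vector of pairs keeps the order the caller wrote, which is what lets :z depend on :x and :y.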
daslu commented 3 years ago

Thanks @genmeblog . Can you explain the comment about ambiguity?

genmeblog commented 3 years ago

Yep! There is some logic behind creating a dataset from various data structures. Almost all of them fall into two categories:

For both, I use the t.m.d ->dataset function. For the case above, I need to bypass this path and use the add-columns function instead. I don't know the details behind the scenes in ->dataset, but the logic there is quite complicated, and I can't guarantee the same behaviour for every possible data structure (e.g. map vs. seq of pairs).

daslu commented 3 years ago

Oh, I see, thanks!

genmeblog commented 3 years ago

Introduced the let-dataset API function.