scicloj / tablecloth

Dataset manipulation library built on top of tech.ml.dataset
https://scicloj.github.io/tablecloth
MIT License

verify if ds metadata propagates after TC operations #160

Open genmeblog opened 1 month ago

genmeblog commented 1 month ago

https://clojurians.zulipchat.com/#narrow/stream/151924-data-science/topic/meta.20docs/near/456049819

Jacob-Kroeze commented 1 month ago

This example shows how adding column descriptions might work.

(require '[tablecloth.api :as tc])

(let [ds' (tc/dataset {:foo ["bar" "baz"]
                       :bar ["baz" "flimflam"]}
                      {:dataset-name "this is my data"})
      docs {:foo "A nice colum"
            :bar "The favorite"}
      ;; column name as stored in the column's own metadata
      col-name #(-> % meta :name)]
  (-> ds'
      (tc/update-columns :all (fn [col] (vary-meta col assoc :description (-> col col-name docs))))
      (->> (map #(-> % val meta)))))

=>
({:categorical? true, :name :foo, :datatype :string, :n-elems 2, :description "A nice colum"}
 {:categorical? true, :name :bar, :datatype :string, :n-elems 2, :description "The favorite"})

We can't use alter-meta! (the in-place counterpart of vary-meta) because columns aren't reference types such as vars.
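For reference, here is a minimal plain-Clojure sketch of that distinction, using a vector as a stand-in for a column (the names `v`, `v'` and `a` are made up for illustration, and nothing here touches tablecloth or tmd). `vary-meta` returns a new value carrying the updated metadata, while `alter-meta!` from `clojure.core` changes metadata in place and only accepts reference types such as vars and atoms.

```clojure
;; A vector stands in for a column: it is an immutable value with metadata.
(def v (with-meta [1 2 3] {:name :foo}))

;; vary-meta returns a NEW value; v itself is left untouched.
(def v' (vary-meta v assoc :description "A nice column"))

(meta v)   ; still only has :name
(meta v')  ; has :name and :description

;; alter-meta! mutates metadata in place, but only on reference types.
(def a (atom 0))
(alter-meta! a assoc :description "a counter")
(meta a)   ; now has :description
```

Calling `alter-meta!` on `v` throws, since a vector is not a reference type.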

The tmd documentation states that datasets attempt to preserve metadata on columns, so I think any loss of metadata would be considered a bug (though that isn't stated officially). Here is a link to the tmd guide about column metadata: https://github.com/techascent/tech.ml.dataset/blob/7dbda7ab2923521d298c1be3d257b3563b4f1efc/topics/200-quick-reference.md?plain=1#L47

tmd sets metadata on columns in col-impl/new-column, but I couldn't find a clear path for passing column metadata in from, for example, ds/->dataset or ds/dataset.

Here the dataset with the added metadata is bound to a new var, and the metadata is still there:

(let [ds' (tc/dataset {:foo ["bar" "baz"]
                       :bar ["baz" "flimflam"]}
                      {:dataset-name "this is my data"})
      docs {:foo "A nice colum"
            :bar "The favorite"}
      col-name #(-> % meta :name)]
  (def new-var-ds'
    (-> ds'
        (tc/update-columns :all (fn [col]
                                  ;; (println "var?: " (var? col) "type: " (type col))
                                  (vary-meta col assoc :description (-> col col-name docs)))))))

(->> new-var-ds'
     (map #(-> % val meta)))
;=>
({:categorical? true, :name :foo, :datatype :string, :n-elems 2, :description "A nice colum"}
 {:categorical? true, :name :bar, :datatype :string, :n-elems 2, :description "The favorite"})

After a further operation, however, the :description metadata is gone:

(-> new-var-ds'
      (tc/update-columns :all (partial map #(str % "!!!!")))
      (->> (map #(-> % val meta))))
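A possible explanation, sketched in plain Clojure: `map` returns a fresh lazy seq and does not copy the metadata of its input, so anything built from that seq starts without `:description`. Whether tablecloth's update-columns behaves the same way internally is an assumption on my part. The vector `xs` below is only a stand-in for a column, and `mapped` and `kept` are made-up names.

```clojure
;; Stand-in for a column that already carries a :description.
(def xs (with-meta ["bar" "baz"]
          {:name :foo :description "A nice column"}))

;; map builds a new lazy seq; the input's metadata is not carried over.
(def mapped (map #(str % "!!!!") xs))
(meta mapped) ; nil

;; Re-attaching the original metadata explicitly keeps :description.
(def kept (with-meta (mapv #(str % "!!!!") xs) (meta xs)))
(meta kept)
```

If the same thing happens inside update-columns, one workaround to try would be to capture `(meta col)` inside the function and re-attach it to the result; I have not verified that against tablecloth.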