techascent / tech.ml.dataset

A Clojure high performance data processing system
Eclipse Public License 1.0

make `invert-categorical-map` more strict on unknown reverse mapping values #395

Open behrica opened 7 months ago

behrica commented 7 months ago

To make the categorical-mapping code less brittle, I think we should check and fail in more situations. Here is one:

(require '[tech.v3.dataset.categorical :as ds-cat]
          '[tech.v3.dataset.modelling :as ds-mod]
          '[tech.v3.dataset :as ds])

(def cat-map
  (->
   (ds/->dataset {:a [:x :y]})
   (ds-cat/fit-categorical-map :a)))

(ds-cat/invert-categorical-map (ds/->dataset {:a [0.342 1.6657]})
                               {:src-column :a
                                :lookup-table (:lookup-table cat-map)})

The initial mapping was derived as x -> 1 and y -> 0, but the current code happily maps 0.342 back to a category. In my view this should fail, in the same way that other numbers such as 3 and 4 fail: "Unable to find src value for numeric value 0.342".
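To make the requested behavior concrete, here is a minimal sketch in plain Clojure. `strict-invert` is a hypothetical helper, not a function in tech.ml.dataset, and it works on a plain lookup map rather than on a dataset column. It assumes that only exactly integral values present in the lookup table should map back.

```clojure
;; Hypothetical helper, NOT part of tech.ml.dataset: a strict inverse lookup
;; over a plain lookup table such as {:x 1, :y 0}.
(defn strict-invert
  [lookup-table values]
  ;; Build the inverse map: numeric code -> original category.
  (let [inverse (into {} (map (fn [[k v]] [(long v) k])) lookup-table)]
    (mapv (fn [v]
            (let [d   (double v)
                  idx (long d)]
              ;; Accept only values that are exactly integral and known.
              (if (and (== d (double idx)) (contains? inverse idx))
                (get inverse idx)
                (throw (ex-info (str "Unable to find src value for numeric value " v)
                                {:value v})))))
          values)))

(strict-invert {:x 1, :y 0} [0.0 1.0])
;; => [:y :x]
```

With this check, `(strict-invert {:x 1, :y 0} [0.342 1.6657])` throws on 0.342 instead of silently mapping it to a category.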

cnuernber commented 6 months ago

I'm not really sure what to do here. If you had chosen values that the long cast does not turn into 0 or 1, you would have gotten an exception. Perhaps we should use Math/round as opposed to a pure long cast.
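For reference, this is how the two conversions mentioned above behave on the example values in plain Clojure. It only illustrates the casts themselves, not the library's internals.

```clojure
;; A long cast truncates toward zero.
(long 0.342)         ;; => 0
(long 1.6657)        ;; => 1

;; Math/round rounds to the nearest long.
(Math/round 0.342)   ;; => 0
(Math/round 1.6657)  ;; => 2
```

With rounding, 1.6657 would become 2 and presumably fail the lookup, but 0.342 would still be accepted as 0.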

behrica commented 6 months ago

This looks error-prone to me, but I'm not sure what to fix either. The mapping back below works due to the long cast:

(-> (ds/->dataset {:x [:a :b]})
    (ds/categorical->number [:x])
    :x
    meta
    :categorical-map
    :lookup-table)

;; => {:a 0, :b 1}

|  :x |
|----:|
| 0.0 |
| 1.0 |

behrica commented 6 months ago

I would expect the above to produce a lookup map {:a 0.0, :b 1.0}, and that all values except 0.0 and 1.0 would fail when mapping back.

cnuernber commented 6 months ago

The issue there is floating-point comparison.
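A short plain-Clojure illustration of why doubles are awkward as exact lookup keys. It uses ordinary maps only and makes no claim about how the library stores its lookup table.

```clojure
;; Arithmetic results rarely match a literal exactly.
(= 0.3 (+ 0.1 0.2))            ;; => false
(get {0.3 :a} (+ 0.1 0.2))     ;; => nil

;; `=` also distinguishes longs from doubles; `==` compares numerically.
(= 1 1.0)                      ;; => false
(== 1 1.0)                     ;; => true

;; So a map keyed by doubles does not find a long key.
(get {0.0 :a, 1.0 :b} 1)       ;; => nil
(get {0.0 :a, 1.0 :b} 1.0)     ;; => :b
```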

behrica commented 4 days ago

This is also related to the new discussion: https://clojurians.zulipchat.com/#narrow/stream/236259-tech.2Eml.2Edataset.2Edev/topic/invert-categorical-map.20-.20regression.20tests