techascent / tech.ml.dataset

A Clojure high performance data processing system
Eclipse Public License 1.0

unexpected "distinct" of float columns after categorical->number #414

Closed behrica closed 2 months ago

behrica commented 2 months ago
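;; Assumed aliases: tc = tablecloth.api, ds = tech.v3.dataset
(require '[tablecloth.api :as tc]
         '[tech.v3.dataset :as ds])

(->
  (tc/dataset {:x1 [1 2 4 5 6 5 6 7]
               :x2 [5 6 6 7 8 2 4 6]
               :y [:a :b :b :a :c :a :b :d]})
  (ds/categorical->number [:y])
  (get :y)
  distinct)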

-> (2.0 3 2 1 0)

So the column :y contains both 2.0 and 2 (five distinct values for only four categories), which seems to be wrong.
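For context, the same effect can be reproduced in plain Clojure, without the dataset library. This is only a sketch of why `distinct` keeps both values; it does not show where the mixed types come from inside `categorical->number`. The vector `mixed` is a made-up stand-in for the column contents shown above, and the `long` coercion at the end is an illustrative workaround, not the library's fix.

```clojure
;; Plain Clojure, no libraries needed.
;; `mixed` is a hypothetical stand-in for the values seen in the :y column.
(def mixed [2.0 3 2 1 0])

;; `=` treats a long and a double as different even when the numeric value
;; matches; `==` compares by numeric value only.
(println (= 2 2.0))    ; false
(println (== 2 2.0))   ; true

;; distinct tracks seen items in a hash set, which uses `=`-style equality,
;; so both 2.0 and 2 survive.
(println (distinct mixed))              ; (2.0 3 2 1 0)

;; Coercing every value to long first collapses the duplicate.
(println (distinct (map long mixed)))   ; (2 3 1 0)
```

So the surprising result follows from ordinary Clojure number equality once a column holds a mix of longs and doubles.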

See here: https://clojurians.zulipchat.com/#narrow/stream/236259-tech.2Eml.2Edataset.2Edev/topic/unexpected.20.22distinct.22.20of.20float.20columns

cnuernber commented 2 months ago

I think categorical->number should just return a long column and not a floating point column. The long column will auto-convert to reasonable double values if asked specifically for a double representation, so breakage to existing systems should be minimal.

harold commented 2 months ago

:+1: