sklearn-clj should handle categorical variables better

behrica commented 2 years ago

They should be re-encoded in the "predict" method. See here

https://github.com/scicloj/scicloj.ml/issues/10

for discussion and a concrete use case

behrica commented 2 years ago

This will not be fixed quickly. The handling of categorical variables in scicloj.ml is in general tricky. See one discussion for example here:

https://clojurians.zulipchat.com/#narrow/stream/236259-tech.2Eml.2Edataset.2Edev/topic/categorical.20maps.20-.20datatype.20.3F

The existing code for it is as well split between "tech.v3.dataset" and "scicloj.ml.smile".

I fear as well that each model could behaves slightly different for categorical variables. Example:

If a model gets trained with target column values being "int" (1 0 1 1 0 1 0).

will it the "for sure" predict "ints" or "might" it predict "floats" (1.0 0.0 1.0)

What does this mean for implementing a reliable and uniform "transform and inverse-lookup" of "strings" into "numeric" ? Is it even guaranteed that in python sklearn all models behave the same in this ?

behrica commented 2 years ago

Additionally we have the "python->JVM" transformation of values in sklearn-clj. I am not sure, how those behave regarding the precise datatypes (again float vs int vs double)

Currently I do the simplest thing to convert the "prediction" of the python model to JVM, just relying on libpython-clj automatic conversion:

https://github.com/scicloj/sklearn-clj/blob/201673f688e58d63791af53d93f4031a7ee2ec7c/src/scicloj/sklearn_clj.clj#L169

Whatever this returns will be the final "column" of the dataset

ash-mcc commented 2 years ago

What does this mean for implementing a reliable and uniform "transform and inverse-lookup" of "strings" into "numeric" ?

I just wanted to feedback that - even without a general fix (at least, for now) - there is nothing to stop users (like me) from inserting their own column-values->categorical which will work in their very specific context. So metamorph's model-agnostic pipelining still provides value :-)

My example of a hacked column-values->categorical...

(defn column-values->categorical 
  "A derivative of tech.v3.dataset.modelling/column-values->categorical,
   which contains a workaround for issue ( https://github.com/scicloj/sklearn-clj/issues/3 )
   of sklearn-clj's handling of categorical values.
   This workaround is for a very specific context but, 
   it shouldn't be used in another context without testing/modification."
  [dataset src-column in-play-model]
  {:pre [(some? src-column)
         (keyword? (-> in-play-model :options :model-type))
         (map? (-> in-play-model :target-categorical-maps (get src-column) :lookup-table))]}

  (if (= :sklearn.classification/random-forest-classifier
         (-> in-play-model :options :model-type))

    ;; the new, hacky path which deals with an sklearn model
    (do 
      (log/info "using a hacked ->categorical to workaround the issue https://github.com/scicloj/sklearn-clj/issues/3")
      (let [lookup-table (-> in-play-model :target-categorical-maps (get src-column) :lookup-table set/map-invert)
            reverse-lookup-table (set/map-invert lookup-table)]
        (->> (ds/rows dataset :as-maps)
             (map (fn [row-as-a-map] 
                    (-> row-as-a-map
                        (get src-column)
                        int
                        reverse-lookup-table))))))

    ;; the original path
    (ds-mod/column-values->categorical dataset src-column)))

behrica commented 2 years ago

That's indeed the adhoc-solution. The table metadata contains indeed the mapping, so you can code a case specific (=model specific) inverse-mapping, observing what the model does return.

This code is as well future proof, because the table metadata will not change in future versions of scicloj.ml.

behrica commented 2 years ago

I think it should be rather easy to change sklearn-clj to make your original code work, so the line:

(ds-mod/column-values->categorical :species)

doing the right think.

I just need to remember in some form the originally created mapping (created in your pipeline via (ds-mm/categorical->number cf/categorical) and re-attach it as column metadata into the prediction result.

The strange thing to me is, that @cnuernber has not done this for the Smile models,I believe. He "re-creates" the mapping table by using the "probability-distribution" matrix of the prediction result and I think assuming a certain order of the columns in that table, one column per potential class. This assumes that the "mapping" is one in a certain way .

"if all goes well", the 2 mappings are identical.... But what , if not... I see there a high risk, that this "wrongly re-maps" in certain edge cases, and the user would not notice, because he will just get a "wrong prediction".

ash-mcc commented 2 years ago

I hear your concern over the mapping used in SMILE (but I know nothing about its inner workings). Just to say that, categorical values with an XGBoost model do seem to be handled ok in pipeline. (Also, it'd be great if I didn't have to use my point-solution!)

behrica commented 2 years ago

@ash-mcc Could you test with commit b1b48d61a648633bd18c5740d615c2a41a3cdebd ?

I took you initial code and added it as a test. It seems to work.

ash-mcc commented 2 years ago

@behrica commit b1b48d6 works well for me.

I removed my point solution for column-values->categorical, went back to using the one from the namespace tech.v3.dataset.modelling ...and it worked well in my application.

Thank you very much for fixing this - Ash

scicloj / sklearn-clj

sklearn-clj should handle categorical variables better #3