scicloj / scicloj.ml

A Clojure machine learning library
Eclipse Public License 2.0

min-max transformer does not work in serialized ctx #14

Closed behrica closed 1 year ago

behrica commented 1 year ago

The following pipeline, which contains a mm/min-max-scale step, cannot be made to work when train and predict run in two different JVM sessions.

(ns iris
  (:require [clojure.java.io :as io]
            [taoensso.nippy :as nippy]
            [scicloj.ml.core :as ml]
            [scicloj.ml.metamorph :as mm]
            [tablecloth.api :as tc]
            [tech.v3.dataset :as ds]))

(def numerical-cols ["sepal_length", "sepal_width", "petal_length", "petal_width"])

(def target "label")
(def labels ["Iris-versicolor" "Iris-setosa" "Iris-virginica"])

(def dataset (-> "iris.csv"
                 (ds/->dataset)))

(def splits (-> (tc/split->seq dataset
                               :holdout
                               {:seed        123
                                :ratio       [0.7 0.3]
                                :split-names [:train :test]})
                (first)))

(def train-data (:train splits))
(def test-data (:test splits))

(def model-type :smile.classification/logistic-regression)

(def pipeline (ml/pipeline
                (mm/min-max-scale numerical-cols {})
                (mm/categorical->number [target])
                (mm/set-inference-target target)
                {:metamorph/id :model}
                (mm/model {:model-type model-type})))

(defn train []
  (-> (pipeline {:metamorph/data train-data
                 :metamorph/mode :fit})
      (dissoc :metamorph/data)
      (update-in [:model :model-data] dissoc :smile-df-used)))

(defn predict [dataset ctx]
  (-> ctx
      (assoc :metamorph/data (tc/add-column dataset target [nil] :cycle)
             :metamorph/mode :transform)
      (pipeline)
      :metamorph/data
      (ds/drop-columns [target])))

(defn load-ctx [path]
  (-> path
      (io/resource)
      (nippy/thaw-from-file)))

;; RUN THIS FIRST

(comment
  (nippy/freeze-to-file "iris.nippy" (train)))

;; THEN RESTART REPL BEFORE RUNNING THE NEXT FORM


(comment
  (let [ctx (load-ctx "iris.nippy")]
    (predict test-data ctx)))

behrica commented 1 year ago

see as well: https://clojurians.zulipchat.com/#narrow/stream/283491-scicloj.2Eml-dev/topic/Serialised.20context.20does.20not.20work.20across.20REPL.20sessions

behrica commented 1 year ago

The code above can be made to work (without serializing the ctx) by:

  1. Getting the bytes from the model:
(def trained-ctx (train))
(def bytes-of-model (-> trained-ctx :model :model-data :model-as-bytes))
;; write bytes-of-model to disk
  2. Assigning a fixed id to the min-max step:
(ml/pipeline
   {:metamorph/id :min-max-scale}
   (mm/min-max-scale numerical-cols {})
   (mm/categorical->number [target])
   (mm/set-inference-target target)
   {:metamorph/id :model}
   (mm/model {:model-type model-type}))
  3. Recreating the needed state before :transform in the new JVM:

;; read bytes-of-model from disk

;; predict-pipeline is presumably the pipeline defined in step 2
(def prediction
  (predict-pipeline
    {:metamorph/mode :transform
     :metamorph/data (tc/add-column train-data target [nil] :cycle)
     :min-max-scale
     {:fit-minmax-xform
      {:min -0.5
       :max 0.5
       :column-data
       {"sepal_length" {:min 4.3 :max 7.9}
        "sepal_width"  {:min 2.2 :max 4.4}
        "petal_length" {:min 1.0 :max 6.9}
        "petal_width"  {:min 0.1 :max 2.5}}}}
     :model {:model-data {:model-as-bytes bytes-of-model}
             :options {:model-type :smile.classification/logistic-regression}
             :feature-columns ["sepal_length" "sepal_width" "petal_length" "petal_width"]
             :target-columns ["species"]
             :target-categorical-maps
             {"species"
              {:lookup-table {"versicolor" 0 "setosa" 1 "virginica" 2}
               :src-column "species"
               :result-datatype :float64}}}}))

The alternative to this cumbersome setup of :transform is to merge in the trained ctx, serializing it when train and predict run in separate JVM invocations.

In this concrete case the "state" is perfectly serializable, as it is simple data.
Nevertheless, we need to keep in mind that any serialisation has certain gaps, so the user needs to be careful with it.
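
That the min-max state is simple data can be checked with a quick round trip. The following is a minimal sketch, not part of the original report: it uses clojure.edn instead of nippy to stay dependency-free, copies the values from the state map shown earlier in this thread, and the var names are made up for illustration.

```clojure
(require '[clojure.edn :as edn])

;; State of the min-max step, copied from the manual ctx earlier in this thread.
(def min-max-state
  {:fit-minmax-xform
   {:min -0.5
    :max 0.5
    :column-data
    {"sepal_length" {:min 4.3 :max 7.9}
     "sepal_width"  {:min 2.2 :max 4.4}
     "petal_length" {:min 1.0 :max 6.9}
     "petal_width"  {:min 0.1 :max 2.5}}}})

;; Print to an EDN string and read it back, as a stand-in for writing to disk.
(def round-tripped (edn/read-string (pr-str min-max-state)))

;; True when nothing was lost in the round trip.
(def state-survives? (= min-max-state round-tripped))
```

One of the gaps mentioned above shows up as soon as the ctx holds something that is not plain data: the :model-as-bytes entry is presumably a byte array, which plain EDN printing does not cover, so a binary serializer such as nippy (as used in the original report) is the better fit for the whole ctx.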

behrica commented 1 year ago

It is probably correct to say that, even for a simple pipeline, setting up the state manually is too cumbersome, so serializing and deserializing the ctx is advisable when train and predict run in separate JVM sessions.

This is not directly documented, but the single-JVM examples do suggest it.

behrica commented 1 year ago

Two pieces of documentation should be added:

1.) If stateful transformers are used in a setup with multiple JVM sessions, each such step needs to specify a :metamorph/id, as this is required for successful serialisation of the ctx (which considerably eases using pipelines across JVMs).
2.) Any stateful metamorph transformer should have its state stored in the context as simple Clojure data structures (= EDN), as this eases potential serialisation.
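
Why the first point matters can be illustrated with a small model in plain Clojure. This is only a sketch under stated assumptions, not the metamorph implementation: `scale-step` is a hypothetical stand-in for a stateful step that keeps its fitted state in the ctx under its own id, and it assumes that a step without an explicit :metamorph/id gets a freshly generated id each time the pipeline is constructed (modelled here with random UUIDs).

```clojure
;; Hypothetical stand-in for a stateful pipeline step (not metamorph code):
;; in :fit it stores state in the ctx under `id`, in :transform it needs it back.
(defn scale-step [id]
  (fn [ctx]
    (case (:metamorph/mode ctx)
      :fit       (assoc ctx id {:min 4.3 :max 7.9})
      :transform (if (contains? ctx id)
                   ctx
                   (throw (ex-info "no fitted state found for step" {:id id}))))))

;; "Session 1": the step gets a generated id; the fitted ctx is then saved.
(def generated-id-1 (java.util.UUID/randomUUID))
(def fitted-ctx ((scale-step generated-id-1) {:metamorph/mode :fit}))

;; "Session 2": the pipeline is rebuilt, so the step gets a different id
;; and cannot find the state stored under the old one.
(def generated-id-2 (java.util.UUID/randomUUID))
(def lookup-fails?
  (try
    ((scale-step generated-id-2) (assoc fitted-ctx :metamorph/mode :transform))
    false
    (catch clojure.lang.ExceptionInfo _ true)))

;; With a fixed id the state is found again, however often the step is rebuilt.
(def fixed-ctx ((scale-step :min-max-scale) {:metamorph/mode :fit}))
(def fixed-id-works?
  (contains? ((scale-step :min-max-scale)
              (assoc fixed-ctx :metamorph/mode :transform))
             :min-max-scale))
```

In this model the only difference between the failing and the working case is whether the key under which the state is stored is stable across sessions, which is what a fixed :metamorph/id provides.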

hjrnunes commented 1 year ago

This is the iris dataset for the code above.

iris.csv

hjrnunes commented 1 year ago

So I think your 1) is what I was missing. I did not realise you could override the id of the operations. That is the root of my issue.

I agree that the documentation should make that clear.

Thank you.

behrica commented 1 year ago

It is documented in metamorph: https://github.com/scicloj/metamorph#context and here: https://scicloj.github.io/scicloj.ml-tutorials/userguide-advanced.html

What would be more useful is to add a chapter to one of the scicloj.ml tutorials on how to have train happen in one JVM session and predict in another JVM session.

behrica commented 1 year ago

issue created for better documentation