Closed behrica closed 1 year ago
The code above can be made to work (without serializing the ctx) as follows:
```clojure
(def trained-ctx (train))
(def bytes-of-model (-> trained-ctx :model :model-data :model-as-bytes))
;; write bytes-of-model to disk

(ml/pipeline
 {:metamorph/id :min-max-scale}
 (mm/min-max-scale numerical-cols {})
 (mm/categorical->number [target])
 (mm/set-inference-target target)
 {:metamorph/id :model}
 (mm/model {:model-type model-type}))
```
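The "write bytes-of-model to disk" step could be sketched like this (a minimal sketch, not from the original thread; `"model.bin"` is a hypothetical path, and `bytes-of-model` is assumed to be a Java byte array):

```clojure
(require '[clojure.java.io :as io])

;; write the raw model bytes to disk
(with-open [out (io/output-stream "model.bin")]
  (.write out ^bytes bytes-of-model))

;; later (possibly in another JVM): read them back
(def bytes-of-model
  (let [f   (io/file "model.bin")
        buf (byte-array (.length f))]
    (with-open [in (java.io.DataInputStream. (io/input-stream f))]
      (.readFully in buf))
    buf))
```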
Recreate the needed state before `:transform` in the new JVM:
```clojure
;; read bytes-of-model from disk
(def prediction
  (predict-pipeline
   {:metamorph/mode :transform
    :metamorph/data (tc/add-column train-data target [nil] :cycle)
    :min-max-scale
    {:fit-minmax-xform
     {:min -0.5,
      :max 0.5,
      :column-data
      {"sepal_length" {:min 4.3, :max 7.9},
       "sepal_width"  {:min 2.2, :max 4.4},
       "petal_length" {:min 1.0, :max 6.9},
       "petal_width"  {:min 0.1, :max 2.5}}}},
    :model {:model-data {:model-as-bytes bytes-of-model}
            :options {:model-type :smile.classification/logistic-regression},
            :feature-columns ["sepal_length" "sepal_width" "petal_length" "petal_width"],
            :target-columns ["species"],
            :target-categorical-maps
            {"species"
             {:lookup-table {"versicolor" 0, "setosa" 1, "virginica" 2},
              :src-column "species", :result-datatype :float64}}}}))
```
The alternative to this cumbersome manual setup of `:transform` is indeed to merge in the train-ctx, possibly by serializing it in the case of separate JVM invocations.

In this concrete case the "state" is perfectly serializable, as it is simple data.

Nevertheless, we need to keep in mind that any serialisation has certain gaps, so the user needs to be careful with it.

It is probably correct to say that even for a simple pipeline the manual setup of the state is too cumbersome, so serializing and deserializing the ctx is advisable when train and predict run in separate JVM sessions.

This is not directly documented, but the single-JVM examples do suggest it.
Two pieces of documentation should be added:

1. If stateful transformers are used in a multiple-JVM-sessions setup, each such step needs to specify a `:metamorph/id`, as this is required for successful serialisation of the ctx (which considerably eases the across-JVM usage of pipelines).
2. Any stateful metamorph transformer should have its state stored in the context as simple Clojure data structures (= EDN), as this eases potential serialisation.
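Point 2 can be illustrated with a small sketch (not from the original thread): state that is plain Clojure data round-trips through `pr-str` and `clojure.edn/read-string` without any extra machinery. The map below is a trimmed-down version of the `:fit-minmax-xform` state from the example above.

```clojure
(require '[clojure.edn :as edn])

(def min-max-state
  {:fit-minmax-xform
   {:min -0.5
    :max 0.5
    :column-data {"sepal_length" {:min 4.3, :max 7.9}}}})

;; simple data (maps, strings, numbers, keywords) reads back unchanged
(= min-max-state
   (edn/read-string (pr-str min-max-state)))
;; => true
```

State containing raw byte arrays or Java objects (such as `:model-as-bytes`) would not survive an EDN round-trip, which is exactly why keeping transformer state as simple data matters.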
So I think your 1) is what I was missing. I did not realise you could override the id of the operations. That is the root of my issue.
I agree that the documentation should make that clear.
Thank you.
It is documented in metamorph:
https://github.com/scicloj/metamorph#context
and here: https://scicloj.github.io/scicloj.ml-tutorials/userguide-advanced.html
What is more useful is to add a chapter to one of the scicloj.ml tutorials about how to have train
happen in one JVM session and predict in another JVM session.
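Such a chapter could sketch the workflow roughly like this (a hedged sketch, not an agreed-upon API: it assumes the trained ctx is serialized with `taoensso.nippy`'s `freeze-to-file`/`thaw-from-file`, and `pipeline`, `train-data`, `new-data`, and `"ctx.nippy"` are hypothetical names):

```clojure
;; JVM session 1: train, then persist the whole ctx
(require '[taoensso.nippy :as nippy])

(def trained-ctx
  (pipeline {:metamorph/data train-data
             :metamorph/mode :fit}))

(nippy/freeze-to-file "ctx.nippy" trained-ctx)

;; JVM session 2: restore the ctx and predict
(def trained-ctx (nippy/thaw-from-file "ctx.nippy"))

(def prediction
  (pipeline (merge trained-ctx
                   {:metamorph/data new-data
                    :metamorph/mode :transform})))
```

Merging the restored ctx with the new `:metamorph/data` and `:metamorph/mode` is what replaces the manual per-step state setup shown earlier.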
Issue created for better documentation.
The following pipeline, containing a `mm/min-max-scale` step, cannot be made to work with train and predict in two different JVM runs.

;; THEN RESTART REPL BEFORE RUNNING THE NEXT FORM