kleveross / ormb

Docker for Your ML/DL Models Based on OCI Artifacts
Apache License 2.0

Support other stateful ML artifacts like transformers #179

Closed · gbolmier closed 3 years ago

gbolmier commented 3 years ago

/kind feature

What happened:

ML models often require stateful transformers to preprocess data for them (e.g. a standard scaler). Unfortunately, this kind of artifact isn't supported at the moment.

Also, some ML frameworks aren't supported yet, especially frameworks that don't use a framework-specific serialization format but instead rely on e.g. the pickle protocol.
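For concreteness, a minimal sketch of the kind of serialization I mean (scikit-learn here is just an example; any picklable model works the same way):

```python
# Minimal sketch: frameworks without a dedicated save format just pickle
# the fitted object, state and all.
import pickle
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])

with open("model.pkl", "wb") as f:
    pickle.dump(model, f)  # generic pickle protocol, no framework-specific format

with open("model.pkl", "rb") as f:
    restored = pickle.load(f)  # fitted coefficients round-trip intact
```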

I'm not familiar with OCI or registry internals; what's the process and the effort involved in adding support for new frameworks or new serialization formats?

What you expected to happen:

Extended support to broader kinds of ML artifacts.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

gaocegege commented 3 years ago

Hi @gbolmier

> ML models often require stateful transformers to preprocess data for them (e.g. a standard scaler). Unfortunately, this kind of artifact isn't supported at the moment.

Do you mean models with a Transformer architecture, or transformation functions that process the data?

> Also, some ML frameworks aren't supported yet, especially frameworks that don't use a framework-specific serialization format but instead rely on e.g. the pickle protocol.

The format is defined in https://github.com/kleveross/ormb/blob/master/pkg/model/format.go. You could add a new format there, e.g. `pickle`.

And contributions are welcome!

gbolmier commented 3 years ago

Hi @gaocegege, thanks a lot for the prompt answer.

> Do you mean models with a Transformer architecture, or transformation functions that process the data?

I'm referring to the second (e.g. standard scaler, PCA, TF-IDF vectorizer). These transformers are closely tied to the model: like models, they often have hyperparameters that impact the model's performance, and a state that is updated while processing the training data. The model's performance on unseen data depends on the transformers used during the training phase, which is why stateful transformers are persisted: so that unseen data is processed in exactly the same way as the training data.
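A minimal sketch of what I mean (scikit-learn and joblib, with made-up numbers):

```python
# The scaler learns its state (mean_, scale_) from the training data; that
# exact state must be reused on unseen data, so it is persisted like a model.
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
scaler = StandardScaler().fit(X_train)      # state learned from training data
joblib.dump(scaler, "transformer_a.joblib")

# At inference time: load the fitted scaler, never refit it on new data.
scaler = joblib.load("transformer_a.joblib")
X_new = np.array([[4.0]])
print(scaler.transform(X_new))              # scaled with the training statistics
```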

> The format is defined in https://github.com/kleveross/ormb/blob/master/pkg/model/format.go. You could add a new format there, e.g. `pickle`.

> And contributions are welcome!

Thanks a lot for the pointer; this looks pretty straightforward.

Follow-up question: say I want to share and publish some transformers tied to my ML model, do I have to create a similar tree structure for each transformer alongside the model's, like this?

```
$ tree .
.
├── sklearn_model
│   ├── model
│   │   └── sklearn_model.joblib
│   └── ormbfile.yaml
├── sklearn_transformer_a
│   ├── model
│   │   └── transformer_a.joblib
│   └── ormbfile.yaml
└── sklearn_transformer_b
    ├── model
    │   └── transformer_b.joblib
    └── ormbfile.yaml

6 directories, 6 files
```
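For illustration, a small script that would lay the artifacts out this way (the estimators are placeholders; the paths mirror the tree above):

```python
# Illustrative only: each artifact gets its own ormb-style directory, and
# each directory still needs its own hand-written ormbfile.yaml.
import os
import joblib
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

artifacts = {
    "sklearn_model/model/sklearn_model.joblib": LogisticRegression(),
    "sklearn_transformer_a/model/transformer_a.joblib": StandardScaler(),
    "sklearn_transformer_b/model/transformer_b.joblib": PCA(),
}

for path, estimator in artifacts.items():
    os.makedirs(os.path.dirname(path), exist_ok=True)
    joblib.dump(estimator, path)
```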

If that's the case, could we make it more convenient in practice?

gaocegege commented 3 years ago

> If that's the case, could we make it more convenient in practice?

What structure would you prefer? As you know, OCI supports layer-based storage like Docker images; maybe we could discuss it further.

gbolmier commented 3 years ago

Actually, it's not really the structure that's inconvenient; it's more about writing the ormbfile.yaml artifact config file for each artifact. I opened a separate issue (kleveross/ormb#180) to discuss this further. I'm closing this one, as nothing prevents users from publishing other ML artifacts like transformers.