probabl-ai / skore

Skore lets you "Own Your Data Science." It provides a user-friendly interface to track and visualize your modeling results, and perform evaluation of your machine learning models with scikit-learn.
https://probabl-ai.github.io/skore/
MIT License

Headaches caused by multiple environments #27

Closed by koaning 2 months ago

koaning commented 4 months ago

I have thus far assumed that mandr will only need to worry about a single Python environment. But maybe that is not realistic in the long run. Even a single user may keep their Python version up to date, which means we would be dealing with updated packages and Python versions. And once multiple teams are involved, we are for sure going to have to deal with this.

Since we are dealing with pickles and multiple environments a bunch ... I am wondering if there is something we can/should do to minimise headaches.

adrinjalali commented 4 months ago

In skops I ended up having to store the versions of the dependencies used to produce the model objects. Here, I think that translates to storing those versions alongside each mandr.
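A minimal sketch of what "storing versions alongside each object" could look like, using only the standard library. The function names `capture_versions` and `find_mismatches` and the record layout are hypothetical, chosen for illustration; they are not the skops or mandr API.

```python
import json
import sys
from importlib import metadata


def capture_versions(packages):
    """Return a JSON-serializable record of the Python and package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "packages": versions,
    }


def find_mismatches(recorded, current):
    """List (name, recorded_version, current_version) for every difference."""
    mismatches = []
    if recorded["python"] != current["python"]:
        mismatches.append(("python", recorded["python"], current["python"]))
    for name, version in recorded["packages"].items():
        now = current["packages"].get(name)
        if now != version:
            mismatches.append((name, version, now))
    return mismatches


# Hypothetical usage: capture the environment when an object is stored,
# and keep the JSON next to it so a later reader can compare.
record = capture_versions(["scikit-learn", "numpy"])
sidecar = json.dumps(record)
```

At load time, the stored record can be compared with `capture_versions(...)` from the current environment, and a non-empty result from `find_mismatches` can be surfaced as a warning before any Python artifact is loaded.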

And I guess this is only an issue for the "visualizing things from mandr objects" part, rather than the "storing stuff" part, and only if we're dealing with Python artifacts rather than stored information / logs. In that case we might need to have the webapp / backend create / launch a new environment for each mandr.

I wouldn't mind letting this be more of an advanced feature for now and assuming same / compatible versions to start with.

koaning commented 4 months ago

Right, so kind of like assuming that all views are static at some point and won't change anymore?

tuscland commented 4 months ago

Am I right to understand this has to do with environment reproducibility? In the future, I see a nice role played by mandr to capture the environment, ensuring it stays consistent over time.

adrinjalali commented 4 months ago

> Right, so kind of like assuming that all views are static at some point and won't change anymore?

As long as we don't do realtime interactive dashboard-y stuff, this assumption can be true.

> In the future, I see a nice role played by mandr to capture the environment, ensuring it stays consistent over time.

To some extent, but pixi already does a very good job of that, so we shouldn't be recreating tools for it.

tuscland commented 4 months ago

Reading again from the start, to me there are two interesting issues:

1. How to deal with environment reproducibility? Adrin suggests it is out of our scope.

2. How to introduce the notion of environment? To me, this extends into MLOps. For the time being, on the storage side, maybe we can just let users deal with environments at the path level? Something like probabl-ai/dev/my-experiment, or probabl-ai/prod/my-experiment?
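To make the path-level idea concrete, here is a small sketch that reads the environment out of such a path. The three-part `org/environment/experiment` layout is an assumption based on the two example paths, and `split_project_path` is a hypothetical helper, not an existing mandr function.

```python
from pathlib import PurePosixPath


def split_project_path(path):
    """Split 'org/environment/experiment' into its three named parts."""
    parts = PurePosixPath(path).parts
    if len(parts) != 3:
        raise ValueError(f"expected 'org/environment/experiment', got {path!r}")
    org, environment, experiment = parts
    return {"org": org, "environment": environment, "experiment": experiment}


dev = split_project_path("probabl-ai/dev/my-experiment")
prod = split_project_path("probabl-ai/prod/my-experiment")
```

With this convention the storage layer stays unaware of environments: `dev` and `prod` are simply two different keys that happen to share an experiment name.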

> Right, so kind of like assuming that all views are static at some point and won't change anymore?

Views can be dynamic as long as the underlying data does not require executing a user program (as in loading a pickled model). It would be sad if the user could not change the layout of their dashboard, just because we decide the views are computed once and for all ;)

This is why we should be careful to separate things that are data, which can be visualized directly (any serializable data structure containing scalar values), from things that are code (incl. pickles), which need to be computed in a reproduced environment.
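As a rough illustration of that split, the sketch below classifies a value as "data" when it is built only from scalars, lists/tuples, and string-keyed dicts, and treats everything else as "code". The function `is_plain_data` and this exact definition of "data" are assumptions made for the example, not something mandr implements.

```python
def is_plain_data(value):
    """True if value is made only of scalars, lists/tuples and str-keyed dicts.

    Note: this simple recursion does not guard against cyclic structures.
    """
    if value is None or isinstance(value, (bool, int, float, str)):
        return True
    if isinstance(value, (list, tuple)):
        return all(is_plain_data(item) for item in value)
    if isinstance(value, dict):
        return all(
            isinstance(key, str) and is_plain_data(item)
            for key, item in value.items()
        )
    # Arbitrary objects (models, callables, ...) would need the environment
    # that produced them, so they are treated as "code".
    return False


class FakeModel:
    """Stand-in for a fitted model object."""


metrics = {"accuracy": 0.93, "folds": [1, 2, 3], "notes": None}
mixed = {"accuracy": 0.93, "model": FakeModel()}
```

Here `is_plain_data(metrics)` is true, so a view over it can be re-laid-out freely, while `is_plain_data(mixed)` is false because rendering it would mean touching a Python object.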

tuscland commented 2 months ago

Revisiting this issue in light of recent developments.

Since #303, we no longer persist pickles, or at least, it has become an implementation detail. Items stored in a project can only be of types that are independent of external libraries, with the exception of scikit-learn models that are serialized using skops.io.

For environment reproducibility, as Adrin said, an integration with pixi or an environment management tool would be interesting.