openml / discussionboard

A repository that hosts the Github Discussions for the OpenML organization.

How to serialize models #11

Closed joaquinvanschoren closed 1 year ago

joaquinvanschoren commented 8 years ago

More of a developer-to-developer question: we are working on exporting scikit-learn runs, but we are unsure about the best way to share learned models. At first sight, creating a pickle seems the best and most general way to go. Matthias confirms that this works with scikit-learn SVMs, even though the files can get large for large datasets.

However, scikit-learn recommends using joblib because it is more efficient: http://scikit-learn.org/stable/modules/model_persistence.html

The problem is that joblib creates a bunch of files in a folder. This is much harder to share, and sending many files to the OpenML server for every single run seems unwieldy and error-prone.

Would creating a single pickle file still be the best way forward, or is there a better solution?

zardaloop commented 8 years ago

I guess in joblib you can use the compress option to produce a single file: https://pythonhosted.org/joblib/persistence.html. Would that answer your question?

zardaloop commented 8 years ago

However, reading this article (https://pythonhosted.org/joblib/generated/joblib.dump.html), I don't think the parameter is a boolean; instead it is an integer between 0 and 9.
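
For illustration, a minimal sketch of what this would look like (assuming a standard scikit-learn/joblib installation; the model and file name are just placeholders):

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier().fit(X, y)

# compress takes an integer from 0 (no compression) to 9 (maximum);
# with compression enabled, joblib writes a single file instead of one
# file per large numpy array.
joblib.dump(model, "model.joblib", compress=3)

restored = joblib.load("model.joblib")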

joaquinvanschoren commented 8 years ago

Ah, that looks really useful.

I did notice that joblib pickles are not supported across Python versions. Does that mean that if someone built a scikit-learn model with Python 2, it cannot be loaded by someone running Python 3? Should we worry about that, or can it be easily solved?

zardaloop commented 8 years ago

Where did you read that?

joaquinvanschoren commented 8 years ago

@zardaloop On the bottom of the link you posted :) https://pythonhosted.org/joblib/persistence.html

zardaloop commented 8 years ago

Well, I guess you really need to rethink this, because joblib is only meant for local storage and nothing more. Even scikit-learn, in order to rebuild a model with a future version, needs additional metadata stored along with the pickled model, containing:

- the training data, e.g. a reference to an immutable snapshot
- the Python source code used to generate the model
- the versions of scikit-learn and its dependencies
- the cross-validation score obtained on the training data

http://scikit-learn.org/stable/modules/model_persistence.html

zardaloop commented 8 years ago

Therefore, as Matthias recommended, I also think pickle is your best bet. But you need to make sure to include the metadata along with the pickled model so it can work with future versions of scikit-learn 😊
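
As a rough sketch of what such a bundle could look like (the helper name and metadata keys here are illustrative, not an OpenML or scikit-learn API):

import json
import pickle
import sys

import sklearn

def save_model_with_metadata(model, path, dataset_reference, source_file, cv_score):
    # Store the pickled model itself.
    with open(path, "wb") as f:
        pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)
    # Store the metadata recommended by the scikit-learn persistence docs.
    metadata = {
        "training_data": dataset_reference,   # e.g. a reference to an immutable snapshot
        "source_code": source_file,           # the script used to generate the model
        "python_version": sys.version,
        "sklearn_version": sklearn.__version__,
        "cv_score": cv_score,
    }
    with open(path + ".meta.json", "w") as f:
        json.dump(metadata, f, indent=2)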

mfeurer commented 8 years ago

I'm not sure if it's possible to easily read pickles created with Python 2 in Python 3 and vice versa. Given that Python 2 is used less and less, one might consider not supporting it at all.

Besides that, @zardaloop has a valid point that storing sklearn models is not that easy and I don't think sklearn has a common way to solve this issue except storing all metadata as @zardaloop suggested. We should have a look at this in the new year.

amueller commented 8 years ago

I think joblib will do single-file exports soon. Maybe for the moment pickle is enough. Be sure to use the latest protocol of pickle, because the default results in much larger files (at least in python2, not sure about python3).
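
As a minimal sketch of that protocol point (any picklable estimator works here; the SVM is just an example):

import pickle
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
model = SVC().fit(X, y)

# The default protocol (0 on Python 2) is a verbose, backwards-compatible
# text format; the highest available protocol is a compact binary format.
default_bytes = pickle.dumps(model)
compact_bytes = pickle.dumps(model, protocol=pickle.HIGHEST_PROTOCOL)
print(len(default_bytes), len(compact_bytes))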

Both joblib and pickle have the issue that they serialize a class, without the corresponding class definition. So it is only guaranteed that a model will work and give the same result when using the exact same code it was created with. We try to keep conflicts in loading to a minimum, but the trees frequently change their internal representation.

To make sure a result is entirely reproducible, the "easiest" way is to use docker containers or similar virtual environments (conda envs might be enough) with the exact same version of everything.

What is your exact use case? A big question is whether you want the model to "work" or want the exact same results. Changing the numpy or scipy version, or changing the BLAS, might give different results. So if you want the exact same results, that's hard to achieve without some very controlled environment.

If you want to load a model that "works", having the same scikit-learn version is sufficient.

Even if the learning of a model, and therefore the serialization didn't change between versions, it could be that a bug in the prediction code was fixed. So even if you can load a model from an older version, it is not ensured that you get similar predictions.

Hope that helps. This is a tricky issue. Feel free to ping me on these discussions, I don't generally follow the tracker atm, but I'm happy to give input.

joaquinvanschoren commented 8 years ago

Thanks, Andreas, for your valuable input. When it comes down to sharing the model itself, it is sufficient that it just works (i.e. it will give the same predictions for the same instances). It seems, then, that storing the exact scikit-learn version in the run, together with the pickle, is the most workable solution.

The reproducibility discussion is equally important though, and we should look into this when sharing flows. We are currently thinking of storing just the python script that creates the model given a task, with meta-information such as the scikit-learn version, but a docker container would be a better solution (and we are exploring the same thing for R right now). We could generate those for each major scikit-learn version? Do you have experience with this in the scikit-learn team?

mikecroucher commented 8 years ago

The best I can do at the moment is to offer advice on what not to do. Don't use pickle!

Here's a summary as to why

http://eev.ee/blog/2015/10/15/dont-use-pickle-use-camel/

I'm not sure what one should use instead though...still trying to figure that out myself.

joaquinvanschoren commented 8 years ago

Interesting as that blog post is, do we really have an alternative right now? A library like scikit-learn could likely come up with something better, but expecting this from everyone running ML experiments in Python seems a tall order.

Incidentally, what causes pickles to break? Will they still break if one also provides a docker container with an environment in which they work?

Practically speaking, for the experiments that I want to run now, is it ok to use pickle until something better comes along?

amueller commented 8 years ago

A library like scikit-learn could likely come up with something better

If you think that, you overestimate our resources by a lot. We haven't been able to provide better backward compatibility, even with pickle.

When it comes down to sharing the model itself, it is sufficient that it just works (will be able to give the same predictions given the same instances)

Well, "gives the same predictions given the same instances" can really only be guaranteed with a full container (because of BLAS issues etc.). If your system is reasonably static, storing the scikit-learn version will work as an intermediate solution. But once your hosting provider upgrades their distribution, you might be in trouble. A conda environment is reasonably safe, I think.

We haven't done docker containers for reproducibility. We use travis and circleci and appveyor for continuous integration. But we don't really have a need to create highly reproducible environments.

amueller commented 8 years ago

I think (pickle or joblib or dill) + conda is the best solution for now, with (pickle or joblib or dill) + conda + docker as the optimal upgrade.

drj11 commented 8 years ago

@mikecroucher asked me to comment. I'm a Python old-hand, but know nothing of scikit-learn, so what I have to say is slanted more towards generic Python advice.

To be able to answer a question like "is pickle adequate" we have to be able to pin down some requirements. For example, is it required that:

- the serialized model can simply be saved and reloaded (basic persistence);
- it can be loaded safely, even if it comes from an untrusted source;
- it can still be loaded by future versions of the software;
- it can be loaded by earlier versions of the software?

I would guess that various people would want all of these in some combination, so the real issue is how much you want to pay (in money, time, and tears) for each of these things.

Additionally, there are various semantic issues. For example: I might be able to load the model but get different predictions, where the predictions differ only in ways that are unimportant (for example, by a few ULP). @amueller seems to be aware of these.

With that in mind, pickle is terrible for all of those requirements except basic persistence. Loading a pickle runs arbitrary code, so you should never download and open a pickle. Pickles are extremely brittle (many reasons, but for example, they refer to classes by their module location, so if you reorganise your files for an internal class, everything breaks), so are next to useless for providing forwards or backwards compatibility.
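
To make the "arbitrary code" point concrete, here is a deliberately harmless sketch (Python 3) of how unpickling executes code chosen by whoever created the pickle:

import pickle

class Demo(object):
    # __reduce__ tells pickle how to rebuild the object: it returns a
    # callable and its arguments, and the unpickler simply calls it.
    # Here that callable is print; a malicious pickle could just as
    # easily return os.system and a shell command.
    def __reduce__(self):
        return (print, ("this ran during unpickling",))

payload = pickle.dumps(Demo())
pickle.loads(payload)  # prints the message: code runs at load time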

zardaloop commented 8 years ago

@amueller and @drj11 Many thanks for the great input on this issue. So I guess dill + conda seems to be the best option available, and I personally really like that approach if I am understanding it correctly. Andreas, just to be clear about what you are suggesting here regarding dill + conda: do you mean serialising the scikit-learn result object into a file using dill, and then building a conda package that includes the scikit-learn metadata as well as the serialised file and any other files needed to rebuild the model?

mfeurer commented 8 years ago

Conda seems to be a good idea to persist an environment. I'm not sure about Dill though. From the github website it seems like there is only a single developer/maintainer. We should keep that in mind if we want to base the python interface on that package.

joaquinvanschoren commented 8 years ago

I think @amueller meant (pickle or joblib or dill) + conda, so instead of dill, pickle or joblib could also be used. I think that they all have the same problem that @drj11 mentions, though? Does joblib also execute arbitrary code?

@drj11, do you think that Conda mitigates the other problems that you mentioned (about software versions)?

amueller commented 8 years ago

yes, dill and joblib also execute arbitrary code. Though I don't think that there are security concerns here, as we/you will be creating the pickles, right? People won't be able to upload their own, right?

joblib and dill build on pickle, btw.

And by conda I mean: create a conda virtual environment, build a model with scikit-learn, store the model, and also store the complete conda config (all versions, which are binary versions!). Then, you can recreate the exact same conda environment later using the conda config file, and load the model (using pickle or joblib or dill).
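
A rough sketch of that workflow, assuming conda is on the PATH (the file names are placeholders):

import pickle
import subprocess

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier().fit(X, y)

# 1. Store the fitted model.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

# 2. Store the exact environment (package names and binary versions).
with open("environment.yml", "wb") as f:
    f.write(subprocess.check_output(["conda", "env", "export"]))

# Later: recreate the environment with `conda env create -f environment.yml`,
# activate it, and unpickle model.pkl there.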

amueller commented 8 years ago

also thanks @drj11. Scikit-learn doesn't support anything but basic persistence currently.

The issue is that "version of the software" is a combination of scikit-learn, atlas, numpy, scipy, python, the bit-ness and the operating system. And it is hard to say which changes in the non-scikit-learn parts will lead only to ULP issues vs qualitative differences. Numeric computations, wohoo!

Using conda gives you at least fixed binaries for the libraries, and if we only share between OpenML servers, the OS will be pretty fixed, too.

joaquinvanschoren commented 8 years ago

Thanks @amueller. Note that OpenML allows you to run your algorithms locally (or using any remote hardware you like), and then submit your results through the API. Otherwise it would not scale. Hence the OS can differ for different users. Does this complicate things for conda?

The pickles/joblibs/dills would indeed be created by the openml module (code that we provide and that does the interfacing with the OpenML API). In theory you could overwrite the module and, in a contrived way, link bona fide predictions to malicious code (in clear violation of the terms of use). To check for that, we could test the models on the server, e.g. in a sandboxed environment. However, I don't think that this kind of attack is very likely, as OpenML is a collaboration tool: I will typically only reuse models of people that I am collaborating with, or that I trust as researchers in good standing.

I like the pickle/joblib/dill + conda approach, and it is likely the best thing to do right now. Some other ML libs have their own model format, e.g. Caffe (http://caffe.berkeleyvision.org/model_zoo.html), which is safer, but as a general approach I think it will work fine.

drj11 commented 8 years ago

Just FTR since I was asked: I don't know enough about conda to have a reliable opinion, but if it can be used to record all versions of all software in use (as @amueller suggests), then that's a good start.

amueller commented 8 years ago

@joaquinvanschoren Ok, if people can submit their models, then you would need them to use conda and submit their conda environment config with the model. That is not terribly hard and probably the most feasible way.

There might still be minor differences due to the OS, but the only way to avoid those is to have every user work in a virtual machine (or docker container) and provide that virtual machine along with the model. That is way more complicated, and probably not worth the effort.

@drj11 conda is basically a cross-platform package manager that ships binaries (unlike pip), mostly for python and related scientific software.

amueller commented 8 years ago

btw, you might be interested in reprozip and hyperos, which are two approaches to creating reproducible environments (but they are kinda alpha-stage, iirc). Conda or docker seem the better choices for now. One downside of conda is that it does not necessarily capture all dependencies.

If someone wrote a custom transformer (which probably most interesting models have), there is some piece of code that is not part of a standard package. So in addition to the environment config you get from conda and the state you get from pickle, you also need access to the source of the custom part.

asmeurer commented 8 years ago

@zardaloop has asked me to comment here. I am not very familiar with the situation so my comment will be generic. I don't have much experience with serialization, so I can't comment on that. As for creating a conda package, I can tell you that it is a good fit if the packaged files are read-only and can be installed to a location in the library prefix (the conda environment). If this is not the case, then conda packages are not a good fit.

joaquinvanschoren commented 6 years ago

It would be great to rekindle this discussion, because it looks like it was converging towards a good solution, and storing models in OpenML would be very useful.

Would a conda + joblib/dill/pickle approach work? Even if it only covers a large percentage of the use cases, it would make many people happy :) @amueller What do you think of reprozip and hyperos two years later?

mfeurer commented 6 years ago

Another thing I'd like to mention: security. Pickle is a bit insecure and I am very hesitant to put a solution based on pickle in the python package. See here.

rizplate commented 6 years ago

+1

janvanrijn commented 6 years ago

and storing models in OpenML would be very useful.

Is there any (scientific or practical) use case in which storing models becomes relevant? The only thing that I can think of is that when a new test set becomes available, the model can be re-evaluated on it. However, this unfortunately rarely happens.

joaquinvanschoren commented 6 years ago

I agree it is challenging, but I would really love to track the models I'm building. Maybe not during a large-scale benchmark, but there are plenty of other cases where I either want to look at the models to better understand what they are doing or share them so that other people may learn from them and reuse them.

joaquinvanschoren commented 6 years ago

I recently talked to Matei (MLflow). They use a simple format which is just a file containing the model (could be a pickle) and some meta-data on how to read it in.

It is probably best to leave this to the user. The Python API should just retrieve the file and the meta-data telling the user what to do with it. Reading in models will probably only be done occasionally.
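
A sketch of that idea (the helper and all field names here are illustrative, not an MLflow or OpenML format):

import json
import pickle

import sklearn

def store_run_model(model, run_id):
    # Serialize the model in whatever format the user chose (pickle here).
    model_file = "run_%s_model.pkl" % run_id
    with open(model_file, "wb") as f:
        pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)
    # A small descriptor telling a later user how to read the file back in.
    descriptor = {
        "model_file": model_file,
        "serialization": "pickle",  # could also be joblib, dill, onnx, ...
        "sklearn_version": sklearn.__version__,
    }
    with open("run_%s_model.json" % run_id, "w") as f:
        json.dump(descriptor, f, indent=2)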

mfeurer commented 6 years ago

One more thing to keep in mind is file size. Running the default scikit-learn random forest on the popular EEG-Eye-State dataset (1471) results in a 7.5 MB pickle:

In [1]: %paste
import openml
import sklearn.ensemble
import pickle

# Download the EEG-Eye-State dataset (OpenML dataset id 1471) and get its
# features and default target.
data = openml.datasets.get_dataset(1471)
X, y = data.get_data(target=data.default_target_attribute)

# Fit the default random forest and measure the size of its pickle in MB.
rf = sklearn.ensemble.RandomForestClassifier()
rf.fit(X, y)
string = pickle.dumps(rf)
len(string) / 1024. / 1024.

Out[1]: 7.461672782897949

The most popular task on that dataset has ~85k runs; assuming that only 1 percent of those are random forests, storing them would require at least 6.3 GB. If you increased the forest size from the default of 10 trees to something reasonable, this space requirement would grow drastically.

rquintino commented 6 years ago

Hi everyone! I have been thinking a lot about this issue these past days, mostly in relation to operationalization, pipeline reuse (e.g. evaluation), retraining, and complete reproducibility. I remembered from Joaquin that this was a hot question for OpenML, and this thread was a great read/help!

I'm perfectly aware of the security implications and the overall versioning issues of loaded resources, but even so, pipelines really solve so many of the issues that were bothering me (if only they could be slightly easier to work with :) ).

Adding one more problem, as mentioned above by @amueller: custom transformers. If we have to track the actual code for these, it's hard to see how this could be properly operationalized (and it would be very error-prone).

I did some tests with cloudpickle (dill will probably do something similar?), and it seems to persist everything that is needed. There is no need to save/track any custom transformer code, and I can load multiple pipelines with no problem. Everything is really straightforward: save the pipeline, then (in a new kernel) load, predict, refit; it just works. Huge flexibility, e.g. evaluating new refits.
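
A minimal sketch of that behaviour (the toy transformer below is just for illustration): cloudpickle serializes the class definition itself, so a fresh kernel can load the pipeline without having the original source file.

import cloudpickle
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

class AddOne(BaseEstimator, TransformerMixin):
    # Toy custom transformer defined only in this session.
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X + 1

X = np.random.rand(20, 3)
y = np.random.randint(0, 2, 20)
pipe = make_pipeline(AddOne(), LogisticRegression()).fit(X, y)

# cloudpickle stores AddOne's code along with the fitted state, so these
# bytes can be loaded in a new process that never imported this file.
blob = cloudpickle.dumps(pipe)
restored = cloudpickle.loads(blob)
print(restored.predict(X[:5]))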

I also did some experiments on mixing a sequential preparation flow with a fit/transform-compatible interface (sample below, or you can test it in Binder here: https://mybinder.org/v2/gh/DevScope/ai-lab/master?filepath=notebooks%2Fdeconstructing%20-pipelines ).

(Seems too good to be true... what do you think? PS: does anyone know whether the actual code is "recoverable" from the saved cloudpickle?) Thanks!

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FitState():
    # Simple container for the state learned during fit().
    def __init__(self):
        pass

class PrepPipeline(BaseEstimator, TransformerMixin):

    def __init__(self,impute_age=True,impute_cabin=True,
                 add_missing_indicators=True,
                 train_filter="",copy=True,notes=None):
        self.impute_age=impute_age
        self.notes=notes
        self.copy=copy
        self.train_filter=train_filter
        self.impute_cabin=impute_cabin
        self.add_missing_indicators=add_missing_indicators

    def fit(self, X, y=None):
        print("Fitting...")
        self.fit_state=FitState()
        self.prepare(X=X,y=y,fit=True)
        return self

    def transform(self, X,y=None):
        assert isinstance(X, pd.DataFrame)
        print("Transforming...")
        return self.prepare(X=X,y=y,fit=False)

    def show_params(self):
        print("fit_state",vars(self.fit_state))
        print("params",self.get_params())

    # Experiment: reduce class overhead, bring related fit & transform steps closer together, no models without pipelines
    def prepare(self,X,y=None,fit=False):
        print(f"Notes: {self.notes}")

        fit_state=self.fit_state
        if (self.copy):
            X=X.copy()

        # Fit only steps, ex: filtering, drop cols 
        if (fit):
            # Probably a very bad idea... thinking on it...
            if self.train_filter:
                X.query(self.train_filter,inplace=True)

        if (self.add_missing_indicators):
            if fit:
                fit_state.cols_with_nas=X.columns[X.isna().any()].tolist()
            X=pd.concat([X,X[fit_state.cols_with_nas].isnull().astype(int).add_suffix('_missing')],axis=1)

        # A typical titanic prep step (grabbed few ones from kaggle kernels)   
        if (self.impute_age):
            if fit:
                fit_state.impute_age=X.Age.median()

            X.Age.fillna(fit_state.impute_age,inplace=True)

        # Another one
        if (self.impute_cabin):
            if fit:
                fit_state.impute_cabin=X.Cabin.mode()[0]
            X.Cabin.fillna(fit_state.impute_cabin,inplace=True)

        return X

# df_full is assumed to be the Titanic training dataframe, loaded elsewhere.
prep_pipeline=PrepPipeline(impute_age=True,impute_cabin=True, copy=False,train_filter="Sex=='female'",notes="test1")
X=prep_pipeline.fit_transform(df_full.copy())
prep_pipeline.show_params()
print(X.info())

rquintino commented 6 years ago

A similar concept, using dill: https://www.analyticsvidhya.com/blog/2017/09/machine-learning-models-as-apis-using-flask/

rquintino commented 6 years ago

PS: like mentioned above, the size and number of runs will probably be a challenge for OpenML. Nevertheless, it is really interesting that, when saving the full pipelines (the complete flow with all preparation steps and the model), we can refit/predict with new train/test folds at any time, e.g. to refresh a leaderboard.

Note that if the pipeline was really a grid-search fit, then refitting would be rather expensive. :)

rth commented 6 years ago

For serialization, the ONNX format might also be relevant (cf. https://github.com/onnx/onnxmltools).
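
Roughly, a conversion looks like this (a sketch using the skl2onnx converter, which is closely related to onnxmltools; exact import paths depend on the version):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier().fit(X, y)

# Declare the input signature (a float tensor with 4 features), convert,
# and write the resulting ONNX protobuf to a single portable file.
onnx_model = convert_sklearn(model, initial_types=[("input", FloatTensorType([None, 4]))])
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())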

PGijsbers commented 1 year ago

@mfeurer I suggest we archive this in the broader OpenML discussion board

mfeurer commented 1 year ago

I fully agree on this, will you do so?

PGijsbers commented 1 year ago

Please leave any further comments on this issue in the related OpenML Discussion thread.