mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
https://mljar.com
MIT License

Add option to finalize/serialize model to compact single file #405

Open oldrichsmejkal opened 3 years ago

oldrichsmejkal commented 3 years ago

Make the model ready for production-like usage (a single compact file).

It could be something simple, like filtering only the needed files (the serialized sub-models) and zipping them. The model could then be loaded by unzipping the sub-models in memory, etc.

pplonski commented 3 years ago

@oldrichsmejkal very good idea! I like it.

How would you like to see it implemented? Maybe add a method to the AutoML class:

def zip_for_production(self, path_where_to_zip):
  # zipping code ...
  # ...

@oldrichsmejkal would you like to work on this?
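For illustration, a rough sketch of what such a method could do, assuming the results directory is available on the AutoML instance (the _results_path attribute and the packing rules are assumptions here, not a settled API) and using Python's shutil:

import shutil

def zip_for_production(self, path_where_to_zip):
    # pack the whole results directory (sub-model files, framework.json,
    # params.json, ...) into a single zip archive
    shutil.make_archive(path_where_to_zip, "zip", self._results_path)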

totalhack commented 2 years ago

Hi there, I'm just starting to test out mljar and I'm running into a similar requirement.

From what I've read I gather that mljar saves each model in its native format. Is there an official API for getting a reference to the best model instance right after training? I've poked through the docs and available methods/attributes of a trained AutoML instance but may have missed it.

I use MLflow for model tracking and deployment, so my ideal workflow might look something like:

  1. Use mljar to work its magic and find the best library/model to use
  2. Get a reference to the winning model alone (don't want to save/deploy anything extra)
  3. Assuming there is a built-in mlflow flavor for that model type, log the model to mlflow so I can easily distribute it later.

Thanks for your work on this library!

EDIT: I see that you can get the name and library of the top model with get_leaderboard which returns a DataFrame:

(Pdb) estimator.get_leaderboard()
                name model_type metric_type  metric_value  train_time
0  3_Default_Xgboost    Xgboost     logloss      0.556054        21.6

But I'm not seeing an officially supported way to get a reference to that underlying estimator. It looks like it can be found in an unofficial way like this:

(Pdb) estimator._best_model.learners[0].model
<xgboost.core.Booster object at 0x7f3ff7af8250>
(Pdb) estimator._best_model.learners[0].model_file_path
'/somedir/3_Default_Xgboost/learner_fold_0.xgboost'
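For illustration, a minimal sketch of step 3 under these assumptions: a single fold, no ensembling, the unofficial _best_model reference shown above, and MLflow's built-in xgboost flavor. Note that this captures only the raw booster, not mljar's preprocessing:

import mlflow
import mlflow.xgboost

# unofficial reference to the winning booster, as shown above
booster = estimator._best_model.learners[0].model

with mlflow.start_run():
    # log the bare booster with the built-in xgboost flavor;
    # mljar's preprocessing is not included in this artifact
    mlflow.xgboost.log_model(booster, artifact_path="model")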

pplonski commented 2 years ago

Hi @totalhack,

At prediction time, the model needs the data to be preprocessed in the same way as the training data. The preprocessing information is stored in the framework.json file inside the model's directory, so this file needs to be included in the final model package as well.

If your best model is an ensemble or a stacked model, then you need to save all of the models used to construct it.

If you are using cross-validation, then the number of learners equals the number of folds. All of them need to be saved/loaded, and the final prediction is the average of their predictions.

That's why it is hard to extract just a single file. One option is to pack all the needed files into one zip file. Would that work for you?
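For illustration, a minimal sketch of that fold averaging, assuming the fold learners are plain XGBoost boosters saved as learner_fold_*.xgboost files and that the features have already been preprocessed the same way as the training data:

import numpy as np
import xgboost as xgb

def predict_with_folds(fold_model_paths, X_preprocessed):
    # load one booster per cross-validation fold
    boosters = [xgb.Booster(model_file=path) for path in fold_model_paths]
    dmatrix = xgb.DMatrix(X_preprocessed)
    # the final prediction is the average over the fold models
    return np.mean([booster.predict(dmatrix) for booster in boosters], axis=0)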

totalhack commented 2 years ago

Thanks @pplonski, that makes sense. I wasn't thinking about the preprocessing... I was picturing things being serialized a bit more like an sklearn pipeline, where everything can be saved/loaded together.

Is everything that's needed to recreate the winning "pipeline" stored in the directory of the winning model (ignoring the ensemble scenario)?

What is the procedure for loading a particular model, other than copying the model's directory from machine A to machine B? For example, say I have one winning model out of 5, and I want to copy just that model as an artifact that gets downloaded to some production machine for loading... if I just do AutoML(results_path=...) on machine B with only the winning model copied, will it automatically load it so that I can call predict at that point?

I can probably come up with a custom process for saving the AutoML output directory as an artifact with MLflow; I just want to make sure I can isolate only what I need for a lean production scenario.
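For illustration, a minimal sketch of the machine B side, assuming the copied directory contains everything AutoML expects (the directory name is hypothetical):

from supervised.automl import AutoML

# point AutoML at the directory copied from machine A; the models listed
# in params.json are loaded for prediction
automl = AutoML(results_path="copied_results_dir")
predictions = automl.predict(X_test)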

pplonski commented 2 years ago

@totalhack, it will require some additional coding. Information about which models should be loaded is stored in the params.json file in the main AutoML directory. This file needs to be edited so that only the needed models are left there.

I think there should be a method that packs all the files needed for production and creates a folder/zip that can be moved to other machines and easily loaded.
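For illustration, a rough sketch of that params.json trimming, assuming the saved, best_model and load_on_predict keys shown in the snippet below and a non-ensemble winner (the results_dir path is hypothetical):

import json
import os

results_dir = "AutoML_results"
params_path = os.path.join(results_dir, "params.json")

with open(params_path) as f:
    params = json.load(f)

best = params["best_model"]
# keep only the winning model in the lists of models to load
params["saved"] = [best]
params["load_on_predict"] = [best]

with open(params_path, "w") as f:
    json.dump(params, f, indent=4)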

totalhack commented 2 years ago

@pplonski just inspecting params.json:

<snip>
    "saved": [
        "1_Baseline",
        "2_DecisionTree",
        "3_Default_Xgboost",
        "4_Default_NeuralNetwork",
        "5_Default_RandomForest",
        "Ensemble"
    ],
    "fit_level": "finished",
    "best_model": "3_Default_Xgboost",
    "load_on_predict": [
        "3_Default_Xgboost"
    ]

I see load_on_predict just references the best model. So presumably, if I copied this file and the "3_Default_Xgboost" directory into the artifact, then AutoML would have enough to load that back up for prediction? I may go down that road as it seems relatively straightforward. The subdirectories still have lots of extra files relevant only to training, but they are pretty small so it's not a big issue.

totalhack commented 2 years ago

At first glance this seems to work. I extract params.json, progress.json, data_info.json and the winning model directory (as named in the best_model value in params.json) into a separate directory, then use mlflow.log_artifact to include this with the serialized model. I can then pull in that model using the regular mlflow tools in a separate service.
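For illustration, a rough sketch of that extraction step (the directory names are hypothetical; mlflow.log_artifacts is the MLflow call for logging a whole directory):

import json
import os
import shutil

import mlflow

results_dir = "AutoML_results"   # original AutoML output
slim_dir = "AutoML_production"   # trimmed copy to log as an artifact

os.makedirs(slim_dir, exist_ok=True)
for fname in ["params.json", "progress.json", "data_info.json"]:
    shutil.copy(os.path.join(results_dir, fname), slim_dir)

with open(os.path.join(results_dir, "params.json")) as f:
    best_model = json.load(f)["best_model"]
shutil.copytree(os.path.join(results_dir, best_model),
                os.path.join(slim_dir, best_model))

with mlflow.start_run():
    # attach the trimmed directory to the run as a single artifact folder
    mlflow.log_artifacts(slim_dir, artifact_path="automl_model")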

A nice improvement on this, aside from having an API in mljar that roughly does what I just described, would be to also produce a pip requirements file that lists only the libraries required to run the winning model. You'd also need to support mljar installs that don't automatically pull in all libraries, of course. That would keep the production install size/time a bit leaner. The full mljar install comes with a lot of stuff that is unnecessary for a production environment running a single winning model.

dvirginz commented 2 years ago

In that context, how would you load an ensemble on another machine? Thanks.

dbrami commented 1 year ago

@totalhack I'm running into the same scenario and trying to figure out whether I should do something like you did, or use MLJAR as a "heuristic" to identify the best model and then re-train it using a more mainstream package - though option 2 becomes problematic with ensemble models. Could you share code for how you did the following? "At first glance this seems to work. I extract params.json, progress.json, data_info.json and the winning model directory as named in the best_model value in params.json into a separate directory, then use mlflow.log_artifact to include this with the serialized model. I can then pull in that model using the regular mlflow tools in a separate service."