oldrichsmejkal opened 3 years ago
@oldrichsmejkal very good idea! I like it.
How would you like to see it implemented? Maybe add a method to AutoML class:
def zip_for_production(self, path_where_to_zip):
# zipping code ...
# ...
@oldrichsmejkal would you like to work on this?
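For what it's worth, a minimal sketch of what such a method might do, assuming it simply zips the whole AutoML results directory (the `zip_for_production` name follows the proposal above; it is not an existing mljar API):

```python
import os
import zipfile

def zip_for_production(results_path, path_where_to_zip):
    # Pack an AutoML results directory into a single zip file (sketch).
    with zipfile.ZipFile(path_where_to_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(results_path):
            for name in files:
                full = os.path.join(root, name)
                # store paths relative to the results directory
                zf.write(full, os.path.relpath(full, results_path))
```

A real implementation would likely live on the AutoML class and filter out training-only files, but the core is just a recursive walk plus `zipfile`.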
Hi there, I'm just starting to test out mljar and I'm running into a similar requirement.
From what I've read I gather that mljar saves each model in its native format. Is there an official API for getting a reference to the best model instance right after training? I've poked through the docs and available methods/attributes of a trained AutoML instance but may have missed it.
I use MLFlow for model tracking and deployment, so ideally I'd be able to hand the best trained model straight to MLFlow after training.
Thanks for your work on this library!
EDIT: I see that you can get the name and library of the top model with get_leaderboard, which returns a DataFrame:
(Pdb) estimator.get_leaderboard()
name model_type metric_type metric_value train_time
0 3_Default_Xgboost Xgboost logloss 0.556054 21.6
But I'm not seeing an officially supported way to get a reference to that underlying estimator. It looks like it can be found in an unofficial way like this:
(Pdb) estimator._best_model.learners[0].model
<xgboost.core.Booster object at 0x7f3ff7af8250>
(Pdb) estimator._best_model.learners[0].model_file_path
'/somedir/3_Default_Xgboost/learner_fold_0.xgboost'
Hi @totalhack,
The model (for prediction) needs data to be preprocessed in the same way as training data. The information about preprocessing is stored in the framework.json
file inside the model's directory. This file should be added to the final model file.
If your best model is an ensemble or a stacked model, then you need to save all the models used to construct it.
If you are using cross-validation, the number of learners equals the number of folds. All of them need to be saved/loaded, and the final prediction is the average of their predictions.
That's why it is hard to extract only a single file. There is an option to pack all needed files into one zip file. Would it work for you?
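The cross-validation averaging described above boils down to something like this (a plain-Python sketch; the fold probabilities are made-up numbers, and in mljar each inner list would come from one fold's learner):

```python
def average_fold_predictions(fold_predictions):
    # Average per-row predictions across the CV-fold learners (sketch).
    n_folds = len(fold_predictions)
    return [sum(rows) / n_folds for rows in zip(*fold_predictions)]

# e.g. class-1 probabilities from 3 fold learners for 4 rows
folds = [
    [0.2, 0.8, 0.5, 0.9],
    [0.3, 0.7, 0.4, 1.0],
    [0.1, 0.9, 0.6, 0.8],
]
final = average_fold_predictions(folds)  # one averaged probability per row
```

This is why a single serialized booster file is not enough under cross-validation: prediction needs every fold's learner.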
Thanks @pplonski, that makes sense. I wasn't thinking about the preprocessing... I was picturing things being serialized a bit more like an sklearn pipeline, where everything can be saved/loaded together.
Is everything that's needed to recreate the winning "pipeline" stored in the directory of the winning model (ignoring the ensemble scenario)?
What is the procedure for loading a particular model, other than copying the model's directory from machine A to machine B? Say I have one winning model out of 5, and I want to copy just that model as an artifact that gets downloaded to a production machine for loading. If I run AutoML(results_path=...)
on machine B with only the winning model copied, will it automatically load it so I can just call predict
at that point?
I can probably come up with a custom process for saving the AutoML output directory as an artifact with MLFlow; I just want to make sure I can isolate only what I need for an optimized production scenario.
@totalhack, it will require some additional coding. Information about which models should be loaded is stored in the params.json
file in the main AutoML directory. This file should be edited so that only the needed models are left in it.
I think a method should be added that packs all the files needed for production into a folder/zip that can be moved to other machines and easily loaded.
@pplonski just inspecting params.json:
<snip>
"saved": [
"1_Baseline",
"2_DecisionTree",
"3_Default_Xgboost",
"4_Default_NeuralNetwork",
"5_Default_RandomForest",
"Ensemble"
],
"fit_level": "finished",
"best_model": "3_Default_Xgboost",
"load_on_predict": [
"3_Default_Xgboost"
]
I see load_on_predict just references the best model. So presumably if I copied this file and the "3_Default_Xgboost" directory into the artifact then AutoML would have enough to just load that back up for prediction? I may look into that road as it seems relatively straightforward. The subdirectories still all have lots of extra files relevant only to training but they are pretty small so it's not a big issue.
At first glance this seems to work. I extract params.json, progress.json, data_info.json, and the winning model directory (as named in the best_model
value in params.json) into a separate directory, then use mlflow.log_artifact
to include this with the serialized model. I can then pull that model in using the regular mlflow tools in a separate service.
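In case it helps others, here is a rough sketch of that extraction step. The file names come from this thread; the `extract_best_model` function and the commented-out MLFlow usage are my own illustration, not an mljar API:

```python
import json
import shutil
from pathlib import Path

def extract_best_model(results_path, out_dir):
    # Copy only what the winning model needs into out_dir (sketch).
    src, dst = Path(results_path), Path(out_dir)
    dst.mkdir(parents=True, exist_ok=True)
    params = json.loads((src / "params.json").read_text())
    best = params["best_model"]  # e.g. "3_Default_Xgboost"
    # top-level metadata files AutoML reads on load
    for name in ("params.json", "progress.json", "data_info.json"):
        shutil.copy(src / name, dst / name)
    # the winning model's directory (preprocessing info lives in its framework.json)
    shutil.copytree(src / best, dst / best)
    return dst

# artifact_dir = extract_best_model("AutoML_1", "automl_artifact")
# mlflow.log_artifact(str(artifact_dir))  # then log the directory with MLFlow
```

Note this only covers a single winning model; an ensemble would also need every constituent model's directory copied, as discussed above.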
What might be a nice improvement on this, aside from having an API in mljar that roughly does what I just described, is to also produce a pip requirements file listing only the libraries required to run the winning model. You'd also need to support mljar installs that don't automatically pull in all libraries, of course. That would keep the production install size and time a bit leaner; the full mljar install comes with a lot of unnecessary stuff for a production environment running a single winning model.
In that context, how would you load an ensemble on another machine? Thanks.
@totalhack I'm running into the same scenario and trying to figure out whether I should do something like you did, or use MLJAR as a "heuristic" to identify the best model and then re-train using a more mainstream package - though option 2 becomes problematic with ensemble models. Could you share the code for how you did the following? "At first glance this seems to work. I extract params.json, progress.json, data_info.json and the winning model directory as named in the best_model value in params.json into a separate directory, then use mlflow.log_artifact to include this with the serialized model. I can then pull in that model using the regular mlflow tools in a separate service."
Make the model ready for production-like usage (a single compact file).
It could be something simple: filter only the needed files (serialized sub-models) and zip them; load the model by unzipping the sub-models in memory, etc.