mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
https://mljar.com
MIT License

Import result into Scikit-learn #481

Open AlexanderZender opened 2 years ago

AlexanderZender commented 2 years ago

Hello,

I have some questions about how to import the resulting models of the MLJAR process into scikit-learn.

1. Do I have to manually extract the parameters of each model from framework.json and map them onto their scikit-learn counterparts?

2. If that is the case, do the models need to be trained again after importing them into scikit-learn?

3. Same question as 1, but for the ensemble result. I'm quite unsure how that one would be imported into scikit-learn.

Best regards, Alex

pplonski commented 2 years ago

Hi @AlexanderZender,

  1. Yes, you will need to manually extract information from framework.json, though not for the models but for the preprocessing (if any was applied). The models themselves are loaded from the model files.
  2. It depends. If you load the trained models, you don't need to train them again. If you would like to reproduce the training from MLJAR AutoML, you will need to take the hyperparameters from the framework.json file and train the model from scratch.
  3. The ensemble model in MLJAR AutoML is just the weighted average of the predictions from the models in the ensemble. You can simply take the weight of each model and its predictions and compute the final ensemble prediction in one equation.
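
The weighted average from point 3 can be sketched roughly as follows. This is a minimal illustration and not MLJAR's own code: the function name `ensemble_predict`, the model names, and the numbers are all made up, and the sketch assumes the weights may not sum to one, so it divides by their total.

```python
def ensemble_predict(predictions, weights):
    """Weighted average of per-model predictions.

    predictions: dict mapping model name -> list of per-sample predictions
    weights:     dict mapping model name -> non-negative weight
    Both dicts are hypothetical inputs; how you collect them is up to you.
    """
    total = sum(weights[name] for name in predictions)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_samples = len(next(iter(predictions.values())))
    # For each sample, sum weight * prediction over all models, then normalize.
    return [
        sum(weights[name] * preds[i] for name, preds in predictions.items()) / total
        for i in range(n_samples)
    ]


# Made-up numbers for two hypothetical models and two samples.
predictions = {"model_a": [0.2, 0.8], "model_b": [0.4, 0.6]}
weights = {"model_a": 3, "model_b": 1}
ensemble = ensemble_predict(predictions, weights)
```

For multi-class probabilities the same formula would be applied per class column.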

@AlexanderZender why do you need to move the models to pure scikit-learn? It will be possible to automatically generate scikit-learn pipelines for models trained in MLJAR AutoML, but I don't like the preprocessing in scikit-learn because it doesn't handle edge cases; it just throws errors.

AlexanderZender commented 2 years ago

Thank you @pplonski,

I was positively surprised by how quickly I received your reply! :)

To answer your question: I'm interested in moving it to its base library for the realization of OMA-ML. Here is the link to our paper: https://link.springer.com/chapter/10.1007/978-3-030-79150-6_10

Put simply, one goal is to provide a user with a ready-to-use script and model without requiring the AutoML library itself. Therefore I'm moving the model out of MLJAR and into scikit-learn during template creation.

  1. Alright, that's great to know! This may be a somewhat obvious question, but as someone who hasn't had much opportunity to work with scikit-learn, I'm not sure what the model files are or how they are loaded into scikit-learn. As far as I understand, scikit-learn has no direct import mechanism other than pickle. But I cannot find any file that would be used for that. Or is the model saved inside, e.g., learner_fold_0_tree?

  2. OK. Thank you

  3. OK. Thank you

pplonski commented 2 years ago

The speed of my responses depends on the type of work I'm doing. If I'm coding, it can take even 2 weeks; if I'm working on marketing things, it can be fast ;) (BTW, I just finished the YouTube video about MLJAR Studio integration with Google Sheets https://youtu.be/eFAimK2UbOM) ;)

The article looks amazing! Good idea.

For scikit-learn models, MLJAR AutoML uses the joblib package. Please take a look at the code for save and load: https://github.com/mljar/mljar-supervised/blob/f695fe5cad7fd075c6d7e2a72e9b8f8f18ddb1f2/supervised/algorithms/sklearn.py#L43-L51

Each algorithm used in MLJAR AutoML has a file_extension() method that returns the extension of its model file. For example, for Decision Tree: https://github.com/mljar/mljar-supervised/blob/f695fe5cad7fd075c6d7e2a72e9b8f8f18ddb1f2/supervised/algorithms/decision_tree.py#L106-L107

All *.decision_tree files in the model folder are model files that can be loaded with joblib.
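
Loading those files could look roughly like the sketch below. It is only an illustration under assumptions: the helper name `load_model_files` and its `loader` parameter are hypothetical (the parameter exists so another reader can be swapped in), and the sketch assumes each matching file holds a single joblib-serialized object.

```python
from pathlib import Path


def load_model_files(model_dir, extension="decision_tree", loader=None):
    """Load every '*.<extension>' file in model_dir, keyed by file name."""
    if loader is None:
        # joblib is the package named in this thread for scikit-learn models.
        import joblib
        loader = joblib.load
    return {
        path.name: loader(path)
        for path in sorted(Path(model_dir).glob(f"*.{extension}"))
    }
```

You would call it with the path to one model folder from your AutoML results directory, for example `models = load_model_files("path/to/model_folder")`, where the path is a placeholder. scikit-learn has to be installed for the objects to unpickle, and using the same scikit-learn version that trained them is the safest choice.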