mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
https://mljar.com
MIT License
3.03k stars 404 forks

Multi-output regression support #349

Open pplonski opened 3 years ago

PeterLuenenschloss commented 2 years ago

Support for multi-output regression would be great! In a project, I am currently wrapping the AutoML instance with sklearn.multioutput models to achieve multi-output fitting. This nearly works. There are only two problems:

  1. Since the models are trained consecutively, results_path is no longer empty after the first model is fit, so training of the subsequent models is aborted.
  2. While multi-output regression works (with results_path not set), multi-output classification fails, because sklearn tries to access the classes_ attribute of each fitted AutoML model, which does not exist. I don't know if that is solvable.

pplonski commented 2 years ago

@PeterLuenenschloss an additional argument, multi_output=True, should be added to the AutoML constructor to tell the AutoML object that it will be trained in a multi-output setting. The final results can be saved in nested directories. For example:

automl = AutoML(results_path="AutoML_multi", multi_output=True)
clf = MultiOutputClassifier(automl).fit(X,Y)

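There will be paths:

AutoML_multi/AutoML_1
AutoML_multi/AutoML_2
AutoML_multi/AutoML_3
and so on, up to the number of targets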

How do predictions work in MultiOutputClassifier? Does it keep all objects in RAM?

PeterLuenenschloss commented 2 years ago

There will be paths:

AutoML_multi/AutoML_1
AutoML_multi/AutoML_2
AutoML_multi/AutoML_3
and so on, up to the number of targets

Yes, that's how I thought it should be, too!
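
The nested layout discussed above could be generated with a small helper. This is only a sketch of the proposal: the function name target_results_paths is hypothetical, and nothing like it exists in mljar-supervised as far as this thread shows.

```python
from pathlib import Path


def target_results_paths(base_path, n_targets):
    """Return one results directory per target: base/AutoML_1 ... base/AutoML_n."""
    base = Path(base_path)
    return [base / f"AutoML_{i}" for i in range(1, n_targets + 1)]


# Three targets give three sibling directories under the shared base path.
for path in target_results_paths("AutoML_multi", 3):
    print(path.as_posix())
```

Numbering starts at 1 to match the directory names proposed in the thread.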

an additional argument, multi_output=True, should be added to the AutoML constructor

Maybe it is worth supporting not only simple multi-output, but also chained regression (or even defaulting to it) by wrapping with sklearn's RegressorChain. In that case, an additional keyword, order, would be needed to alter the default chain order, and the AutoML results folder would need to store the model-order mapping, so that the trained AutoML models can be associated with the target indices.
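
One way the model-order mapping could be stored next to the trained models is a small JSON file in the results folder. This is a sketch under assumptions: the file name chain_order.json, the helper names, and the AutoML_<n> directory naming are all hypothetical and are not part of mljar-supervised.

```python
import json
import tempfile
from pathlib import Path

ORDER_FILE = "chain_order.json"  # hypothetical file name, not part of mljar-supervised


def save_chain_order(results_path, order):
    """Store which target each chained model predicts, in chain order."""
    if sorted(order) != list(range(len(order))):
        raise ValueError("order must be a permutation of 0..n_targets-1")
    base = Path(results_path)
    base.mkdir(parents=True, exist_ok=True)
    mapping = [
        {"chain_position": pos, "target_index": target, "model_dir": f"AutoML_{pos + 1}"}
        for pos, target in enumerate(order)
    ]
    (base / ORDER_FILE).write_text(json.dumps(mapping, indent=2))
    return mapping


def load_chain_order(results_path):
    """Read the mapping back and return the chain order as a list of target indices."""
    mapping = json.loads((Path(results_path) / ORDER_FILE).read_text())
    ordered = sorted(mapping, key=lambda entry: entry["chain_position"])
    return [entry["target_index"] for entry in ordered]


with tempfile.TemporaryDirectory() as tmp:
    save_chain_order(tmp, order=[2, 0, 1])
    print(load_chain_order(tmp))  # [2, 0, 1]
```

Keeping the chain position and the target index as separate fields makes it possible to reload the models in training order and still map each prediction back to its original target column.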

How do predictions work in MultiOutputClassifier? Does it keep all objects in RAM?

Yes, the wrapper trains a model for every target dimension and combines the fitted model objects into one model that predicts the array of those single-value predictions (just by ordering the results accordingly). The model instances are kept in RAM, I guess; I can't see any explicit writing to disk.

The problem with the classifier wrapper is that, after the fit is done, it tries to collect the prediction classes from the fitted single-target models by accessing each model's classes_ attribute, which fitted AutoML models do not implement (but which other sklearn-style model objects, such as XGBoost, do). MultiOutputClassifier does this only to assign the list of collected classes to the classes_ attribute of the resulting multi-output model object.
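
The failure described above can be reproduced without scikit-learn or AutoML. The sketch below is a simplified stand-in for what the wrapper does, not scikit-learn's actual code: it fits one copy of the estimator per target and then reads the public classes_ attribute from each fitted copy. The two toy classifiers and the function fit_multi_output are made up for illustration.

```python
import copy


class MajorityClassifier:
    """Toy classifier that exposes classes_ after fit, as scikit-learn estimators do."""

    def fit(self, X, y):
        labels = list(y)
        self.classes_ = sorted(set(labels))
        self.majority_ = max(self.classes_, key=labels.count)
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]


class NoClassesClassifier(MajorityClassifier):
    """Same toy model, but it keeps its labels private and never sets classes_."""

    def fit(self, X, y):
        labels = list(y)
        self._labels = sorted(set(labels))
        self.majority_ = max(self._labels, key=labels.count)
        return self


def fit_multi_output(estimator, X, Y):
    """Fit one independent copy of the estimator per target column."""
    fitted = []
    for i in range(len(Y[0])):
        column = [row[i] for row in Y]
        fitted.append(copy.deepcopy(estimator).fit(X, column))
    # This is the step that fails when a fitted estimator has no classes_.
    classes = [est.classes_ for est in fitted]
    return fitted, classes


X = [[0], [1], [2], [3]]
Y = [["a", "x"], ["a", "y"], ["b", "y"], ["a", "y"]]

_, classes = fit_multi_output(MajorityClassifier(), X, Y)
print(classes)  # [['a', 'b'], ['x', 'y']]

try:
    fit_multi_output(NoClassesClassifier(), X, Y)
except AttributeError as err:
    print("classification wrapper failed:", err)
```

Under this reading, exposing a classes_ attribute on the fitted classifier would be enough to get past this particular step, although other compatibility issues may remain.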

xinlnix commented 2 years ago

I cannot find 'multi_output' in the source code or the documentation. Could you explain how I can use multi-output regression for my tabular data?

pplonski commented 2 years ago

@xinlnix it is not yet implemented.

RaymondWKWong commented 2 years ago

A built-in implementation would be great, but for others who need this in the meantime, the following seems to work and returns multi-output predictions.

from sklearn.multioutput import MultiOutputRegressor
from supervised.automl import AutoML

automl = AutoML(mode="Explain")
clf = MultiOutputRegressor(automl).fit(x_train, y_train)
predictions = clf.predict(x_test)

Karlheinzniebuhr commented 2 years ago

A built-in implementation would be great, but for others who need this in the meantime, the following seems to work and returns multi-output predictions.

automl = AutoML(mode="Explain")
clf = MultiOutputRegressor(automl).fit(x_train, y_train)
predictions = clf.predict(x_test)

This method tries to fit the same model again for me:

X_train.shape, X_test.shape, y_train.shape, y_test.shape
((2492, 500), (623, 500), (2492, 3), (623, 3))
automl = AutoML(mode="Explain", results_path=model_path)
reg = MultiOutputRegressor(automl).fit(X_train, y_train)

This model has already been fitted. You can use predict methods or select a new 'results_path' for a new 'fit()'.
This model has already been fitted. You can use predict methods or select a new 'results_path' for a new 'fit()'.
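
The message appears once for each target after the first, presumably because every copy of the estimator made by MultiOutputRegressor inherits the same results_path. A possible workaround, sketched below, is to skip the wrapper and build a fresh estimator per target, each with its own directory. The helpers fit_per_target and predict_per_target are hypothetical, MeanRegressor is a toy stand-in so that the sketch runs without mljar-supervised, and X and Y are assumed to be row-indexable (lists or NumPy arrays; a pandas DataFrame would need converting first).

```python
from pathlib import Path


def fit_per_target(make_estimator, X, Y, base_path):
    """Fit one fresh estimator per target column, each in its own results directory."""
    models = []
    for i in range(len(Y[0])):
        results_path = (Path(base_path) / f"AutoML_{i + 1}").as_posix()
        model = make_estimator(results_path)  # new, unfitted object for every target
        model.fit(X, [row[i] for row in Y])
        models.append(model)
    return models


def predict_per_target(models, X):
    """Combine single-target predictions into rows of length n_targets."""
    columns = [list(model.predict(X)) for model in models]
    return [list(row) for row in zip(*columns)]


class MeanRegressor:
    """Toy stand-in for AutoML: predicts the training mean of its target."""

    def __init__(self, results_path):
        self.results_path = results_path

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


X = [[0.0], [1.0], [2.0]]
Y = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
models = fit_per_target(MeanRegressor, X, Y, "AutoML_multi")
print([m.results_path for m in models])
print(predict_per_target(models, X[:2]))
```

With mljar-supervised, the factory would presumably be something like lambda path: AutoML(mode="Explain", results_path=path). That call has not been tested here, and the target column may need to be passed as a NumPy array or pandas Series rather than a list.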