microsoft / FLAML

A fast library for AutoML and tuning. Join our Discord: https://discord.gg/Cppx2vSPVP.
https://microsoft.github.io/FLAML/

How can I do a multi-output regression task #301

Open fengsxy opened 2 years ago

fengsxy commented 2 years ago

When I do a multi-output regression task, it reports an error (see screenshot):

But I think the shapes match.

fengsxy commented 2 years ago

As for the reason: I find that y gets flattened in this code, even though estimators such as ExtraTrees do support multi-output regression (see screenshot):


            assert (
                isinstance(X_train_all, np.ndarray)
                or issparse(X_train_all)
                or isinstance(X_train_all, pd.DataFrame)
            ), (
                "X_train_all must be a numpy array, a pandas dataframe, "
                "or Scipy sparse matrix."
            )
            assert isinstance(y_train_all, np.ndarray) or isinstance(
                y_train_all, pd.Series
            ), "y_train_all must be a numpy array or a pandas series."
            assert (
                X_train_all.size != 0 and y_train_all.size != 0
            ), "Input data must not be empty."
            if isinstance(X_train_all, np.ndarray) and len(X_train_all.shape) == 1:
                X_train_all = np.reshape(X_train_all, (X_train_all.size, 1))
            if isinstance(y_train_all, np.ndarray):
                y_train_all = y_train_all.flatten()  # <-- y is flattened here
            assert (
                X_train_all.shape[0] == y_train_all.shape[0]
            ), "# rows in X_train must match length of y_train."
            self._df = isinstance(X_train_all, pd.DataFrame)
            self._nrow, self._ndim = X_train_all.shape
            if self._state.task == TS_FORECAST:
                X_train_all = pd.DataFrame(X_train_all)
                assert (
                    X_train_all[X_train_all.columns[0]].dtype.name == "datetime64[ns]"
                ), f"For '{TS_FORECAST}' task, the first column must contain timestamp values."
            X, y = X_train_all, y_train_all
        elif dataframe is not None and label is not None:
            assert isinstance(
                dataframe, pd.DataFrame
            ), "dataframe must be a pandas DataFrame"
            assert label in dataframe.columns, "label must a column name in dataframe"
            self._df = True
            if self._state.task == TS_FORECAST:
                assert (
                    dataframe[dataframe.columns[0]].dtype.name == "datetime64[ns]"
                ), f"For '{TS_FORECAST}' task, the first column must contain timestamp values."
            X = dataframe.drop(columns=label)
            self._nrow, self._ndim = X.shape
            y = dataframe[label]
        else:
            raise ValueError("either X_train+y_train or dataframe+label are required")
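
For context on why the assertion fails: if y_train_all has shape (n, 2), the highlighted flatten() call turns it into a 1-D array of length 2n, so the row-count check right after it no longer passes. A minimal sketch with toy data (names here are illustrative):

```python
import numpy as np

X = np.random.rand(100, 5)   # 100 rows, 5 features
y = np.random.rand(100, 2)   # 100 rows, 2 outputs

y_flat = y.flatten()         # shape (200,), no longer one label per row
print(X.shape[0], y_flat.shape[0])  # 100 200 -> "# rows in X_train must match length of y_train."
```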
sonichi commented 2 years ago

Example for multi-output regression: https://github.com/microsoft/FLAML/blob/85e21864ce20a323fefe297b18b8aa75f9a72f59/test/automl/test_regression.py#L198

This works if you install flaml from git (pip install git+https://github.com/microsoft/FLAML.git) until the v0.8.0 release is out.

Related: #277 #292
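
For reference, a minimal sketch of the wrapper approach along the lines of the linked test (the toy data, time budget, and variable names below are illustrative, not taken from the test):

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from flaml import AutoML

X = np.random.rand(100, 5)
Y = np.random.rand(100, 2)   # two regression targets per row

# Wrap FLAML's sklearn-compatible AutoML so one search runs per output column.
model = MultiOutputRegressor(AutoML(task="regression", time_budget=10))
model.fit(X, Y)
print(model.predict(X[:3]))  # one prediction per target, shape (3, 2)
```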

sonichi commented 2 years ago

@fengsxy @miquelduranfrigola @gmkyadav I have an idea to improve the AutoML performance for multi-output task. After AutoML.fit() finishes for one output, we could use the warm-start capability from flaml for the next output. That is supposed to work better than doing the search from scratch. See an example of warm-start here: https://microsoft.github.io/FLAML/docs/Use-Cases/Task-Oriented-AutoML#warm-start Would you like to implement that idea together and test it in your applications?

(cc @slhuang @int-chaos )
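
A rough sketch of that per-output warm start, assuming toy data and a placeholder 60-second budget (the starting_points usage follows the warm-start doc linked above):

```python
import numpy as np
from flaml import AutoML

X = np.random.rand(100, 5)
Y = np.random.rand(100, 2)          # one column per output

models = []
starting_points = {}                # empty for the first output
for k in range(Y.shape[1]):
    automl = AutoML()
    automl.fit(
        X_train=X,
        y_train=Y[:, k],
        task="regression",
        time_budget=60,
        starting_points=starting_points,  # warm-start from the previous output's best configs
    )
    starting_points = automl.best_config_per_estimator
    models.append(automl)
```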

miquelduranfrigola commented 2 years ago

Hi @sonichi, I think this is a good idea. I am not an AI/ML expert myself (I am more on the user side), but I would be very happy to test it on my applications. However, I wonder: does the warm-start idea work when the outputs are not correlated?

sonichi commented 2 years ago

@miquelduranfrigola It works as long as the hyperparameter choices for different outputs are correlated; the outputs themselves don't need to be correlated. Given that the features used to predict the different outputs are the same, chances are that the hyperparameters chosen for one output are a good starting point for the others.

fengsxy commented 2 years ago

Hi @sonichi, so does it mean that we train one of the single outputs first, and then use that output as a feature and best_config_per_estimator as the starting_point for the next AutoML run? The advantage would be that the search starts from a better point and adapts to the related output? My example is a two-output regression problem, and the two outputs are related.

sonichi commented 2 years ago

@fengsxy Whether to use the first output as a feature is up to you. In either case, you can use best_config_per_estimator as the starting_point for the next AutoML run. Would you like to try that idea in your application?
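
If one does want to feed the first output in as a feature, a hedged sketch of that chained variant could look like this (toy data and names are illustrative):

```python
import numpy as np
from flaml import AutoML

X = np.random.rand(100, 5)
Y = np.random.rand(100, 2)

# First output: plain AutoML run.
automl1 = AutoML()
automl1.fit(X_train=X, y_train=Y[:, 0], task="regression", time_budget=60)

# Second output: append the first output's prediction as an extra feature
# and warm-start from the configs found for the first output.
X_aug = np.column_stack([X, automl1.predict(X)])
automl2 = AutoML()
automl2.fit(
    X_train=X_aug,
    y_train=Y[:, 1],
    task="regression",
    time_budget=60,
    starting_points=automl1.best_config_per_estimator,
)
```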

fengsxy commented 2 years ago

@sonichi I plan to try this idea, but I only have about 5,000 samples, so I will set the search time short. I'll try it and share my results.

sonichi commented 2 years ago

@fengsxy Thanks. BTW, what application do you use flaml for?

fengsxy commented 2 years ago

@sonichi My data uses many continuous features to predict PM2.5/PM10 concentrations (continuous targets).

fengsxy commented 2 years ago

@sonichi Hi, I have two things to share! Today I tried to use warm-start, but when I use MultiOutputRegressor with two outputs it has some problems (see screenshot). Also, after thinking it over, I have a new idea: some estimators, like ExtraTrees, don't need MultiOutputRegressor at all, so we could check the data and choose a model that is naturally adapted to the multi-output regression task.

sonichi commented 2 years ago

automl1 is an instance of MultiOutputRegressor, so it doesn't have the attribute best_config_per_estimator. I think the idea we discussed was to run AutoML for the first label (single-output regression) first, then take the best_config_per_estimator as the starting_point for the next label, and so on.

The new idea you have requires some changes in the _preprocess() function of each estimator in model.py. Right now we require y_train to be a 1-d array or a pandas Series. If you pass multiple labels per row in y_train (for example, as a list), then you need to convert y_train into a format that is recognizable by ExtraTrees for multi-output regression (like a 2-D array or a DataFrame).
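
To illustrate the point about natively multi-output estimators: scikit-learn's ExtraTreesRegressor already accepts a 2-D y directly, which is the format such an estimator-level change would need to pass through (toy data below is just for illustration):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

X = np.random.rand(200, 5)
Y = np.random.rand(200, 2)           # two targets per row

est = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X, Y)
print(est.predict(X[:3]).shape)      # (3, 2): one prediction per target
```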