microsoft / FLAML

A fast library for AutoML and tuning. Join our Discord: https://discord.gg/Cppx2vSPVP.
https://microsoft.github.io/FLAML/
MIT License

HOW TO RETRAIN BEST MODEL ON FULL DATA AFTER TEST SCORE #712

Open luigif2000 opened 1 year ago

luigif2000 commented 1 year ago

Dear all, in the FLAML time-series examples you kindly show how to use FLAML, and the examples also use a test set (generally the last part of all the data); after FLAML finishes searching for the best model, you show how to score the test set with that model, and the examples end there. I THINK a phase is missing: retraining the best model on the full, complete data. I tried but I failed. Could you kindly tell me: 1) whether my idea is right, and 2) how to retrain the best model on the full, complete data (with the hyperparameters that were found)?

Thanks in advance; I look forward to your kind reply.

Luigif2000

sonichi commented 1 year ago

The test data are not part of the training data. They are not known at training time. Whatever you pass to AutoML.fit() should be the full training data. The model is already retrained on the full training data when AutoML.fit() finishes. During the hyperparameter search, part of the full training data are split out as validation data. But at the end of AutoML.fit(), the best configuration is used to retrain on the full training data. For example, these two lines in the output indicate that the model is retrained:

[flaml.automl: 01-21 07:54:14] {2824} INFO - retrain prophet for 0.6s
[flaml.automl: 01-21 07:54:14] {2831} INFO - retrained model: <prophet.forecaster.Prophet object at 0x7fb68ea65d60>
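For illustration, here is a minimal sketch of that workflow, assuming a pandas DataFrame train_df that contains all the data available at training time, with a time column and a label column named "y" (these names and values are placeholders, not code from the examples):

from flaml import AutoML

automl = AutoML()
automl.fit(
    dataframe=train_df,   # the FULL training data; the test set is not passed here
    label="y",
    task="ts_forecast",
    period=7,             # forecast horizon; also the length of the validation split
    time_budget=60,       # seconds allotted to the hyperparameter search
)
# after fit() returns, automl.model already holds the best estimator retrained on train_df
print(automl.best_estimator, automl.best_config)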

luigif2000 commented 1 year ago

Thank you for your kind reply, that's very kind of you! You are right, but my question is about the test data: in the end, the optimized model is never trained or fitted on it, so the best model does not take that part of the data into account. I hope I am explaining the problem well. Let me try to explain better: according to your reply and the examples, the optimized (the "best") model never sees the test set data, so the information in the test set is lost. Is there a way to "retrain" or "refit" on the test set data as well? I hope I have explained the problem.

There is another question: I'm embarrassed, but I really can't understand the period parameter. Why is it set at training time, and how should its value be chosen? Let me explain better: many time-series forecasting frameworks don't use such a parameter; only after the "best" model is found does the user ask for a prediction with a forecast horizon. Why does FLAML need a period variable before that, and how should it be set properly? I can't find any documentation about this parameter. For example, if I have to make a one-day-ahead prediction, should I set period to 1? Sorry for my bad English and thanks a lot in advance.

best regards luigif2000


MichaelMarien commented 1 year ago

Hi, in general I think you can do this "outside" of FLAML as follows. If automl is your fitted FLAML instance:

from sklearn.base import clone
# clone copies the estimator with its tuned hyperparameters but without the fitted state
final_model = clone(automl.model)
# refit on the complete dataset
final_model.fit(X_all, y_all)

hope this helps.

sonichi commented 1 year ago

Is there a way to "retrain" or "refit" on the test set data as well? [...] Why is the period parameter set at training time, and how should its value be chosen?

The information in true test data is not observable when you train the model. For example, if you want to forecast for tomorrow, as of today, you don't have the data about tomorrow. If you mean you want to retrain the model after you obtain the data tomorrow, then it becomes part of the new training data. You could run AutoML.fit() again with the new training data. Or, do you mean you want to retrain the model with the same configuration as you found today on new training data? If so, you can take @MichaelMarien 's answer as one approach.

luigif2000 commented 1 year ago

Dear Michael and sonichi, thanks for your replies! 1) Unfortunately, the great Michael's suggestion doesn't work: no way. Various errors were encountered...

The big problem remains understanding the theory behind this. Let me explain: a) the test set data is useful for scoring the best model; we don't use it during training. b) After the best model is found, why not retrain on all the data, all of it (this is called the production model, which has seen all the data)? Is that right?

I think it is important to solve this issue; FLAML should take care of this (I'm quite sure that pycaret does)... Best regards and thanks.

luigif2000 commented 1 year ago

Hoping to simplify it in one phrase: the best final model (the production model) should know all, ALL the data!? Is that right?

luigif2000 commented 1 year ago

Dear sonichi, another issue: I have always thought that a multivariate time-series forecasting model should make use of lagged-label features. In that case I would need to know today's data (without the label, obviously) in order to forecast tomorrow's label (in a one-day-ahead example). Is that true? Regression is not forecasting... Hope this helps and that I'm not wrong.

luigif2000 commented 1 year ago

https://github.com/pycaret/pycaret/issues/975

luigif2000 commented 1 year ago

https://analyticsindiamag.com/hands-on-guide-to-darts-a-python-tool-for-time-series-forecasting/

MichaelMarien commented 1 year ago

Hi,

  1. Unfortunately, the great Michael's suggestion doesn't work: no way. Various errors were encountered...

Sad to hear. Could you give some insight into the errors, as this works like a charm for me? Note that you might still need to preprocess your data before you can feed it into the cloned model when dealing with time series.

Hoping to simplify it in one phrase: the best final model (the production model) should know all, ALL the data!? Is that right?

Yes, it's common practice to first tune hyperparameters and test a model using a train/valid/test split or cross-validation (FLAML does this for you), and second, to retrain the best-found pipeline on the full dataset (train + valid + test). Note that you then no longer have a fair estimate of the performance (but you assume it's at least as good as the one found during phase 1, since you now have more data).

Some people feel uncomfortable with this because they lack a performance test of the final model, but in some use cases I believe more in testing the end-to-end approach than one particular trained model. Also, this step is often skipped when plenty of data is available. I'm not sure about time series, as I don't consider myself an expert in that field.

luigif2000 commented 1 year ago

Sure, dear Michael, I fully agree. About the problem:

STEP 1)

automl.fit(
    X_train=X[:-100],
    y_train=y[:-100],
    **settings,
    period=5,
    estimator_list=['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth'],
    seed=2,
    n_jobs=1,
)

THE AUTOML FIT ENDS REGULARLY WITHOUT ERROR

STEP 2)

from sklearn.base import clone
final_model = clone(automl.model)
final_model.fit(X, y)

ERROR *** KeyError: 'period'

CHANGED: final_model.fit(X, y, period=5)

ERROR ValueError: Found array with 0 sample(s) (shape=(0, 2420)) while a minimum of 1 is required by RandomForestRegressor.

Any suggestions?

thanks in advance... luigi

luigif2000 commented 1 year ago

As you can see, I used X[:-100], y[:-100] with automl -> NO ERROR

Then I simply used X, y with clone -> ERROR ValueError: Found array with 0 sample(s) (shape=(0, 2420)) while a minimum of 1 is required by RandomForestRegressor.

why?

sonichi commented 1 year ago

As you can see, I used X[:-100], y[:-100] with automl -> NO ERROR

Then I simply used X, y with clone -> ERROR ValueError: Found array with 0 sample(s) (shape=(0, 2420)) while a minimum of 1 is required by RandomForestRegressor.

why?

Does automl.fit(X, y, estimator_list=['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth'], period=5) work?

luigif2000 commented 1 year ago

Hi, YES, automl.fit(X, y, estimator_list=['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth'], period=5) WORKS WITHOUT ERROR

sonichi commented 1 year ago

Hi, YES, automl.fit(X, y, estimator_list=['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth'], period=5) WORKS WITHOUT ERROR

In this case, all the data are used to retrain the model in the end. Does that solve your problem?

luigif2000 commented 1 year ago

Dear sonichi, automl.fit(X, y, ...) changes the best final model... it is not the same thing as Michael's kind suggestion (which unfortunately does not work for me)...

in other words:

a) automl.fit(X[:-100], y[:-100], ...) followed by retraining a clone of the resulting best model on X, y

is not the same as

b) automl.fit(X, y, ...)

Hope this helps.

sonichi commented 1 year ago

Dear sonichi, automl.fit(X, y, ...) changes the best final model... it is not the same thing as Michael's kind suggestion (which unfortunately does not work for me)...

in other words:

a) automl.fit(X[:-100], y[:-100], ...) followed by retraining a clone of the resulting best model on X, y

is not the same as

b) automl.fit(X, y, ...)

Hope this helps.

Why do you need to use the same model for X[:-100] and X?

luigif2000 commented 1 year ago

I need to: 1) train on X[:-100], 2) score on X[-100:], 3) train the best model found in step 2 on all of X (without changing the hyperparameters).

It's the finalize concept in pycaret... I think this is right, but I can't figure out how to do it in FLAML.

sonichi commented 1 year ago

I need to:

  1. train on X[:-100]
  2. score on X[-100:]
  3. train the best model found in step 2 on all of X (without changing the hyperparameters)

It's the finalize concept in pycaret... I think this is right, but I can't figure out how to do it in FLAML.

Now I understand what you want. It's as easy as:

automl.fit(X, y, eval_method="holdout", period=100)

The retraining automatically happens in the end and you can see that from the console log. The scoring is done on X[-100:] with models trained on X[:-100] during the model search and you can see that too from the console log. You can also find the scores in a log file if you set a log_file_name.
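For completeness, a slightly fuller sketch of that call, assuming X and y contain all the rows (including the last 100 previously held out) and using a hypothetical log file name:

automl.fit(
    X_train=X,
    y_train=y,
    task="ts_forecast",
    eval_method="holdout",
    period=100,                      # the last 100 rows form the validation split during the search
    log_file_name="automl_ts.log",   # hypothetical file name; per-trial validation scores are recorded here
    time_budget=60,
)
# the best configuration is then automatically retrained on all of (X, y) at the end of fit()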

luigif2000 commented 1 year ago

I'm very sorry I didn't explain it well, excuse me. Unfortunately the subject of ML is just being born and we have to retrain our natural neurons to think ML. Anyway, thanks so much again for your patience. Unfortunately I suspect that your kind solution is not right (hence the reason for the finalize function in pycaret): the test set data is not the validation data. In other words, the test set data should not be involved in training, I think. I believe the right process should be: 1) train on the training data and score on the validation data => best model, 2) fit the best model on the test set data.

I hope I have explained it better and correctly.

luigif2000 commented 1 year ago

Anyway, I now understand the meaning of the period parameter better (you remember I asked about it at the beginning). Thanks.

qingyun-wu commented 1 year ago

Hi @luigif2000,

Thank you for the questions.

First, the "finalize" concept in pycaret is just trying to train the model on the data including both training and holdout set. It is not touching the test data. @Yard1 can you please help confirm this or clarify the meaning of the "finalize" concept?

Second, FLAML's AutoML also supports retraining the model on all the data in a similar way as mentioned by @sonichi in the conversations above.

Finally, I guess your confusion is caused by the role of the "test data". Below are some of my understandings of the test data:

Hope this explanation resolves your question and confusion.

Thank you!

Yard1 commented 1 year ago

@qingyun-wu finalize_model in pycaret trains the model on all available data. In pycaret, we divide the dataset into train and test (holdout) sets, and use cross validation on the train dataset for validation. We consider test and holdout synonymous. Therefore, finalize_model will train a model on both train and test/holdout sets.
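For comparison, a rough sketch of the pycaret workflow described above (pycaret's API, not FLAML's; df and "y" are placeholder names):

from pycaret.regression import setup, compare_models, finalize_model

setup(data=df, target="y", train_size=0.8)   # the remaining 20% becomes the test/holdout set
best = compare_models()                      # models tuned and selected via cross-validation on the train split
final = finalize_model(best)                 # refits the best pipeline on train + test/holdout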

qingyun-wu commented 1 year ago

@qingyun-wu finalize_model in pycaret trains the model on all available data. In pycaret, we divide the dataset into train and test (holdout) sets, and use cross validation on the train dataset for validation. We consider test and holdout synonymous. Therefore, finalize_model will train a model on both train and test/holdout sets.

@Yard1, thanks for your prompt reply. Regarding "In pycaret, we divide the dataset into train and test (holdout) sets, and use cross validation on the train dataset for validation," what is the holdout set held out for?

Yard1 commented 1 year ago

Final unbiased evaluation (with cross-validation on the training set used during tuning, etc.).

qingyun-wu commented 1 year ago

What is the unbiased evaluation used for? Is it always needed? If I do not need to do an additional step of evaluation after finding the best model via AutoML, this holdout is not needed, right?

Yard1 commented 1 year ago

It's there to detect overfitting - while it's rare to overfit to CV, it's not unheard of. If the results from CV and holdout are widely different, that hints at a problem. That being said it is not needed and in PyCaret you can choose to have a holdout dataset of size 0.

qingyun-wu commented 1 year ago

It's there to detect overfitting - while it's rare to overfit to CV, it's not unheard of. If the results from CV and holdout are widely different, that hints at a problem. That being said it is not needed and in PyCaret you can choose to have a holdout dataset of size 0.

Got it. Thanks for the clarification!

qingyun-wu commented 1 year ago

Hi @luigif2000, can you please read my conversation with Yard1 above?

In a nutshell, pycaret divides the dataset into train and test (holdout) sets, and uses cross-validation on the train dataset for validation. finalize_model in pycaret trains the model on all available data. The test data in pycaret is held out for detecting overfitting and is optional.

FLAML does not have a built-in step of holding out part of the dataset for a final unbiased evaluation (which corresponds to the case of setting the holdout size to 0 in pycaret). Considering this difference, I have the following comments on how to use FLAML's AutoML:

Case 1: If you do not need the final unbiased evaluation step, you can just provide your full dataset to AutoML with automl.fit(X, y, **other_settings). The retraining on the full data (X, y) automatically happens at the end.

Case 2: If you do need to hold out test data for a final evaluation, you need to split the data (X, y) into train and test manually, and retrain the model on the entire dataset manually (see the sketch below). In this case, FLAML's AutoML module should indeed add a function to support more seamless model retraining.
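A minimal sketch of Case 2, reusing the clone approach suggested by @MichaelMarien for the manual retraining step (a plain regression task, the names X and y, and the 100-row test split are assumptions for illustration; for ts_forecast estimators you may need extra fit arguments such as period, as the errors discussed above suggest):

from sklearn.base import clone
from sklearn.metrics import r2_score

X_train, y_train = X[:-100], y[:-100]        # manual split: keep the last 100 rows as the test set
X_test, y_test = X[-100:], y[-100:]

automl.fit(X_train=X_train, y_train=y_train, task="regression", time_budget=60)
test_pred = automl.predict(X_test)
print(r2_score(y_test, test_pred))           # final unbiased evaluation on the held-out rows

final_model = clone(automl.model)            # same hyperparameters, unfitted copy
final_model.fit(X, y)                        # manual retraining on the entire dataset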

Please let me know what you think about my comments.

Thank you!

luigif2000 commented 1 year ago

Dear qingyun-wu, thanks so much for the kind reply. I have realized many things thanks to you:

  1. FLAML uses (TRAIN + VALIDATION) for training
  2. the length of VALIDATION is the PERIOD parameter
  3. FLAML uses (TRAIN + VALIDATION) for training and the TESTSET for scoring, with ALL = TRAIN + VALIDATION + TESTSET
  4. PYCARET uses (TRAIN + TESTSET/HOLDOUT) for training, then finalizes/retrains on (TRAIN + TESTSET/HOLDOUT)
  5. "FLAML's AutoML module indeed should add a function to support more seamless model retraining" on the TESTSET (I was looking for this feature, I realize now)

I hope these points are right.

The unbiased validation data (the TESTSET in FLAML naming convention) needs to be quite big; for this reason I always thought that retraining/fitting on the TESTSET was mandatory. I didn't find any reference to this in the FLAML examples, and therefore I got confused.

Sorry

Could you please kindly check POINT 2 in particular? The PERIOD parameter is still a little mystery to me; I can't find any good documentation about it.

Thanks in advance.

sonichi commented 1 year ago

  1. FLAML uses (TRAIN + VALIDATION) for training
  2. the length of VALIDATION is the PERIOD parameter
  3. FLAML uses (TRAIN + VALIDATION) for training and the TESTSET for scoring, with ALL = TRAIN + VALIDATION + TESTSET
  4. PYCARET uses (TRAIN + TESTSET/HOLDOUT) for training, then finalizes/retrains on (TRAIN + TESTSET/HOLDOUT)
  5. "FLAML's AutoML module indeed should add a function to support more seamless model retraining" on the TESTSET (I was looking for this feature, I realize now)

Could you please kindly check POINT 2 in particular? The PERIOD parameter is still a little mystery to me; I can't find any good documentation about it.

3 and 4 are incorrect: FLAML doesn't use anything for scoring. The scoring example in the notebook is meant for a user to check the performance after the trained model is deployed, not before deployment. PYCARET uses TRAIN for training, not TRAIN + TESTSET/HOLDOUT. 2 is correct: when holdout is used as the evaluation method, only one train/validation split is produced; when cross-validation is used as the eval_method, multiple train/validation splits are produced, and the length of the validation split is always PERIOD. @int-chaos could you improve the documentation about PERIOD per @luigif2000's request?
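To make point 2 concrete, a small sketch contrasting the two evaluation methods (the parameter values here are placeholders):

# holdout: a single split; the last `period` rows of the training data form the validation set
automl.fit(X_train=X, y_train=y, task="ts_forecast", eval_method="holdout", period=7)

# cv: multiple train/validation splits in time order; each validation window still has length `period`
automl.fit(X_train=X, y_train=y, task="ts_forecast", eval_method="cv", n_splits=3, period=7)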

luigif2000 commented 1 year ago

You're right! Sure!... Thanks so much, sincerely, and excuse me for my misunderstanding.

That's very kind of you, and your patience is great.

I got it.

Wish you the best!!!

sonichi commented 1 year ago

You're right! Sure!... Thanks so much, sincerely, and excuse me for my misunderstanding.

That's very kind of you, and your patience is great.

I got it.

Wish you the best!!!

You are welcome. Feel free to chat on gitter.

luigif2000 commented 1 year ago

Great! I will try... thanks so much.


datqduong commented 5 months ago

Case 2: If you do need to hold out test data for a final evaluation, you need to split the data (X, y) into train and test manually, and retrain the model on the entire dataset manually. In this case, FLAML's AutoML module should indeed add a function to support more seamless model retraining.

Hi, do you have any updates on Case 2?