microsoft / FLAML

A fast library for AutoML and tuning. Join our Discord: https://discord.gg/Cppx2vSPVP.
https://microsoft.github.io/FLAML/

best_model_for_estimator() returns empty objects for some estimators #434

Open flippercy opened 2 years ago

flippercy commented 2 years ago

Hi @sonichi:

I just found that, after running automl with several customized estimators, best_model_for_estimator() returned an empty object for certain estimators even though several models had been built with them. When I tried to save the best model, I got an error message like the one below:

AttributeError: 'NoneType' object has no attribute 'save_model'

Do you know why? My team has been using FLAML for quite a while, so it is not due to coding errors or the time budget. The dataset is big; however, we have used similar datasets with no issues before. Our FLAML is the latest version, 0.9.5, and we set model_history = True.
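For context, a rough sketch of the call pattern we use, with built-in learner names and synthetic data standing in for our customized estimators; the save_model call assumes the fitted CatBoost model is reachable on the wrapper's .model attribute:

```python
from flaml import AutoML
from sklearn.datasets import make_classification

# Synthetic stand-in data; our real runs register custom learners via
# automl.add_learner(), but the call pattern is the same.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

automl = AutoML()
automl.fit(
    X_train=X,
    y_train=y,
    task="classification",
    time_budget=60,
    estimator_list=["lgbm", "catboost"],
    model_history=True,
)

best = automl.best_model_for_estimator("catboost")
if best is None:
    # The unexpected case reported here: trials ran for this estimator,
    # yet no best model object is returned.
    print("No model recorded for catboost")
else:
    # best is a FLAML estimator wrapper; the fitted CatBoost model on .model
    # exposes CatBoost's own save_model().
    best.model.save_model("best_catboost.cbm")
```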

Thank you.

flippercy commented 2 years ago

In addition, during the training I noticed some discrepancies in the output as shown below:

[image: screenshot of the training console output]

'MonotonicCatboost' and 'MonotonicXgboostDart' are my customized classifiers. In this case, after training a monotonic catboost, shouldn't the program return the best performance of MonotonicCatboost instead of MonotonicXgboostDart?

sonichi commented 2 years ago

> In addition, during the training I noticed some discrepancies in the output as shown below:
>
> [image: screenshot of the training console output]
>
> 'MonotonicCatboost' and 'MonotonicXgboostDart' are my customized classifiers. In this case, after training a monotonic catboost, shouldn't the program return the best performance of MonotonicCatboost instead of MonotonicXgboostDart?

What's the search space for "MonotonicCatboost"? One possible reason is that the sampler fails to find a new config for MonotonicCatboost, so it is skipped. And for some reason some lines in the console log are missing.

sonichi commented 2 years ago

> Hi @sonichi:
>
> I just found that, after running automl with several customized estimators, best_model_for_estimator() returned an empty object for certain estimators even though several models had been built with them. When I tried to save the best model, I got an error message like the one below:
>
> AttributeError: 'NoneType' object has no attribute 'save_model'
>
> Do you know why? My team has been using FLAML for quite a while, so it is not due to coding errors or the time budget. The dataset is big; however, we have used similar datasets with no issues before. Our FLAML is the latest version, 0.9.5, and we set model_history = True.
>
> Thank you.

If you make the estimator_list contain a single estimator that causes this issue, do you get the same problem?
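Something along these lines, where MyCatBoost is just a stand-in for your customized estimator class and the data is synthetic:

```python
from flaml import AutoML
from flaml.model import CatBoostEstimator
from sklearn.datasets import make_classification

class MyCatBoost(CatBoostEstimator):
    """Stand-in for the customized estimator that shows the problem."""
    pass

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

automl = AutoML()
automl.add_learner(learner_name="my_catboost", learner_class=MyCatBoost)
automl.fit(
    X_train=X,
    y_train=y,
    task="classification",
    time_budget=120,
    estimator_list=["my_catboost"],  # only the suspect estimator
    model_history=True,
)
print(automl.best_model_for_estimator("my_catboost"))
```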

flippercy commented 2 years ago

Hi @sonichi:

There is no problem if I just run automl() with the estimator causing the issue.

Moreover, my observations include:

  1. The issue is not data-specific. A coworker ran automl() on a different and much smaller dataset and hit the same issue.
  2. The issue is somewhat random. Later yesterday I re-ran automl() with fewer CPUs and somehow it worked without any problem. I'm not sure whether n_jobs was the reason or it was just luck.
  3. Perhaps it is related to some recent updates? I have personally used FLAML a lot for similar cases and never encountered this problem before.

Thank you.

flippercy commented 2 years ago

>> In addition, during the training I noticed some discrepancies in the output as shown below: [image: screenshot of the training console output] 'MonotonicCatboost' and 'MonotonicXgboostDart' are my customized classifiers. In this case, after training a monotonic catboost, shouldn't the program return the best performance of MonotonicCatboost instead of MonotonicXgboostDart?
>
> What's the search space for "MonotonicCatboost"? One possible reason is that the sampler fails to find a new config for MonotonicCatboost, so it is skipped. And for some reason some lines in the console log are missing.

The search space for MonotonicCatboost is quite big and should not be the reason for the discrepancy; more catboost models were built later in that search and the issue did not happen again. It is probably just a one-time glitch, but I wanted to let you know.

sonichi commented 2 years ago

> Hi @sonichi:
>
> There is no problem if I just run automl() with the estimator causing the issue.
>
> Moreover, my observations include:
>
> 1. The issue is not data-specific. A coworker ran automl() on a different and much smaller dataset and hit the same issue.
> 2. The issue is somewhat random. Later yesterday I re-ran automl() with fewer CPUs and somehow it worked without any problem. I'm not sure whether n_jobs was the reason or it was just luck.
> 3. Perhaps it is related to some recent updates? I have personally used FLAML a lot for similar cases and never encountered this problem before.
>
> Thank you.

Good that it's not data-specific. Bad that it's random. It happens to the custom estimator only, right?

flippercy commented 2 years ago

>> Hi @sonichi: There is no problem if I just run automl() with the estimator causing the issue. Moreover, my observations include:
>>
>> 1. The issue is not data-specific. A coworker ran automl() on a different and much smaller dataset and hit the same issue.
>> 2. The issue is somewhat random. Later yesterday I re-ran automl() with fewer CPUs and somehow it worked without any problem. I'm not sure whether n_jobs was the reason or it was just luck.
>> 3. Perhaps it is related to some recent updates? I have personally used FLAML a lot for similar cases and never encountered this problem before.
>>
>> Thank you.
>
> Good that it's not data-specific. Bad that it's random. It happens to the custom estimator only, right?

I am not sure, since we usually use customized estimators only; I might do some tests with the default estimators later.

sonichi commented 2 years ago

I haven't seen this issue before, so there might be some unexpected behavior in the custom estimators. Could you use log_type="all", check the logged results, and see if there is any anomaly?
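Something like this, where X and y are your training data; get_output_from_log and the printed fields follow the example notebooks, so take the exact import path as an assumption for your version:

```python
from flaml import AutoML
from flaml.data import get_output_from_log

automl = AutoML()
automl.fit(
    X_train=X,  # your training data
    y_train=y,
    task="classification",
    time_budget=600,
    log_file_name="flaml_run.log",
    log_type="all",  # log every trial, not only the improving ones
    model_history=True,
)

# Read the logged trials back and look for anomalies.
(
    time_history,
    best_loss_history,
    loss_history,
    config_history,
    metric_history,
) = get_output_from_log(filename="flaml_run.log", time_budget=600)
for record in config_history:
    print(record)  # the learner and hyperparameters tried at each logged step
```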

flippercy commented 2 years ago

@sonichi this issue has not happened since then, so I will close the thread for now. We can discuss it again if it reappears.

Thank you very much for the help!

TimSchim commented 2 years ago

I noticed this issue was closed, but I just encountered the same problem with a default estimator.

```python
[automl.best_model_for_estimator(e) for e in automl.estimator_list]
```

returns

```
[<flaml.model.LGBMEstimator at 0x260821d54c0>,
 <flaml.model.RandomForestEstimator at 0x260821f6fd0>,
 <flaml.model.CatBoostEstimator at 0x260821f66a0>,
 <flaml.model.XGBoostSklearnEstimator at 0x260821d5a00>,
 <flaml.model.ExtraTreesEstimator at 0x260821f6f10>,
 None]
```

where the last estimator should be xgb_limitdepth, which is in automl.estimator_list.

Since this did not happen for a colleague with the same estimators and data, it might in fact be somewhat random.
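A quick check along these lines, using the same public calls as above, lists which entries came back empty:

```python
missing = [
    name for name in automl.estimator_list
    if automl.best_model_for_estimator(name) is None
]
print(missing)  # e.g. ['xgb_limitdepth'] in the run described above
```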

flippercy commented 2 years ago

And I can confirm that this issue still exists for me.

TimSchim commented 2 years ago

Found the source of this for my case. It seems that xgb_limitdepth gets relatively few resources compared to the other estimators, which leads to no model being fitted for this estimator. Increasing the time_budget solved it for me.

Edit: This only reduces the chance of this problem.
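In other words, something like the following, where X, y, and the budget numbers are placeholders:

```python
automl.fit(
    X_train=X,
    y_train=y,
    task="classification",
    # Raising the budget (e.g. from 600s to 3600s) makes it more likely that
    # every entry in estimator_list gets at least one completed trial.
    time_budget=3600,
    estimator_list=["lgbm", "rf", "catboost", "xgboost", "extra_tree", "xgb_limitdepth"],
    model_history=True,
)
```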

sonichi commented 2 years ago

Thanks @TimSchim @flippercy. @TimSchim, there is no guarantee that every estimator in the estimator list gets trained within the time budget. One way to increase the priority of a particular estimator is to redefine the cost_relative2lgbm() function for the corresponding estimator class. The lower the cost, the higher the priority.
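A minimal sketch of that idea, assuming the estimator to prioritize is xgb_limitdepth (flaml.model.XGBoostLimitDepthEstimator); the subclass name, learner name, cost value, and data are illustrative only:

```python
from flaml import AutoML
from flaml.model import XGBoostLimitDepthEstimator

class PriorityXGBLimitDepth(XGBoostLimitDepthEstimator):
    @classmethod
    def cost_relative2lgbm(cls):
        # Estimated cost relative to LGBM; a lower value gives this learner
        # a higher priority in FLAML's estimator scheduling.
        return 1.0

automl = AutoML()
automl.add_learner(learner_name="xgb_limitdepth_hi", learner_class=PriorityXGBLimitDepth)
automl.fit(
    X_train=X,  # your training data
    y_train=y,
    task="classification",
    time_budget=600,
    estimator_list=["lgbm", "xgb_limitdepth_hi"],
    model_history=True,
)
```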

flippercy commented 1 year ago

Hi @sonichi:

It's been a while, and I hope you are doing well. I'm glad to see your package, FLAML, attracting so much attention; it already has almost 2k stars.

I have to follow up on this issue again because it has never really been solved and we have no clue what caused it. During the last six months, my team has been using several versions of FLAML with various datasets on different platforms, and we still occasionally hit this issue: a certain estimator is trained by FLAML but its best model is not saved in the output. Based on total_iter returned by _search_states.items(), the affected estimator had been trained many times (usually 50-100); however, best_model_for_estimator() still returned nothing but an empty object for it.
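For reference, the kind of check we run looks roughly like this (note that _search_states is an internal attribute, so it may differ between FLAML versions):

```python
for name, state in automl._search_states.items():
    print(
        name,
        "iterations:", state.total_iter,
        "best model:", automl.best_model_for_estimator(name),
    )
# The anomaly: an estimator whose total_iter is in the 50-100 range
# but whose best_model_for_estimator() result is still None.
```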

The only fix, in my experience, is simply to restart the session, change the random seed, and rerun the search. The issue is unrelated to the OS, the version of FLAML, or the size of the data used. I am not sure whether it is due to the customized estimators we use, but as I said, most of the time the process runs well without any problem; the issue only happens randomly. Moreover, the log file looks totally normal even when the issue happens. The reticulate setup might be a possible cause, because unlike most users here we use FLAML in R via reticulate, and it seems that very few people have experienced the same issue we did.

I am not sure whether we can find a way to replicate this error on your end for troubleshooting. Currently it is a bit annoying because it confuses new users and makes them suspicious of the process. Let me know if you have any suggestions.

Thank you.

Yu Cao

sonichi commented 1 year ago

Sorry for missing this for so long. Does the issue still exist, @flippercy?

flippercy commented 1 year ago

Hi @sonichi ! Thank you for the reply.

The issue still happens randomly, and it is hard to reproduce for debugging. However, one observation is that it has never happened (so far, at least) when we run FLAML in Python. Therefore, we suspect it may be due to something in reticulate when we call FLAML from R.

Since our modeling process, especially the AutoML component, is moving to Python in AzureML, we probably do not need to worry about it right now. Let's see whether it happens again in the new environment.

Appreciate your help!

sonichi commented 1 year ago

Thank you @flippercy. In case I miss your message again in the future, you can reach me and the other maintainers on Discord.