LGBM sometimes strongly extrapolates in a regression problem with huge feature set

DataDemystifier commented 2 years ago

Hey all,

currently, I am using LGBM in a project where I am facing a quite remarkable problem. It is a regression problem with a large featureset (partially due to 70 one-hot features). The total amount of features is ~ 200.

The situation I am facing is, that I have a feature that is outside of the typical bounds seen in the training data. The trained model goes crazy and more or less based on judging the loss directly overfits. When predicting, in a normal situation typical MAPE values are at max 100% or so. But in this case the model extrapolates very strongly and predicts almost 800% or even more. So really far away and exceed actually any value seen as a target in the training dataset. The validation loss shoots up more or less from the first boosting round onwards and keeps increasing.

Played around with the hyperparameters a bit but could no mitigate the problem.

Hence I would like to ask if such problems are common and if somebody knows a good mitigation strategy. An ideal Early Stopping for example would stop at boosting round 1 which does not seem like a right mitigation at all.

More generally: Is it known when LGBM gets problems with extrapolation? My hope was more thinking from a bagging perspective that extrapolation for a tree based model should happen only in extreme rare cases. But now I would like to understand why I see this on multiple occasions in my problem right now.

Any insights would be very valuable for me. Thanks a lot already!

Best regards!

jameslamb commented 2 years ago

Thanks for using LightGBM. Are you able to provide the details asked for in the template that was shown when you clicked "create issue", like:

version of LightGBM you are using and how you are building it
a reproducible example (code + publicly-available data) demonstrating the issue

Without information like that, you are asking maintainers here to just guess at what's going on, and I don't think that guessing is likely to lead you to a resolution.

Based on the information you've provided so far, the only thing I can say is that the MAPE objective in general (not just in LightGBM) can be unstable if you have target values that are less than 0 in absolute value. We try to account for that (see the discussion in #3608) but it can still present challenges.

Do you see issues with "extrapolating strongly" if you use a metric like regression_l1 (MAE) or regression_l2 (RMSE)?

DataDemystifier commented 2 years ago

Hello @jameslamb,

thanks a lot for your quick response. I am sorry, that I left a lot of uncertainty in my initial phrasing of the questions and hence would like to address the open points and clarify further so that the maintainers/developers of LGBM can get the bigger picture here. See therefore my description below according to the templates.

Description

As stated in the original first post, we are talking about a regression problem with a large featureset (partially due to 70 one-hot features). The total amount of features is ~ 200.

The situation we are facing is, that one of the total of 200 features shows the problematic behavior. We identified the feature by using SHAP value analysis.

I will try to give you two concrete numeric example here:

Example: During training, the respective feature sees the values: 1.29 up to 1.95. (these values are included in the training set). During validation/testing, the respective feature for the first time sees the value: 1.99 The model completely goes crazy in its output and overestimated heavily and the SHAP values, that the respective feature is driving the output almost entirely alone to those high values.
Example: During training, the respective feature sees only one value: 0.49 During validation/testing, the respective feature for the first time sees the value: 0.59 This situation is actually quite remarkable as I would hope that the feature itself has no meaning and information gain for the model, but yet again the model completely goes crazy in the output. The SHAP values again indicate, that this is only due to this one feature mainly.

Some hypothesis: Clearly, having data outside of the training regime in such high dimensional space is risky. Our current best guess is that the respective feature sort of compensates/offsets the whole prediction.

The question remaining, did you experience something like this? Is it common or even expected behavior and are there any tips/leads to mitigate the problem? It maybe isn't a bug I could imagine. Don't get me hence wrong here, I am not expecting any definitive answer, just maybe some interesting leads or explanations that might or might not support the hypothesis stated above.

Reproducible example

Unfortunately, in the project I am working in, LGBM is just one part of a huge architecture. The data is not publicly available and hence I cannot create an reproducible example here. I am sorry for that.

Environment info

Currently version 2.3.1 is used on the Azure Cloud. Docker and K8s is used. We are planning to soon bump the version but believe, that this is (hopefully) not the root-cause of the issue and that it may likely persist also with the newer versions. But that remains to be seen.

Additional Comments

My initial description regarding the configuration/loss was too unspecific. Actually we are using RMSE (regression_l2) as the loss function during training.

We are using also the standard set of hyperparameters, besides the following that were changed from the default values: objective: regression boosting: gbdt learning_rate: 0.01 max_depth: 5 bagging_fraction: 0.5 feature_fraction: 0.4 metric: RMSE num_threads: os.environ["NTHREAD"] num_leaves: 250 min_data_in_leaf: 5 bagging_freq: 5 num_boost_round: 1500

We played around with regularization and some other parameters to prevent overfitting, but our problems was not mitigated by this as it rather looks like an extrapolation issues due to the clear differences on train and test set (where a feature might reach out of the boundaries ever seen in the test set, see above description).

DataDemystifier commented 2 years ago

Just a quick sidenote: The newest stable release of LGBM was also tried out but did not mitigate the above stated issue.

shiyu1994 commented 2 years ago

@DataDemystifier Thanks for using LightGBM. Ideally, feature values in test set beyond the values in training set should not produce crazy large predictions. Because these feature values only determines which leaves in the decision trees a sample goes. The output value are just summation of leaf predictions values, which is saved in the model and unchanged after training. And a feature with a single value should be abandoned by LightGBM after preprocessing the data. So example 2 also surprised me. I have two quick questions:

Is the training loss normal during all iterations?
How far are the crazy outputs from the actual label values?

Thanks for the information.

DataDemystifier commented 2 years ago

@shiyu1994

thanks a lot for taking your time and answering!

Here are the answers to your two questions:

During training, the situation is fine, hence the train loss decreased nicely and monotonic. It is just the validation loss that increases from boosting round 1 onwards. Here is a plot of the train, val and test loss for example 2 stated in above post:
I'll give you some examples we are seeing: Normal Target Label -> Crazy Output of Model ~200 -> 3000-3400 ~500-1000 -> 5500-6500

There are various more showing the same pattern with different increases that reach clearly out of the range of typical label values. All have in common that the validation loss goes up from round 1 onwards and in validation or test there are values for the respective feature outside of the typical range and hence unseen in training yet.

shiyu1994 commented 2 years ago

@DataDemystifier Thanks for your example. Did you use distributed training? Is it possible to get the tree models from your pipeline? It would be super helpful if we can get the tree model and some instances in the validation set which shows abnormal prediction values.

DataDemystifier commented 2 years ago

@shiyu1994 thanks for your take. We don't use distributed training.

I am a bit unsure if I will be able to send you the tree and some validation set examples unfortunately.

But maybe you can outline some steps you would conduct for your analysis. A potential would maybe using the new explainability functionality as outline here: https://towardsdatascience.com/lightgbm-algorithm-an-end-to-end-review-on-the-trees-to-dataframe-method-13e8c4b74027

Hence using the trees_to_dataframe() and create_tree_digraph() methods to see how those high predictions are created could be a good option. Would you also do your analysis like this? What additional/different steps would you undertake?

Another question: Would you also classify the problem as a overfitting problem in general based on the given information so far? So for example another lead is probably L2 regularization then to check right?

If something is found, I would keep you then updated.

Thanks a lot and best regards!

microsoft / LightGBM