microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Training fails when bagging_freq > 1 and bagging_fraction is very small #6622

Open YovliDuvshani opened 2 weeks ago

YovliDuvshani commented 2 weeks ago

Hello,

We've recently encountered a problematic edge case with LightGBM. When using bagging while training on a single data point, model training fails. Our expectation was that the model would simply disregard any bagging mechanism.

While training a model on a single data point is surely questionable from an analytical point of view, we regularly train millions of models (with the same hyper-parameter set) and cannot guarantee that the number of training samples exceeds 1 for all of them.

Is there any rationale behind this behaviour? How would you recommend we best handle this?

Reproducible example

import pandas as pd
import lightgbm as lgbm

data = pd.DataFrame({"FEATURE_1": [0], "FEATURE_2": [1]})
label = pd.Series([1])
train_dataset = lgbm.Dataset(data=data, label=label)

params = {
    "seed": 1,
    "bagging_fraction": 0.5,
    "bagging_freq": 5,
}

lgbm.train(params=params, train_set=train_dataset)

Executing this code snippet leads to this error:

lightgbm.basic.LightGBMError: Check failed: (num_data) > (0)

But by setting bagging_fraction to 1, the model is correctly trained (and has a single leaf with output 1).
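
For reference, here is the same snippet with bagging_fraction set to 1 (everything else unchanged), which trains without error:

```python
import pandas as pd
import lightgbm as lgbm

data = pd.DataFrame({"FEATURE_1": [0], "FEATURE_2": [1]})
label = pd.Series([1])
train_dataset = lgbm.Dataset(data=data, label=label)

# identical parameters except bagging_fraction=1 -> training completes
params = {
    "seed": 1,
    "bagging_fraction": 1,
    "bagging_freq": 5,
}

lgbm.train(params=params, train_set=train_dataset)
```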

Environment info

python=3.10 pandas=2.2.2 lightgbm=4.5.0

Additional Comments

It seems like the error is raised when bagging_fraction * num_samples < 1
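
If that's right, a rough sketch of the arithmetic (assuming the bagged subset size is truncated to an integer) would be:

```python
# Assumed arithmetic behind the failure: the bagged subset size is
# roughly int(bagging_fraction * num_samples), which can reach 0.
num_samples = 1
bagging_fraction = 0.5

bagged_rows = int(bagging_fraction * num_samples)
print(bagged_rows)  # 0 -> nothing left to train on, matching "Check failed: (num_data) > (0)"
```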

jameslamb commented 2 weeks ago

Thanks for using LightGBM, and for taking the time to open an excellent report with a reproducible example! It really helped with the investigation.

Running your reproducible example with the latest development version of LightGBM, I see some logs that are helpful. Please consider including more logs in your reports in the future.

[LightGBM] [Warning] There are no meaningful features which satisfy the provided configuration. Decreasing Dataset parameters min_data_in_bin or min_data_in_leaf and re-constructing Dataset might resolve this warning.
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 1, number of used features: 0
[LightGBM] [Fatal] Check failed: (num_data) > (0) at /Users/jlamb/repos/LightGBM/lightgbm-python/src/io/dataset.cpp, line 39 

Even following that recommendation, though, I do see the same behavior you saw. I tried experimenting and found that I can reproduce this even with 1,000 samples!

import pandas as pd
import numpy as np
import lightgbm as lgb

num_samples = 1_000

for bagging_frac in [0.99, 0.75, 0.5, 0.4, 0.3, 0.2, 0.1, 1e-02, 1e-03, 1e-04, 1e-05, 1e-06]:
    try:
        bst = lgb.train(
            params={
                "seed": 1,
                "bagging_fraction": bagging_frac,
                "bagging_freq": 5,
                "verbose": -1
            },
            train_set=lgb.Dataset(
                data=pd.DataFrame({
                    "FEATURE_1": np.linspace(start=1.0, stop=100.0, num=num_samples),
                    "FEATURE_2": np.linspace(start=12.0, stop=25.0, num=num_samples),
                }),
                label=np.linspace(start=10.0, stop=80.0, num=num_samples),
            )
        )
        status = "success"
    except lgb.basic.LightGBMError:
        status = "fail"
    print(f"bagging_frac = {bagging_frac}: {status}")

# bagging_frac = 0.99: success
# bagging_frac = 0.75: success
# bagging_frac = 0.5: success
# bagging_frac = 0.4: success
# bagging_frac = 0.3: success
# bagging_frac = 0.2: success
# bagging_frac = 0.1: success
# bagging_frac = 0.01: success
# bagging_frac = 0.001: success
# bagging_frac = 0.0001: fail
# bagging_frac = 1e-05: fail
# bagging_frac = 1e-06: fail

Interestingly, if I remove bagging_freq, all of these cases pass.

So it looks to me like this check can be triggered under the following mix of conditions: bagging_freq > 0 (so bagging actually happens), and bagging_fraction small enough that bagging_fraction * num_data rounds down to 0 rows.

I tested that with an even bigger dataset... I can even trigger this failure for a dataset with 100,000 observations!!

num_samples = 100_000

bst = lgb.train(
    params={
        "seed": 1,
        "bagging_fraction": 0.1/num_samples,
        "bagging_freq": 5,
        "verbose": -1
    },
    train_set=lgb.Dataset(
        data=pd.DataFrame({
            "FEATURE_1": np.linspace(start=1.0, stop=100.0, num=num_samples),
            "FEATURE_2": np.linspace(start=12.0, stop=25.0, num=num_samples),
        }),
        label=np.linspace(start=10.0, stop=80.0, num=num_samples),
    )
)
# lightgbm.basic.LightGBMError: Check failed: (num_data) > (0) at /Users/jlamb/repos/LightGBM/lightgbm-python/src/io/dataset.cpp, line 39 .

This definitely looks like a bug to me, and not necessarily one that would only affect small datasets.

jameslamb commented 2 weeks ago

Some other minor notes...

we regularly train millions of models (with the same hyper-parameter set) and cannot guarantee that the number of training samples exceeds 1 for all of them.

Very interesting application! Can you share any more about the real-world reason(s) that you are training "millions of models" with the same hyper-parameters? I have some ideas about situations where that might happen, but knowing more precisely what you're trying to accomplish would help us to recommend alternatives.

For example, if this is some sort of consumer app generating predictions on user-specific data (like a fitness tracker), then training a LightGBM model is probably unnecessary for such a small amount of data (as you sort of mentioned), and you might want to fall back to something simpler when there are only a handful of samples, like predicting the average of the target.


I've updated your post to use the text of the error message you observed instead of an image, so it can be found from search engines by other people hitting that error. Please see https://meta.stackoverflow.com/questions/285551/why-should-i-not-upload-images-of-code-data-errors for more discussion of that practice.

YovliDuvshani commented 2 weeks ago

Thanks for the quick answer!

A few more words on our application:

From your experiment we observe as well that as long as num_data * bagging_fraction < 1 then the training runs through. If this assumption is correct, there's already a decent solution at hand for us: we can define the parameter bagging_fraction based on the number of training samples available, i.e. bagging_fraction = max(base_bagging_fraction, 1/num_samples), as sketched below. Still very much open to any alternatives you would deem more suitable.
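
As a sketch, that workaround could look like this (base_bagging_fraction is a placeholder for our standard hyper-parameter value):

```python
def safe_bagging_fraction(base_bagging_fraction: float, num_samples: int) -> float:
    # Guarantee num_samples * bagging_fraction >= 1 so the bagged subset
    # can never end up empty (hypothetical helper, not part of LightGBM).
    return max(base_bagging_fraction, 1.0 / num_samples)

params = {
    "seed": 1,
    "bagging_fraction": safe_bagging_fraction(0.5, num_samples=1),  # -> 1.0
    "bagging_freq": 5,
}
```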

jameslamb commented 2 weeks ago

A few more words on our application

This is very very interesting, thanks so much for the details! And thanks for choosing LightGBM for this important application, we'll do our best to support you 😊

if I remove bagging_freq, all of these cases pass.

I looked into this some more, and I realized I forgot something very important: bagging is only enabled if you set bagging_fraction < 1.0 AND bagging_freq > 0. That explains why bagging_freq was necessary to reproduce this behavior.

That's described at https://lightgbm.readthedocs.io/en/latest/Parameters.html#bagging_fraction.
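
As a quick sanity check (reusing the one-row snippet from the issue description): with bagging_freq left at its default of 0, bagging is disabled and training completes even though bagging_fraction < 1.0.

```python
import pandas as pd
import lightgbm as lgbm

train_dataset = lgbm.Dataset(
    data=pd.DataFrame({"FEATURE_1": [0], "FEATURE_2": [1]}),
    label=pd.Series([1]),
)

# bagging_freq defaults to 0, so bagging_fraction has no effect and training succeeds
lgbm.train(
    params={"seed": 1, "bagging_fraction": 0.5, "verbose": -1},
    train_set=train_dataset,
)
```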

as long as `num_data * bagging_fraction < 1` then the training runs through.

By "runs through", did you mean "fails"? Or did you maybe mean to use > instead of <?

I think that is what's happening here ... if you set bagging_fraction such that num_data * bagging_fraction < 1, this error will be triggered.

code that shows that:

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

def _attempt_to_train(num_samples):
    bagging_fraction = 0.99 / num_samples
    param_str = f"num_samples={num_samples}, bagging_frac={bagging_fraction}"
    try:
        bst = lgb.train(
            params={
                "seed": 1,
                "bagging_fraction": bagging_fraction,
                "bagging_freq": 1,
                "verbose": -1
            },
            train_set=lgb.Dataset(
                data=pd.DataFrame({
                    "FEATURE_1": np.linspace(start=1.0, stop=100.0, num=num_samples),
                    "FEATURE_2": np.linspace(start=12.0, stop=25.0, num=num_samples),
                }),
                label=np.linspace(start=10.0, stop=80.0, num=num_samples),
            )
        )
        print(f"success ({param_str})")
    except lgb.basic.LightGBMError:
        print(f"failure ({param_str})")

num_sample_vals = [
    1,
    2,
    100,
    1_000,
    10_000,
    100_000,
]

for n in num_sample_vals:
    _attempt_to_train(n)
```

This makes sense... you're asking LightGBM to do something impossible.

I think LightGBM's behavior in this situation should be changed in the following ways:

  1. set the number of samples in bagging to max(num_data * bagging_fraction, 1)
  2. issue an informative warning-level log message suggesting a different bagging_fraction value
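
A rough Python sketch of that proposed behavior (the real change would live in LightGBM's C++ sampling code; the function and message here are illustrative only):

```python
import warnings

def bagged_sample_size(num_data: int, bagging_fraction: float) -> int:
    # Illustrative only: never let the bagged subset shrink to 0 rows,
    # and warn so the user knows their bagging_fraction is effectively too small.
    requested = int(num_data * bagging_fraction)
    if requested >= 1:
        return requested
    warnings.warn(
        f"bagging_fraction={bagging_fraction} selects 0 of {num_data} rows; "
        "using 1 row instead. Consider a larger bagging_fraction."
    )
    return 1
```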

The case where you train on a single sample is unlikely to produce a particularly useful model; under LightGBM's defaults of min_data_in_leaf=20 and min_data_in_bin=3, the output will basically just be the average of the target. But having training produce some model in this situation would be consistent with how other similar situations are handled in LightGBM (e.g. when there are 0 informative features or 0 splits which satisfy min_gain_to_split).

there's already a decent solution at hand for us. We can define the parameter bagging_fraction based on the amount of training samples available

Yes, this is definitely a good idea! I didn't suggest it because your post included the constraint that you wanted to use identical hyperparameters for every model.

There are a few other parameters whose values you might want to change based on the number of samples, e.g. min_data_in_leaf and min_data_in_bin (mentioned in the warning above).
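
Purely as an illustration (the parameter names are LightGBM's, but the helper and thresholds are hypothetical), scaling those per item might look like:

```python
def params_for(num_samples: int) -> dict:
    # Hypothetical per-item parameter selection; values are arbitrary examples.
    return {
        "seed": 1,
        "bagging_fraction": max(0.5, 1.0 / num_samples),
        "bagging_freq": 5,
        "min_data_in_leaf": min(20, num_samples),
        "min_data_in_bin": min(3, num_samples),
    }
```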

You might find the discussion in #5194 relevant to this.

YovliDuvshani commented 2 weeks ago

Hi again!

By "runs through", did you mean "fails"? Or did you maybe mean to use > instead of <?

Sorry, that was a mistake; I actually meant: "If num_data * bagging_fraction >= 1 then the training succeeds".

Yes, this is definitely a good idea! I didn't suggest it because your post included the constraint that you wanted to use identical hyperparameters for every model.

Currently we use the same standard set for all items and would prefer not to make it parametrisable per item (for simplicity reasons), but it's not a hard limitation. I think we can accept making bagging_fraction parametrisable for this edge case, knowing it would have no impact in >99% of the cases.

Thanks again a lot for the help! :) No more questions coming from me.

jameslamb commented 2 weeks ago

Ok great, thanks for the excellent report and for sharing so much information with me!

We'll leave this open to track the work I suggested in https://github.com/microsoft/LightGBM/issues/6622#issuecomment-2314217758. Any interest in trying to contribute that? It'd require changes only on the C/C++ side of the project.

No worries if not, I'll have some time in the near future to attempt it.

YovliDuvshani commented 1 week ago

Sorry for the late answer. Unfortunately I have no experience with C/C++, so it would be challenging for me. I'll have to pass on that.

jameslamb commented 1 week ago

No problem! Thanks again for the great report and interesting discussion. We'll work on a fix for this.