Open YovliDuvshani opened 2 weeks ago
Thanks for using LightGBM, and for taking the time to open an excellent report with a reproducible example! It really helped with the investigation.
Running your reproducible example with the latest development version of LightGBM, I see some logs that are helpful. Please consider including more logs in your reports in the future.
```text
[LightGBM] [Warning] There are no meaningful features which satisfy the provided configuration. Decreasing Dataset parameters min_data_in_bin or min_data_in_leaf and re-constructing Dataset might resolve this warning.
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 1, number of used features: 0
[LightGBM] [Fatal] Check failed: (num_data) > (0) at /Users/jlamb/repos/LightGBM/lightgbm-python/src/io/dataset.cpp, line 39
```
Even following that recommendation, though, I do see the same behavior you saw. I tried experimenting and found that I can reproduce this even with 1,000 samples!
```python
import pandas as pd
import numpy as np
import lightgbm as lgb

num_samples = 1_000

for bagging_frac in [0.99, 0.75, 0.5, 0.4, 0.3, 0.2, 0.1, 1e-02, 1e-03, 1e-04, 1e-05, 1e-06]:
    try:
        bst = lgb.train(
            params={
                "seed": 1,
                "bagging_fraction": bagging_frac,
                "bagging_freq": 5,
                "verbose": -1
            },
            train_set=lgb.Dataset(
                data=pd.DataFrame({
                    "FEATURE_1": np.linspace(start=1.0, stop=100.0, num=num_samples),
                    "FEATURE_2": np.linspace(start=12.0, stop=25.0, num=num_samples),
                }),
                label=np.linspace(start=10.0, stop=80.0, num=num_samples),
            )
        )
        status = "success"
    except lgb.basic.LightGBMError:
        status = "fail"
    print(f"bagging_frac = {bagging_frac}: {status}")

# bagging_frac = 0.99: success
# bagging_frac = 0.75: success
# bagging_frac = 0.5: success
# bagging_frac = 0.4: success
# bagging_frac = 0.3: success
# bagging_frac = 0.2: success
# bagging_frac = 0.1: success
# bagging_frac = 0.01: success
# bagging_frac = 0.001: success
# bagging_frac = 0.0001: fail
# bagging_frac = 1e-05: fail
# bagging_frac = 1e-06: fail
```
Interestingly, if I remove `bagging_freq`, all of these cases pass.

So it looks to me like this check could be triggered under the following mix of conditions:

* `bagging_freq > 1`
* `bagging_fraction < 0.1 / num_data`
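If we assume the bagged row count is computed roughly as `num_data * bagging_fraction` truncated to an integer (an assumption about the internals, not confirmed from the LightGBM source), the success/fail pattern above lines up exactly with whether that count truncates to zero:

```python
num_samples = 1_000

# hypothetical arithmetic: rows kept per bagging iteration, if truncated
for frac in [1e-02, 1e-03, 1e-04, 1e-05, 1e-06]:
    sampled = int(num_samples * frac)
    print(f"bagging_frac = {frac}: ~{sampled} rows sampled")

# bagging_frac = 0.01: ~10 rows sampled
# bagging_frac = 0.001: ~1 rows sampled
# bagging_frac = 0.0001: ~0 rows sampled   <- the first failing case above
# bagging_frac = 1e-05: ~0 rows sampled
# bagging_frac = 1e-06: ~0 rows sampled
```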
I tested that with an even bigger dataset... I can trigger this failure even for a dataset with 100,000 observations!
```python
num_samples = 100_000

bst = lgb.train(
    params={
        "seed": 1,
        "bagging_fraction": 0.1 / num_samples,
        "bagging_freq": 5,
        "verbose": -1
    },
    train_set=lgb.Dataset(
        data=pd.DataFrame({
            "FEATURE_1": np.linspace(start=1.0, stop=100.0, num=num_samples),
            "FEATURE_2": np.linspace(start=12.0, stop=25.0, num=num_samples),
        }),
        label=np.linspace(start=10.0, stop=80.0, num=num_samples),
    )
)
# lightgbm.basic.LightGBMError: Check failed: (num_data) > (0) at /Users/jlamb/repos/LightGBM/lightgbm-python/src/io/dataset.cpp, line 39 .
```
This definitely looks like a bug to me, and not necessarily one that would only affect small datasets.
Some other minor notes...
> we regularly train millions of models (with the same hyper-parameter set) and cannot guarantee that the amount of training samples exceeds 1 for all of them.
Very interesting application! Can you share any more about the real-world reason(s) that you are training "millions of models" with the same hyper-parameters? I have some ideas about situations where that might happen, but knowing more precisely what you're trying to accomplish would help us to recommend alternatives.
For example, if this is some sort of consumer app generating predictions on user-specific data (like a fitness tracker), then training a LightGBM model is probably unnecessary for such a small amount of data (as you sort of mentioned), and you might want to do something else when there is a small amount of data, like falling back to a simpler rule such as predicting the average of the target.
I've updated your post to use the text of the error message you observed instead of an image, so that other people hitting that error can find it via search engines. Please see https://meta.stackoverflow.com/questions/285551/why-should-i-not-upload-images-of-code-data-errors for more discussion of that practice.
Thanks for the quick answer!
A few more words on our application:
From your experiment we observe as well that as long as `num_data * bagging_fraction < 1` then the training runs through.

If this assumption is correct, there's already a decent solution at hand for us. We can define the parameter `bagging_fraction` based on the amount of training samples available in the following way: `bagging_fraction = max(base_bagging_fraction, 1 / num_samples)`. Still very much open to any alternatives you would deem more suitable.
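That adjustment can be wrapped in a small helper (a sketch of the workaround described above; the helper name and `base_bagging_fraction` argument are ours, not part of the LightGBM API):

```python
def safe_bagging_fraction(base_bagging_fraction: float, num_samples: int) -> float:
    """Raise bagging_fraction just enough that at least one row survives bagging."""
    return max(base_bagging_fraction, 1.0 / num_samples)

# with very few samples, the fraction is bumped up; otherwise it is unchanged
print(safe_bagging_fraction(0.001, 10))     # 0.1
print(safe_bagging_fraction(0.5, 1_000))    # 0.5
```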
> A few more words on our application
This is very very interesting, thanks so much for the details! And thanks for choosing LightGBM for this important application, we'll do our best to support you 😊
> if I remove `bagging_freq`, all of these cases pass.
I looked into this some more, and I realized I forgot something very important... bagging is only enabled if you set `bagging_fraction < 1.0` AND `bagging_freq > 0`. That explains why `bagging_freq` was necessary to reproduce this behavior.
That's described at https://lightgbm.readthedocs.io/en/latest/Parameters.html#bagging_fraction.
> as long as `num_data * bagging_fraction < 1` then the training runs through.

By "runs through", did you mean "fails"? Or did you maybe mean to use `>` instead of `<`?
I think that is what's happening here... if you set `bagging_fraction` such that `num_data * bagging_fraction < 1`, this error will be triggered.
This makes sense... you're asking LightGBM to do something impossible.
I think LightGBM's behavior in this situation should be changed in the following ways:

* when bagging is enabled, use `max(num_data * bagging_fraction, 1)` as the bagged sample size instead of failing
* warn about the too-small `bagging_fraction` value

The case where you train on a single sample is unlikely to produce a particularly useful model, and under LightGBM's default settings of `min_data_in_leaf=20` and `min_data_in_bin=3`, it'll basically just be the average of the target. But having training produce some model in this situation would be consistent with how other similar situations are handled in LightGBM (e.g. when there are 0 informative features or 0 splits which satisfy `min_gain_to_split`).
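The proposed change amounts to clamping the bagged sample size at a minimum of one row. In Python terms (the real fix would live in the C/C++ sampling code; this is just an illustration of the arithmetic, with a function name of our choosing):

```python
def bagged_sample_size(num_data: int, bagging_fraction: float) -> int:
    # clamp so that an extremely small bagging_fraction can never
    # produce an empty bag (the condition behind the "Check failed" error)
    return max(int(num_data * bagging_fraction), 1)

print(bagged_sample_size(1_000, 0.5))     # 500
print(bagged_sample_size(1_000, 1e-04))   # 1 instead of 0
```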
> there's already a decent solution at hand for us. We can define the parameter `bagging_fraction` based on the amount of training samples available
Yes, this is definitely a good idea! I didn't suggest it because your post included the constraint that you wanted to use identical hyperparameters for every model.
There are a few other parameters whose values you might want to change based on the number of samples:

* `min_data_in_bin`
* `min_data_in_leaf`
* `min_data_per_group` (only relevant for categorical variables)

You might find the discussion in #5194 relevant to this.
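One way to combine these suggestions, assuming `num_samples` is known before training (the helper and its `num_samples < 20` threshold are illustrative choices of ours, not an official recommendation):

```python
def params_for_small_data(base_params: dict, num_samples: int) -> dict:
    """Return a copy of base_params with data-size-sensitive values relaxed."""
    params = dict(base_params)
    # keep at least one row in each bag
    if "bagging_fraction" in params:
        params["bagging_fraction"] = max(params["bagging_fraction"], 1.0 / num_samples)
    # relax minimum-count constraints when there are very few rows
    if num_samples < 20:
        params["min_data_in_leaf"] = 1
        params["min_data_in_bin"] = 1
    return params

base = {"seed": 1, "bagging_fraction": 0.001, "bagging_freq": 5}
print(params_for_small_data(base, 10))
# {'seed': 1, 'bagging_fraction': 0.1, 'bagging_freq': 5, 'min_data_in_leaf': 1, 'min_data_in_bin': 1}
```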
Hi again!
> By "runs through", did you mean "fails"? Or did you maybe mean to use `>` instead of `<`?
Sorry, that was a mistake. I actually meant: "If `num_data * bagging_fraction >= 1`, then the training succeeds".
> Yes, this is definitely a good idea! I didn't suggest it because your post included the constraint that you wanted to use identical hyperparameters for every model.
Currently, we do use the same standard set for all items and would prefer not to make it parametrisable per item (for simplicity reasons), but it's not a hard limitation. I think we can accept making `bagging_fraction` parametrisable for this edge case, knowing it would have no impact in >99% of the cases.
Thanks again a lot for the help! :) No more questions coming from me.
Ok great, thanks for the excellent report and for sharing so much information with me!
We'll leave this open to track the work I suggested in https://github.com/microsoft/LightGBM/issues/6622#issuecomment-2314217758. Any interest in trying to contribute that? It'd require changes only on the C/C++ side of the project.
No worries if not, I'll have some time in the near future to attempt it.
Sorry for the late answer; unfortunately, I have no experience with C/C++, so it would be challenging for me. I'll have to pass on that.
No problem! Thanks again for the great report and interesting discussion. We'll work on a fix for this.
Hello,
We've recently encountered a problematic edge case with LightGBM. When simultaneously using bagging and training on a single data point, the model training fails. Our expectation would have been that the model disregards any bagging mechanism.
While training a model on a single data point is surely questionable from an analytical point of view, we regularly train millions of models (with the same hyper-parameter set) and cannot guarantee that the number of training samples exceeds 1 for all of them.
Is there any rationale behind this behaviour? How would you recommend we best handle it?
Reproducible example
Executing this code snippet leads to this error:
But by setting bagging_fraction to 1, the model is correctly trained (and has a single leaf with output 1).
Environment info
python=3.10 pandas=2.2.2 lightgbm=4.5.0
Additional Comments
It seems like the error is raised when `bagging_fraction * num_samples < 1`.