Open borisRa opened 2 years ago
@borisRa Thanks for using LightGBM! Does the Cannot change linear_tree after constructed Dataset handle error occur together with the segmentation fault? Is it possible to provide us with a minimal reproducible example? That would be very helpful.
Same here; simply creating a new dataset and then enabling linear_tree with goss causes a segfault.
Here are the test parameters:
params = {
    'verbose': -1,
    'objective': 'binary',
    'seed': 0,
    'max_depth': 6,
    'num_leaves': 63,
    'learning_rate': 0.08,
    'min_data_in_leaf': 100,
    'feature_freq': 1,
    'feature_fraction': 0.5,
    'lambda_l1': 0.25,
    'lambda_l2': 0.25,
    'linear_tree': True,
    'boosting': 'goss',
}
@StrikerRUS Maybe it's due to .subset(); here's a minimal example that segfaults:
import lightgbm as lgb
import numpy as np
np.random.seed(0)
X, y = np.random.random((100, 10)), np.random.random(100) > 0.5
lds = lgb.Dataset(X, y)
lds = lds.subset((np.random.random(100) > 0.5).nonzero()[0]) # <---
lgb.train({'boosting': 'goss', 'linear_tree': True}, lds, 1)
Could this be marked as a confirmed bug then?
@borisRa Thanks for using LightGBM! Does the
Cannot change linear_tree after constructed Dataset handle
error occur together with the segmentation fault? Is it possible to provide us with a minimal reproducible example? That would be very helpful.
No, it doesn't occur together.
Using the code below with Optuna, the kernel is killed with this message: Segmentation fault (core dumped).
This is my code:
def objective(trial):
    param_grid = {
        "n_jobs": trial.suggest_categorical('n_jobs', [-1]),
        "boosting_type": trial.suggest_categorical('boosting_type', ['gbdt']),
        "linear_tree": trial.suggest_categorical('linear_tree', [True]),
        "objective": trial.suggest_categorical('objective', ['tweedie']),
        "metric": trial.suggest_categorical('metric', [['mae', 'rmse', 'tweedie', 'mape']]),
        "n_estimators": trial.suggest_int("n_estimators", 100, 2000, log=True),
        "learning_rate": trial.suggest_loguniform("learning_rate", 0.01, 0.3),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "num_leaves": trial.suggest_int("num_leaves", 20, 3000, log=True),
        "lambda_l1": trial.suggest_int("lambda_l1", 0, 100, step=5),
        "lambda_l2": trial.suggest_int("lambda_l2", 0, 100, step=5),
        "min_gain_to_split": trial.suggest_float("min_gain_to_split", 0, 15, step=1),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.2, 1.0, step=0.1),
        "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.2, 1.0, step=0.1),
        "verbosity": -1,
    }

    # Adjust per 'objective'
    if losses_function_optimize.lower() == 'tweedie':
        param_grid['tweedie_variance_power'] = trial.suggest_float("tweedie_variance_power", 1.01, 1.99, step=0.1)
    if param_grid["linear_tree"] == True:
        param_grid['linear_lambda'] = trial.suggest_float("linear_lambda", 0.01, 10.0, log=True)
        param_grid.pop("lambda_l1")

    # LightGBM metrics: https://lightgbm.readthedocs.io/en/latest/Parameters.html#metric
    pruning_callback = optuna.integration.LightGBMPruningCallback(trial=trial, metric=losses_function_optimize.lower())

    lgbm_model = lgbm.train(params=param_grid, train_set=train_data_Dataset, valid_sets=[valid_data_Dataset], callbacks=[pruning_callback], verbose_eval=-1)
    preds = lgbm_model.predict(valid_data[x_columns])
    performance_metric = function_to_estimate_performance(y_true=valid_data[y_label], y_pred=preds)
    return performance_metric
#############################################################
#Set data in LightGBM format
#############################################################
train_data_Dataset = lgbm.Dataset(train_data[x_columns], label=train_data[y_label])
train_data_Dataset.save_binary('train_data.bin')
train_data_Dataset = lgbm.Dataset('train_data.bin')
valid_data_Dataset = lgbm.Dataset(valid_data[x_columns] , label = valid_data[y_label] )
#############################################################
#Create a study object and optimize the objective function
#############################################################
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=hyper_parameter_max_evals)
Hi, any update?
@borisRa
Sorry for the late response. For the error Cannot change linear_tree after constructed Dataset handle, you need to reconstruct the Dataset for each set of hyperparameters before calling train on it. When a lgbm.Dataset is passed to the training method, LightGBM does some internal preprocessing of the lgbm.Dataset, and some of that preprocessing depends on related hyperparameters, for example max_bin, linear_tree, etc.
I'll find some time to investigate the segmentation fault in the coming days.
Thanks again for reporting the issue.
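For illustration, here is a rough, untested sketch of what that could look like with the snippet above, rebuilding the Dataset inside the Optuna objective on every trial. It reuses the variable names from your code, and passing the trial's parameters to the Dataset constructor is just one possible way to apply them at construction time:
import lightgbm as lgbm

def objective(trial):
    param_grid = {
        "linear_tree": trial.suggest_categorical('linear_tree', [True, False]),
        # ... the rest of the search space from the snippet above ...
    }
    # Rebuild the Dataset on every trial, so that Dataset-level parameters
    # such as linear_tree and max_bin are applied when it is constructed,
    # instead of reusing a Dataset that was already constructed with
    # different parameters.
    train_set = lgbm.Dataset(train_data[x_columns], label=train_data[y_label], params=param_grid)
    valid_set = lgbm.Dataset(valid_data[x_columns], label=valid_data[y_label], reference=train_set)
    lgbm_model = lgbm.train(params=param_grid, train_set=train_set, valid_sets=[valid_set])
    preds = lgbm_model.predict(valid_data[x_columns])
    return function_to_estimate_performance(y_true=valid_data[y_label], y_pred=preds)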
Hi,
When is this bug going to be fixed?
Thanks, Boris
Please do not leave comments in this project like "any update?" or "when will this be fixed?". See the description at https://github.com/golang/go/wiki/NoPlusOne/ for an explanation of how this is damaging to open source projects.
The project's source code is all freely available and you are welcome to investigate this yourself and report your findings. If you are not able or willing to do that, then please wait patiently for a maintainer or other community member to put effort into resolving this.
Thanks for letting me know
I looked into this and @aldanor is correct: the problem is caused by the subset code. It's not related to the goss parameter; the segfault still occurs when that parameter is removed from @aldanor's example, as follows:
import lightgbm as lgb
import numpy as np
np.random.seed(0)
X, y = np.random.random((100, 10)), np.random.random(100) > 0.5
lds = lgb.Dataset(X, y)
lds = lds.subset((np.random.random(100) > 0.5).nonzero()[0]) # <---
lgb.train({'linear_tree': True}, lds, 1)
The issue is as follows. When train is called, the constructor of the Booster calls Dataset.construct (this is the first call to Dataset.construct). Since we are constructing a subset, the code calls Dataset._update_params to update the parameters from the reference (i.e. superset) Dataset. But since the reference Dataset's parameters are empty, no update takes place. So the subset Dataset has linear_tree=True but the reference parameter set is empty. Then the reference (superset) Dataset gets constructed, but since the linear_tree parameter is not set for the reference, it does not load the raw feature data required for growing linear trees. The subset Dataset is constructed by copying from the data of the reference Dataset, but this segfaults when it tries to copy the raw feature data, which does not exist.
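If this reading is correct, setting the parameter on the reference (superset) Dataset up front should keep the raw feature data around when the subset is built. I haven't verified that this actually avoids the crash, but the idea would be:
import lightgbm as lgb
import numpy as np
np.random.seed(0)
X, y = np.random.random((100, 10)), np.random.random(100) > 0.5
# Give the reference Dataset the linear_tree parameter before subsetting,
# so it loads the raw feature data that the subset will later try to copy.
lds = lgb.Dataset(X, y, params={'linear_tree': True})
lds = lds.subset((np.random.random(100) > 0.5).nonzero()[0])
lgb.train({'linear_tree': True}, lds, 1)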
The same type of issue also occurs with other parameters. Although in this case there is no segfault, the example below shows that the max_bin parameter is ignored when constructing a subset Dataset.
import lightgbm as lgb
import numpy as np
np.random.seed(0)
X, y = np.random.random((100, 10)), np.random.random(100) > 0.5
subset_ind = (np.random.random(100) > 0.5).nonzero()[0]
# train on subsetted data: max_bin is ignored and prediction has 93 unique values
lds = lgb.Dataset(X, y)
lds = lds.subset(subset_ind)
est = lgb.train({'max_bin': 2, 'verbose': -1, 'seed': 0}, lds, num_boost_round=100)
preds = est.predict(X)
print(len(np.unique(preds)))
# construct subset manually: max_bin is used, and prediction has 1 unique value
lds = lgb.Dataset(X[subset_ind], y[subset_ind])
est = lgb.train({'max_bin': 2, 'verbose': -1, 'seed': 0}, lds, num_boost_round=100)
preds = est.predict(X)
print(len(np.unique(preds)))
I can see a couple of options for resolving this, but neither seems entirely satisfactory.
- Change the code so that the reference dataset parameters always override the subset parameters, even if the reference dataset parameters are empty. This doesn't seem like a good solution, because it causes the linear_tree parameter to be disregarded, which is not the expected behaviour in the example above. (This would also cause problems with lgb.cv, which uses the subset method.)
- Change the code so that the subset parameters always override the reference dataset parameters if the reference dataset has an empty parameter set and is not yet constructed. This seems like a better solution, but I think the behaviour would still be somewhat confusing to users, since the logic about which set of parameters gets used under which conditions is not obvious.
Thanks!
So how do you recommend proceeding to be able to use **linear_tree=True**? Maybe raise a warning about which parameters aren't being used during training?
@borisRa As I said above, I'm not sure what the best way is to fix this in the LightGBM code. I'll wait to hear from other maintainers on this. As a workaround, I think you can use linear_tree=True provided you don't attempt to train on subset Datasets. I don't know whether this will work with Optuna (since Optuna may be using the subset functionality internally).
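Concretely, a minimal sketch of that workaround is to index the raw data yourself instead of calling .subset(), mirroring the "construct subset manually" case from the max_bin example above:
import lightgbm as lgb
import numpy as np
np.random.seed(0)
X, y = np.random.random((100, 10)), np.random.random(100) > 0.5
subset_ind = (np.random.random(100) > 0.5).nonzero()[0]
# Build a fresh Dataset from the indexed arrays rather than using .subset(),
# so the linear_tree parameter is applied when this Dataset is constructed.
lds = lgb.Dataset(X[subset_ind], y[subset_ind])
lgb.train({'linear_tree': True}, lds, 1)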
Description
Hi,
Working with the parameter linear_tree = True, the IPython kernel crashes with this message:
Segmentation fault (core dumped)
And when working with Optuna, with linear_tree as a parameter like this:
"linear_tree": trial.suggest_categorical('linear_tree', [True, False])
I get this error:
Cannot change linear_tree after constructed Dataset handle
Environment info
LightGBM version: 3.3.2
Please assist, Boris