microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Segmentation fault (core dumped) & Cannot change linear_tree after constructed Dataset handle #5056

Open borisRa opened 2 years ago

borisRa commented 2 years ago

Description

Hi,

Working with the parameter: linear_tree = True

The IPython kernel crashes with this message: Segmentation fault (core dumped)

And when working with Optuna, where linear_tree is a parameter like this: "linear_tree": trial.suggest_categorical('linear_tree', [True, False])

I get this error: Cannot change linear_tree after constructed Dataset handle

Environment info

LightGBM version : 3.3.2

Please assist, Boris

shiyu1994 commented 2 years ago

@borisRa Thanks for using LightGBM! Does the Cannot change linear_tree after constructed Dataset handle error occur together with the segmentation fault? Is it possible to provide us with a minimal reproducible example? That would be very helpful.

aldanor commented 2 years ago

Same here; simply creating a new dataset and then enabling linear_tree with goss causes a segfault.

Here are the test parameters:

params = {
    'verbose': -1,
    'objective': 'binary',
    'seed': 0,
    'max_depth': 6,
    'num_leaves': 63,
    'learning_rate': 0.08,
    'min_data_in_leaf': 100,
    'feature_freq': 1,
    'feature_fraction': 0.5,
    'lambda_l1': 0.25,
    'lambda_l2': 0.25,
    'linear_tree': True,
    'boosting': 'goss',
}
aldanor commented 2 years ago

@StrikerRUS Maybe it's due to .subset(), here's a minimal example that segfaults:

import lightgbm as lgb
import numpy as np

np.random.seed(0)
X, y = np.random.random((100, 10)), np.random.random(100) > 0.5
lds = lgb.Dataset(X, y)
lds = lds.subset((np.random.random(100) > 0.5).nonzero()[0])  # <---
lgb.train({'boosting': 'goss', 'linear_tree': True}, lds, 1)

Could this be marked as a confirmed bug then?

borisRa commented 2 years ago

> @borisRa Thanks for using LightGBM! Does the Cannot change linear_tree after constructed Dataset handle error occur together with the segmentation fault? Is it possible to provide us with a minimal reproducible example? That would be very helpful.

No, it doesn't occur together. When using the code below with Optuna, the kernel is killed and I get this message: Segmentation fault (core dumped).

This is my code:

    def objective(trial):

        param_grid = {

            "n_jobs": trial.suggest_categorical('n_jobs', [-1]),
            "boosting_type" : trial.suggest_categorical('boosting_type', ['gbdt']),

            "linear_tree" : trial.suggest_categorical('linear_tree', [True]),

            "objective" : trial.suggest_categorical('objective', ['tweedie']),
            "metric":  trial.suggest_categorical('metric', [['mae', 'rmse', 'tweedie', 'mape']]), 

            "n_estimators": trial.suggest_int("n_estimators", 100, 2000,  log =True),

            "learning_rate": trial.suggest_loguniform("learning_rate", 0.01, 0.3),
            "max_depth": trial.suggest_int("max_depth", 3, 12), 

            "num_leaves": trial.suggest_int("num_leaves", 20, 3000, log=True),

            "lambda_l1": trial.suggest_int("lambda_l1", 0, 100, step=5), 
            "lambda_l2": trial.suggest_int("lambda_l2", 0, 100, step=5), 
            "min_gain_to_split": trial.suggest_float("min_gain_to_split", 0, 15,step=1),

            "bagging_fraction": trial.suggest_float("bagging_fraction", 0.2, 1.0, step=0.1  ),
            "bagging_freq": trial.suggest_int("bagging_freq", 1, 7  ), 
            "feature_fraction": trial.suggest_float("feature_fraction", 0.2, 1.0, step=0.1  ),
            "verbosity" : -1 
            }

        #Adjust per 'objective'     
        if  losses_function_optimize.lower() =='tweedie':
            param_grid['tweedie_variance_power'] =  trial.suggest_float("tweedie_variance_power", 1.01, 1.99, step=0.1)

        if param_grid["linear_tree"] == True:

            param_grid['linear_lambda'] =  trial.suggest_float("linear_lambda", 0.01, 10.0, log= True)                  
            param_grid.pop("lambda_l1")

        pruning_callback = optuna.integration.LightGBMPruningCallback(trial = trial, metric = losses_function_optimize.lower()) #LightGBM metrics : https://lightgbm.readthedocs.io/en/latest/Parameters.html#metric

        lgbm_model = lgbm.train(params = param_grid, train_set = train_data_Dataset, valid_sets =[valid_data_Dataset], callbacks=[pruning_callback] , verbose_eval= -1 )

        preds = lgbm_model.predict(valid_data[x_columns])
        performance_metric = function_to_estimate_performance(y_true = valid_data[y_label], y_pred = preds)

        return(performance_metric)

    #############################################################   
    #Set data in LightGBM format
    #############################################################   

    train_data_Dataset = lgbm.Dataset(train_data[x_columns],   label=train_data[y_label]) 
    train_data_Dataset.save_binary('train_data.bin') 
    train_data_Dataset = lgbm.Dataset('train_data.bin') 

    valid_data_Dataset = lgbm.Dataset(valid_data[x_columns] , label = valid_data[y_label] )

    #############################################################   
    #Create a study object and optimize the objective function
    #############################################################   

    study = optuna.create_study(direction="minimize")   
    study.optimize(objective, n_trials=hyper_parameter_max_evals)   
borisRa commented 2 years ago

Hi, any update?

shiyu1994 commented 2 years ago

@borisRa Sorry for the late response. For the error Cannot change linear_tree after constructed Dataset handle, you need to reconstruct the dataset for each set of hyperparameters before calling train on it. When a lgbm.Dataset is passed to the training method, LightGBM does some internal preprocessing of the lgbm.Dataset, and some of that preprocessing depends on related hyperparameters, for example max_bin, linear_tree, etc.
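
For example, here is a minimal sketch of that pattern (using synthetic data and a toy search space; the names, shapes, and parameter ranges are only placeholders, not your actual setup):

```python
import lightgbm as lgb
import numpy as np
import optuna

np.random.seed(0)
X, y = np.random.random((500, 10)), np.random.random(500)
X_val, y_val = np.random.random((200, 10)), np.random.random(200)

def objective(trial):
    params = {
        "objective": "regression",
        "verbosity": -1,
        "linear_tree": trial.suggest_categorical("linear_tree", [True, False]),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
    }
    # Rebuild the Dataset from the raw data on every trial, so that
    # construction-time parameters (linear_tree, max_bin, ...) are applied
    # when the Dataset is constructed rather than changed afterwards.
    train_set = lgb.Dataset(X, label=y, params=params)
    valid_set = lgb.Dataset(X_val, label=y_val, reference=train_set)
    booster = lgb.train(params, train_set, num_boost_round=50, valid_sets=[valid_set])
    preds = booster.predict(X_val)
    return float(np.mean(np.abs(preds - y_val)))

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=5)
```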

I'll find some time to investigate the segmentation fault in the coming days.

Thanks again for reporting the issue.

borisRa commented 2 years ago

Hi,

When is this bug going to be fixed?

Thanks, Boris

jameslamb commented 2 years ago

Please do not leave comments in this project like "any update?" or "when will this be fixed?". See the description at https://github.com/golang/go/wiki/NoPlusOne/ for an explanation of how this is damaging to open source projects.

The project's source code is all freely available and you are welcome to investigate this yourself and report your findings. If you are not able or willing to do that, then please wait patiently for a maintainer or other community member to put effort into resolving this.

borisRa commented 2 years ago

Thanks for letting me know

btrotta commented 2 years ago

I looked into this and @aldanor is correct: the problem is caused by the subset code. It's not related to the goss parameter; the segfault still occurs when that parameter is removed from @aldanor's example, as follows:

import lightgbm as lgb
import numpy as np

np.random.seed(0)
X, y = np.random.random((100, 10)), np.random.random(100) > 0.5
lds = lgb.Dataset(X, y)
lds = lds.subset((np.random.random(100) > 0.5).nonzero()[0])  # <---
lgb.train({'linear_tree': True}, lds, 1)

The issue is as follows. When train is called, the constructor of the Booster calls Dataset.construct (this is the first call to Dataset.construct). Since we are constructing a subset, the code calls Dataset._update_params to update the parameters from the reference (i.e. superset) Dataset. But since the reference Dataset's parameters are empty, no update takes place. So the subset dataset has linear_tree=True but the reference parameter set is empty. Then the reference (superset) dataset gets constructed, but since the linear_tree parameter is not set for the reference, it does not load the raw feature data required for growing linear trees. The subset Dataset is constructed by copying from the data of the reference dataset, but this segfaults when it tries to copy the raw feature data, which does not exist.

The same type of issue also occurs with other parameters. Although in this case there is no segfault, the example below shows that the max_bin parameter is ignored when constructing a subset dataset.

import lightgbm as lgb
import numpy as np

np.random.seed(0)
X, y = np.random.random((100, 10)), np.random.random(100) > 0.5
subset_ind = (np.random.random(100) > 0.5).nonzero()[0]
# train on subsetted data: max_bin is ignored and prediction has 93 unique values
lds = lgb.Dataset(X, y)
lds = lds.subset(subset_ind)
est = lgb.train({'max_bin': 2, 'verbose': -1, 'seed': 0}, lds, num_boost_round=100)
preds = est.predict(X)
print(len(np.unique(preds)))
# construct subset manually: max_bin is used, and prediction has 1 unique value
lds = lgb.Dataset(X[subset_ind], y[subset_ind])
est = lgb.train({'max_bin': 2, 'verbose': -1, 'seed': 0}, lds, num_boost_round=100)
preds = est.predict(X)
print(len(np.unique(preds)))

I can see a couple of options for resolving this, but neither seems entirely satisfactory.

  1. Change the code so that the reference dataset parameters always override the subset parameters, even if the reference dataset parameters are empty. This doesn't seem like a good solution, because it causes the linear_tree parameter to be disregarded, which is not the expected behaviour in the example above. (This would also cause problems with lgb.cv which uses the subset method.)
  2. Change the code so that the subset parameters always override the reference dataset parameters if the reference dataset has empty parameter set and is not yet constructed. This seems like a better solution, but I think the behaviour would still be somewhat confusing to users since the logic about which set of parameters gets used under which conditions is not obvious.
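
If this analysis is correct, one possible user-side workaround (an untested sketch, not a confirmed fix) would be to give the reference Dataset the relevant construction-time parameters up front via the params argument of lgb.Dataset, so that the raw feature data is retained when the reference is constructed:

```python
import lightgbm as lgb
import numpy as np

np.random.seed(0)
X, y = np.random.random((100, 10)), np.random.random(100) > 0.5
# Pass the same construction-time parameters to the reference (superset)
# Dataset, so it should keep the raw feature values that the subset copy
# needs for linear trees.
params = {'linear_tree': True, 'verbose': -1}
lds = lgb.Dataset(X, y, params=params)
lds = lds.subset((np.random.random(100) > 0.5).nonzero()[0])
lgb.train(params, lds, 1)
```
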
borisRa commented 2 years ago

Thanks! So how do you recommend proceeding so that I can use **linear_tree=True**? Maybe raise a warning about which parameters aren't being used during training?

btrotta commented 2 years ago

@borisRa As I said above, I'm not sure what the best way is to fix this in the LightGBM code. I'll wait to hear from other maintainers on this. As a workaround, I think you can use linear_tree=True provided you don't attempt to train on subset Datasets. I don't know whether this will work with Optuna (since Optuna may be using the subset functionality internally).
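
For instance, here is a minimal sketch of that workaround (same synthetic data as the examples above), slicing the raw arrays and building a fresh Dataset instead of calling Dataset.subset():

```python
import lightgbm as lgb
import numpy as np

np.random.seed(0)
X, y = np.random.random((100, 10)), np.random.random(100) > 0.5
subset_ind = (np.random.random(100) > 0.5).nonzero()[0]

# Build the training Dataset directly from the sliced arrays, so no
# reference/subset copy is involved and linear_tree can be used.
lds = lgb.Dataset(X[subset_ind], y[subset_ind])
booster = lgb.train({'linear_tree': True, 'verbose': -1}, lds, num_boost_round=10)
print(booster.predict(X)[:5])
```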