microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Load back saved parameters with save_model to Booster object #2613

Closed: everdark closed this issue 2 years ago

everdark commented 4 years ago

Environment info

Operating System: Windows 10 (Same result on both Windows and WSL)

CPU/GPU model: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz

C++/Python/R version: Python 3.7

LightGBM version or commit hash: 2.3.1 installed by pip

Error message

None; no error is raised. The parameters are simply empty on the re-loaded Booster (see below).

Reproducible example

import lightgbm as lgb
import pandas as pd

print('Loading data...')
# load or create your dataset
df_train = pd.read_csv('../binary_classification/binary.train', header=None, sep='\t')
df_test = pd.read_csv('../binary_classification/binary.test', header=None, sep='\t')
W_train = pd.read_csv('../binary_classification/binary.train.weight', header=None)[0]
W_test = pd.read_csv('../binary_classification/binary.test.weight', header=None)[0]

y_train = df_train[0]
y_test = df_test[0]
X_train = df_train.drop(0, axis=1)
X_test = df_test.drop(0, axis=1)

num_train, num_feature = X_train.shape

# create dataset for lightgbm
# if you want to re-use data, remember to set free_raw_data=False
lgb_train = lgb.Dataset(X_train, y_train,
                        weight=W_train, free_raw_data=False)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train,
                       weight=W_test, free_raw_data=False)

# specify your configurations as a dict
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

# generate feature names
feature_name = ['feature_' + str(col) for col in range(num_feature)]

print('Starting training...')
# feature_name and categorical_feature
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                valid_sets=lgb_train,  # eval training data
                feature_name=feature_name,
                categorical_feature=[21])

print(gbm.params) # Check params.
gbm.save_model('model.txt')

gbm2 = lgb.Booster(model_file='model.txt')
print(gbm2.params) # Empty dict: the saved parameters are not loaded back.

The code example above is borrowed directly from the official advanced_example.py. I've confirmed that the parameters have been written to the model file; here is the tail of the file:

parameters:
[boosting: gbdt]
[objective: binary]
[metric: binary_logloss]
[tree_learner: serial]
[device_type: cpu]
[data: ]
[valid: ]
[num_iterations: 100]
[learning_rate: 0.05]
[num_leaves: 31]
[num_threads: 0]
[max_depth: -1]
[min_data_in_leaf: 20]
[min_sum_hessian_in_leaf: 0.001]
[bagging_fraction: 0.8]
[pos_bagging_fraction: 1]
[neg_bagging_fraction: 1]
[bagging_freq: 5]
[bagging_seed: 3]
[feature_fraction: 0.9]
[feature_fraction_bynode: 1]
[feature_fraction_seed: 2]
[early_stopping_round: 0]
[first_metric_only: 0]
[max_delta_step: 0]
[lambda_l1: 0]
[lambda_l2: 0]
[min_gain_to_split: 0]
[drop_rate: 0.1]
[max_drop: 50]
[skip_drop: 0.5]
[xgboost_dart_mode: 0]
[uniform_drop: 0]
[drop_seed: 4]
[top_rate: 0.2]
[other_rate: 0.1]
[min_data_per_group: 100]
[max_cat_threshold: 32]
[cat_l2: 10]
[cat_smooth: 10]
[max_cat_to_onehot: 4]
[top_k: 20]
[monotone_constraints: ]
[feature_contri: ]
[forcedsplits_filename: ]
[forcedbins_filename: ]
[refit_decay_rate: 0.9]
[cegb_tradeoff: 1]
[cegb_penalty_split: 0]
[cegb_penalty_feature_lazy: ]
[cegb_penalty_feature_coupled: ]
[verbosity: 0]
[max_bin: 255]
[max_bin_by_feature: ]
[min_data_in_bin: 3]
[bin_construct_sample_cnt: 200000]
[histogram_pool_size: -1]
[data_random_seed: 1]
[output_model: LightGBM_model.txt]
[snapshot_freq: -1]
[input_model: ]
[output_result: LightGBM_predict_result.txt]
[initscore_filename: ]
[valid_data_initscores: ]
[pre_partition: 0]
[enable_bundle: 1]
[max_conflict_rate: 0]
[is_enable_sparse: 1]
[sparse_threshold: 0.8]
[use_missing: 1]
[zero_as_missing: 0]
[two_round: 0]
[save_binary: 0]
[header: 0]
[label_column: ]
[weight_column: ]
[group_column: ]
[ignore_column: ]
[categorical_feature: ]
[predict_raw_score: 0]
[predict_leaf_index: 0]
[predict_contrib: 0]
[num_iteration_predict: -1]
[pred_early_stop: 0]
[pred_early_stop_freq: 10]
[pred_early_stop_margin: 10]
[convert_model_language: ]
[convert_model: gbdt_prediction.cpp]
[num_class: 1]
[is_unbalance: 0]
[scale_pos_weight: 1]
[sigmoid: 1]
[boost_from_average: 1]
[reg_sqrt: 0]
[alpha: 0.9]
[fair_c: 1]
[poisson_max_delta_step: 0.7]
[tweedie_variance_power: 1.5]
[max_position: 20]
[lambdamart_norm: 1]
[label_gain: ]
[metric_freq: 1]
[is_provide_training_metric: 0]
[eval_at: ]
[multi_error_top_k: 1]
[num_machines: 1]
[local_listen_port: 12400]
[time_out: 120]
[machine_list_filename: ]
[machines: ]
[gpu_platform_id: -1]
[gpu_device_id: -1]
[gpu_use_dp: 0]

end of parameters

pandas_categorical:[]

Is this behavior by design? I found this because I'm using shap with a saved model, and it failed to compute SHAP values: shap needs to access the objective in params, which is gone if the Booster is a pre-trained, re-loaded one.

For now, my workaround is to also pass params to Booster when loading:

gbm2 = lgb.Booster(model_file='model.txt', params=params)

However, I don't think this is good practice, since there is no way to ensure that the passed params are consistent with the saved model.

StrikerRUS commented 4 years ago

@everdark Thanks for your report! I think it's very related to #2604 and #2208. It will also require something like LGBM_BoosterGetConfig, or adding an [out] out_config argument to the existing functions on the C++ side. @guolinke

StrikerRUS commented 3 years ago

Closing in favor of tracking this in #2302; we decided to keep all feature requests in one place.

Contributions implementing this feature are welcome! Please re-open this issue (or post a comment, if you are not the topic starter) if you are actively working on it.

zyxue commented 3 years ago

is there any update on this issue?

jameslamb commented 3 years ago

> is there any update on this issue?

@zyxue , thanks for your interest in LightGBM!

If you're interested in working on this feature and contributing, let us know and we'd be happy to answer questions you have.

Otherwise, you can subscribe to notifications on this issue for updates.

zyxue commented 3 years ago

Hey @jameslamb, I'm interested in giving it a try. Do you have guidance on where to start?

jameslamb commented 3 years ago

Thanks @zyxue!

I'd start by reading the issues @StrikerRUS mentioned at https://github.com/microsoft/LightGBM/issues/2613#issuecomment-562216072, just to get a better understanding of this part of the code base.

Next, I'd add a test to https://github.com/microsoft/LightGBM/blob/da98f24711a2faab17f94e5b2a636e6609c93fa6/tests/python_package_test/test_basic.py using the reproducible example provided by @everdark. That test should fail until your changes are made.

Next, try to work through changes on the C++ side based on @StrikerRUS's statement https://github.com/microsoft/LightGBM/issues/2613#issuecomment-562216072.

> it'll require something like LGBM_BoosterGetConfig or adding [out] out_config argument to the existing functions at cpp side


Here's the relevant Python code that's called to create a Booster from a model .txt file. Note that it calls LGBM_BoosterCreateFromModelfile().

https://github.com/microsoft/LightGBM/blob/da98f24711a2faab17f94e5b2a636e6609c93fa6/python-package/lightgbm/basic.py#L2635-L2648

I believe you'll need to create a proposal for extracting the Config_ property from the Booster after it's loaded.

https://github.com/microsoft/LightGBM/blob/da98f24711a2faab17f94e5b2a636e6609c93fa6/src/boosting/gbdt.h#L459

"Config" is the word we use in LightGBM's C++ code to refer to an object that holds all parameters (see e.g. https://github.com/microsoft/LightGBM/pull/4724#pullrequestreview-790133134).

Here's code called by LGBM_BoosterCreateFromModelfile() which gets parameters from the model text file.

https://github.com/microsoft/LightGBM/blob/d517ba12f2e7862ac533908304dddbd770655d2b/src/boosting/gbdt_model_text.cpp#L571-L596


I'll re-open this issue for now since you're planning to work on it. We have a policy in this repo of keeping feature request issues marked "closed" if no one is working on them, so if for any reason you decide not to work on this feature for now, please let me know so we can re-close it.

And if you are interested in contributing but feel that this feature is not right for you, now that you know more about it, let me know what you're looking to work on and I'd be happy to suggest another one. Thanks again for your help!

zyxue commented 3 years ago

Thank you @jameslamb for the informative guide! I'll try to get to it.

zyxue commented 3 years ago

loaded_parameter_ isn't accessible via the Boosting class in the C++ code, right? It looks like loaded_parameter_ is an attribute specific to GBDT only.

zyxue commented 2 years ago

Hey @jameslamb, do you have any feedback on my PR above, please? I wonder if it's the right direction for loading back the saved params.

jameslamb commented 2 years ago

Thanks for starting on the work, @zyxue! We will review it as soon as possible.

A few other maintainers and I work on LightGBM in our spare time, so we can sometimes be slow to respond (especially to larger features like this one, which require more effort to review). Thanks for your patience.

github-actions[bot] commented 1 year ago

This issue has been automatically locked because there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues, including a reference to this one.

jameslamb commented 1 year ago

This was locked accidentally. I just unlocked it. We'd still welcome contributions related to this feature!