microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Python - metric and nthread #110

Closed gugatr0n1c closed 7 years ago

gugatr0n1c commented 7 years ago

Hi,

1] It seems to me that nthread is not working in the Python interface - no matter what I set, all threads are used.

2] If I have a 4-core CPU with hyperthreading = 8 threads, it still uses all 8 - the default setting is -1, so it should use 4, or not?

3] Is there a way to change the metric from the default l2 to l1 in Python? Setting metric = 'l1' is not working.

thx

Calling this:

model = lg.LGBMRegressor( objective = 'regression',
                        # metric = 'l2',  (commented out)
                        n_estimators = 25000,
                        learning_rate = 0.0025,
                        num_leaves = 1000,
                        max_depth = 15,
                        min_child_samples = 500,
                        colsample_bytree = 0.2,
                        subsample = 0.83,
                        subsample_freq = 1,
                        nthread = 1,
                        silent = True
                    ).fit(matrix_train, target_train, [(matrix_test, target_test)])
wxchan commented 7 years ago

1,2) It's the same as the C++ side - does nthread work for you in C++? 3) eval_metric is used in fit().

full API:

__init__(num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=10, 
max_bin=255, silent=True, objective="regression", nthread=-1, min_split_gain=0, 
min_child_weight=5, min_child_samples=10, subsample=1, subsample_freq=1, 
colsample_bytree=1, reg_alpha=0, reg_lambda=0, scale_pos_weight=1, 
is_unbalance=False, seed=0)
fit(X, y, eval_set=None, eval_metric=None, early_stopping_rounds=None, 
verbose=True, train_fields=None, valid_fields=None, feature_name=None, 
categorical_feature=None, other_params=None)
predict(data, raw_score=False, num_iteration=0)
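
Under this signature, the metric belongs in fit() rather than in the constructor. A minimal sketch of that usage (illustrative only; matrix_train, target_train, matrix_test, target_test are the placeholder names used in this thread):

import lightgbm as lg

model = lg.LGBMRegressor(objective="regression", nthread=4)
model.fit(matrix_train, target_train,
          eval_set=[(matrix_test, target_test)],
          eval_metric="l1",
          early_stopping_rounds=50)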
gugatr0n1c commented 7 years ago

1,2] Actually I used pyLightGBM, and with that it was working.

3] OK, thx - so the usage differs from xgboost, where eval_metric is NOT used as an input for training, just for monitoring on valid_data. Here it is an input for training that determines how the data is split in the leaves, right?

guolinke commented 7 years ago

@gugatr0n1c 3] I think xgboost also uses eval_metric in fit: https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier.fit

guolinke commented 7 years ago

Fixed 1. For 2, you should set nthread=4; the OpenMP default is all 8 hardware threads.
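
A quick way to see why the default grabs all 8 threads on a 4-core machine (an illustrative check using only the Python standard library):

import os

print(os.cpu_count())  # reports logical processors (8 with hyperthreading), which OpenMP uses by default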

gugatr0n1c commented 7 years ago

No, definitely not - in xgboost it is only used for valid data: https://github.com/dmlc/xgboost/blob/master/doc/parameter.md or here https://github.com/dmlc/xgboost/blob/ef4dcce7372dbc03b5066a614727f2a6dfcbd3bc/src/objective/regression_obj.cc

For regression, xgboost only trains with RMSE, and this is not possible to change - there is a new plugin system there where the user can change the objective, but not via eval_metric.

Anyway, when I tried modifying eval_metric here in fit(), it only changes the log output for valid_data, but splitting is always done with l2.

Is it possible to add a "training_metric" to change l2 to l1?

gugatr0n1c commented 7 years ago

Hmm, it does not make any sense to use l2 for training and l1 for valid_data monitoring... maybe eval_metric should fully influence the objective... (the drawback is that then only one eval_metric could be used - for some classification problems it is good to monitor auc, recall and logloss).

Up to you guys... but thanks for your great work, this library is outperforming xgb by a wide margin :)

guolinke commented 7 years ago

@gugatr0n1c I don't quite understand what you mean. Do you want to output the training metric during training? You can add the training data to eval_set:

fit(matrix_train, target_train, [(matrix_train, target_train), (matrix_test, target_test)], eval_metric="l1")

And LightGBM will not load train_data twice.

gugatr0n1c commented 7 years ago

Strange - I changed the code as suggested here, to:

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'l1',
    'max_depth': 15,
    'num_leaves': 1000,
    'min_data_in_leaf': 1000,
    'learning_rate': 0.0025,
    'feature_fraction': 0.2,
    'bagging_fraction': 0.83,
    'bagging_freq': 1,
    'verbose': 0,
    'nthread': 4
}

train:

model = lg.train(
    params,
    train_data = (matrix_train, target_train),
    num_boost_round = 2000,
    valid_datas = (matrix_test, target_test),
    early_stopping_rounds = 50
)

And now nthread is working correctly - so the previous problem is probably only a 'sklearn' wrapper issue. But I believe metric = 'l1' is still not working here either (it was working with pyLightGBM).

The goal is to enable learning a regression task with the 'l1' metric.

guolinke commented 7 years ago

@gugatr0n1c I just gave it a try, and it outputs the L1 metric. Can you paste your code?

gugatr0n1c commented 7 years ago

@guolinke There are two different things:

1] One is outputting the error on eval_data with the chosen metric - this is working correctly for me as well. But 2] the second is building the tree - where each split is chosen according to the chosen metric. This, I believe, always uses 'l2' no matter whether I set metric = 'l1'; when I was using pyLightGBM it was working.

guolinke commented 7 years ago

Are you sure pyLightGBM can do this? You would need to change the objective function to support it.

guolinke commented 7 years ago

And tree splits are not based on any metric - they use the gradients calculated by the objective function.
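
To make that concrete (an illustrative sketch, not code from this thread): training on L1 means supplying the objective's derivatives, not changing metric. Later LightGBM versions accept a callable objective in the scikit-learn wrapper with roughly this shape:

import numpy as np

def l1_objective(y_true, y_pred):
    # gradient of |y_pred - y_true| w.r.t. y_pred is sign(y_pred - y_true)
    grad = np.sign(y_pred - y_true)
    # the true second derivative is 0 almost everywhere; a constant keeps leaf values finite
    hess = np.ones_like(y_pred)
    return grad, hess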

guolinke commented 7 years ago

And can you give an example of how to set pyLightGBM to use L1 for tree splitting? I will check it.

gugatr0n1c commented 7 years ago

OK, again, sorry for the confusion...

Anyway, I am repurposing this issue as a proposal to create a new objective (instead of a claim that something is not working): MAE regression, where the objective is based on absolute differences, not least squares... similar to the deep-learning library MXNet: https://turi.com/products/create/docs/generated/graphlab.mxnet.symbol.MAERegressionOutput.html#graphlab.mxnet.symbol.MAERegressionOutput

This is a robust solution for regression when the target has many outliers. Thx.
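
Current LightGBM versions ship a built-in L1 objective along these lines. A minimal sketch with the native train API (objective and metric names per the LightGBM docs; the data names are placeholders from this thread):

import lightgbm as lgb

params = {
    'objective': 'regression_l1',  # train on absolute error, robust to outliers
    'metric': 'l1',                # also report MAE on the validation set
    'num_leaves': 31,
    'learning_rate': 0.05,
}
train_set = lgb.Dataset(matrix_train, label=target_train)
valid_set = lgb.Dataset(matrix_test, label=target_test, reference=train_set)
model = lgb.train(params, train_set, num_boost_round=2000, valid_sets=[valid_set])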

gugatr0n1c commented 7 years ago

Or as in scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html

wxchan commented 7 years ago

@gugatr0n1c LightGBM is the same as xgboost in this part. You can see this thread; it covers how to use MAE as an objective.

guolinke commented 7 years ago

@gugatr0n1c
I think you misunderstood the parameter is_training_metric. Its full name is is_provide_training_metric, which controls whether the metric is printed for the training data or not.
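
A minimal sketch of that parameter in use (is_training_metric is an alias listed in the LightGBM docs; illustrative config only):

params = {
    'objective': 'regression',
    'metric': 'l1',
    'is_training_metric': True,  # print the metric over the training data as well
}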

gugatr0n1c commented 7 years ago

OK, thx for the explanation. Closing.

supdizh commented 7 years ago

I have the same issue with 1): nthread is not working in both the vanilla Python and scikit-learn APIs. It works with the compiled binary + train.conf, however. My settings:

param = {
    'task': 'train',
    'boost_type': 'gbdt',
    'objective': 'multiclass',
    'num_class': 3,
    'max_bin': 255,
    'learning_rate': 1,
    'num_leaves': 31,
    'verbose': 1,
    'nthread': 1,
}

guolinke commented 7 years ago

@supdizh I just gave it a try; nthread works. Can you paste the full code, especially how you use the parameters?

supdizh commented 7 years ago

@guolinke Never mind - strangely, the exact same code works now.

github-actions[bot] commented 1 year ago

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.