microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[R-Package] Does LightGBM have a best_score attribute? #686

Closed SerigneCisse closed 7 years ago

SerigneCisse commented 7 years ago

Hi! I am trying to use this wonderful package for the first time in R. It's really fast and it works fine.
However, I was not able to get the score for the early stopping round when I tried model$best_score for a classification task. Could anybody help me with this issue? Thanks in advance.

guolinke commented 7 years ago

@wxchan

wxchan commented 7 years ago

@guolinke I am not familiar with R-package. Maybe @Laurae2 can help.

Laurae2 commented 7 years ago

Assuming the first metric and the first validation dataset are the ones used for early stopping, assuming it is a metric minimization task, and with the model stored in a variable named model:

min(as.numeric(unlist(model$record_evals[[2]][[1]])))
# or more simply with best_iter
as.numeric(unlist(model$record_evals[[2]][[1]]))[model$best_iter]
# or again more simply with best_iter
model$record_evals[[2]][[1]][[1]][[model$best_iter]]

should do the task.

@guolinke Do you know where to add it in the R-package? (and how to know if it is minimization or maximization task?)

Example code:

library(lightgbm)
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
data(agaricus.test, package = "lightgbm")
test <- agaricus.test
dtest <- lgb.Dataset.create.valid(dtrain, test$data, label = test$label)
params <- list(objective = "regression", metric = "l2")
valids <- list(test = dtest)
model <- lgb.train(params,
                   dtrain,
                   100,
                   valids,
                   min_data = 1,
                   learning_rate = 1,
                   early_stopping_rounds = 10)
which.min(as.numeric(unlist(model$record_evals[[2]][[1]]))) # aka best_iter
min(as.numeric(unlist(model$record_evals[[2]][[1]]))) # aka best_score
as.numeric(unlist(model$record_evals[[2]][[1]]))[model$best_iter] # another way with best_iter
model$record_evals[[2]][[1]][[1]][[model$best_iter]] # probably simpler way with best_iter

SerigneCisse commented 7 years ago

Hi @Laurae2, I tried the last one of your examples and it works well for both minimization (binary log-loss) and maximization (F1 score).
However, by default it gave me the score of the training set (and not the validation set). So I changed the valids argument of lgb.train from valids <- list(train = dtrain, eval = dval) to valids <- list(eval = dval) in order to get the right output.

Thank you very much. And thanks all the community for your amazing work.

guolinke commented 7 years ago

@Laurae2 The best_iter is set in this line: https://github.com/Microsoft/LightGBM/blob/master/R-package/R/callback.R#L353
We can add best_score to the Booster and set it there as well.

SerigneCisse commented 7 years ago

BTW @guolinke and @wxchan,
do you know if the Python version has a best_score attribute? Or would the code given by @Laurae2 work in Python (replacing $ with .)?

wxchan commented 7 years ago

check https://github.com/Microsoft/LightGBM/blob/ce999b756af8183541dfc762a9dd819a433bf8a1/tests/python_package_test/test_engine.py#L145-L174
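
For reference, here is a minimal sketch of the pattern that test exercises, assuming the Python API of that period (lgb.train() accepting early_stopping_rounds, and the returned Booster exposing best_iteration plus best_score as a dict keyed by validation set name and then by metric name). The data below is random toy data, purely for illustration:

import lightgbm as lgb
import numpy as np

# toy data, only to make the sketch self-contained
X = np.random.rand(500, 10)
y = np.random.rand(500)
dtrain = lgb.Dataset(X[:400], label=y[:400])
dvalid = lgb.Dataset(X[400:], label=y[400:], reference=dtrain)

params = {"objective": "regression", "metric": "l2"}
bst = lgb.train(params,
                dtrain,
                num_boost_round=100,
                valid_sets=[dvalid],
                valid_names=["eval"],
                early_stopping_rounds=10)

print(bst.best_iteration)             # iteration picked by early stopping
print(bst.best_score)                 # nested dict, e.g. {'eval': {'l2': ...}}
print(bst.best_score["eval"]["l2"])   # the floating-point best score itself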

SerigneCisse commented 7 years ago

Hi @wxchan

Thanks, that seems to work (at least I can see the score). However (and sorry for the novice question), it is in dict format. My goal is to get the (floating-point) best_score from each validation fold and then average them outside the loop.

So I tried this:

cv_sum = 0

for fold in range(folds):   # looping through the folds
    cv_score = clf.best_score[valid_set_name]
    cv_sum = cv_sum + cv_score

score = cv_sum / folds

But that doesn't work (because of the dict format?)
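
A minimal sketch of that averaging loop, assuming best_score is a nested dict of the form {validation_set_name: {metric_name: value}} as seen above; train_one_fold, n_folds, valid_set_name and metric_name are placeholders here, not LightGBM API:

cv_sum = 0.0
for fold in range(n_folds):                  # n_folds: placeholder for the number of CV folds
    clf = train_one_fold(fold)               # hypothetical helper that trains this fold with early stopping
    # index first by validation set name, then by metric name, to get the float itself
    cv_score = clf.best_score[valid_set_name][metric_name]
    cv_sum += cv_score
score = cv_sum / n_folds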