microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License
16.55k stars 3.82k forks source link

[R] slow when function returning R6 object #539

Closed BlindApe closed 6 years ago

BlindApe commented 7 years ago

I'll give a detailed enviroment sheet and a reproducible example, but meanwhile this is the problem:

d.level1 <- lgb.Dataset(data = data.matrix(level1)) lightgbm::setinfo(d.level1, 'label', level1.response) Error in paste(as.character(obj), collapse = " ") : long vectors not supported yet: memory.c:1652 length(level1.response) [1] 4214186

level1 has 4214186 rows and 1152 columns.

guolinke commented 7 years ago

I think 4214186 isn't long. And this issue is strange. Did you run it immediately after installing LightGBM ? you can try to restart R and run it again.

BlindApe commented 7 years ago

The problem is reproducible after restarting the machine. With a few columns it works, so I think the problem could be with 4212186 * 1152 that is greater than 2^32. I've noticed lightgbm::setinfo in R takes a long long time with this big dataset, and finally crash. I've tried the same dataset with xgboost and setinfo is a blink.

guolinke commented 7 years ago

@BlindApe SetInfo have nothing to do with number of columns. It only use one column. And paste(as.character(obj), collapse = " ") isn't called by lightgbm. Did you check the type of level1.response ? can you try following code?

d.level1 <- lgb.Dataset(data = data.matrix(level1))
label <- level1.response
print(typeof(label))
print(label)
lgb.setinfo(d.level1, "label", label)
guolinke commented 7 years ago

BTW, SetInfo only copy ( copy the reference ) the passed array. So, It should be very fast.

Laurae2 commented 7 years ago

@BlindApe label must be a vector like in xgboost.

BlindApe commented 7 years ago

@guolinke , @Laurae2 , thank for your support.

As you see this code takes 1 minute with a matrix of 20 columns, nearly 5 minutes with 100 columns and an eternity (I'm still waiting a crash) with 1300 columns. lgb.setinfo isn't visible so I need to use setinfo(). The type of label is 'double'.

This is the sessionInfo(): R version 3.3.1 (2016-06-21) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 15.10

locale: [1] LC_CTYPE=es_ES.UTF-8 LC_NUMERIC=C LC_TIME=es_ES.UTF-8 LC_COLLATE=es_ES.UTF-8 LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=es_ES.UTF-8
[7] LC_PAPER=es_ES.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] lightgbm_0.1 R6_2.2.1 data.table_1.10.4

loaded via a namespace (and not attached): [1] magrittr_1.5 tools_3.3.1

It's a dual xeon server with 2 x E5-2690 v2 and 256 Gb RAM.

Code:

data <- matrix(1, 4000000, 20) label <- rep(0, 4000000) print(typeof(label)) [1] "double" print(head(label)) [1] 0 0 0 0 0 0 train <- lgb.Dataset(data = data) tic = proc.time()[3] setinfo(train, 'label', label)

Public: construct: function () create_valid: function (data, info = list(), ...) dim: function () finalize: function () get_colnames: function () getinfo: function (name) initialize: function (data, params = list(), reference = NULL, colnames = NULL, save_binary: function (fname) set_categorical_feature: function (categorical_feature) set_colnames: function (colnames) setinfo: function (name, info) set_reference: function (reference) slice: function (idxset, ...) update_params: function (params) Private: categorical_feature: NULL colnames: NULL free_raw_data: TRUE get_handle: function () handle: NULL info: list params: list predictor: NULL raw_data: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... reference: NULL set_predictor: function (predictor) used_indices: NULL cat(round((proc.time()[3] - tic) / 60, 2), ' mins', '\n') 0.96 mins data <- matrix(1, 4000000, 100) label <- rep(0, 4000000) train <- lgb.Dataset(data = data) tic = proc.time()[3] setinfo(train, 'label', label) Public: construct: function () create_valid: function (data, info = list(), ...) dim: function () finalize: function () get_colnames: function () getinfo: function (name) initialize: function (data, params = list(), reference = NULL, colnames = NULL, save_binary: function (fname) set_categorical_feature: function (categorical_feature) set_colnames: function (colnames) setinfo: function (name, info) set_reference: function (reference) slice: function (idxset, ...) update_params: function (params) Private: categorical_feature: NULL colnames: NULL free_raw_data: TRUE get_handle: function () handle: NULL info: list params: list predictor: NULL raw_data: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... reference: NULL set_predictor: function (predictor) used_indices: NULL cat(round((proc.time()[3] - tic) / 60, 2), ' mins', '\n') 4.79 mins data <- matrix(1, 4000000, 1300) label <- rep(0, 4000000) train <- lgb.Dataset(data = data) tic = proc.time()[3] setinfo(train, 'label', label)
BlindApe commented 7 years ago

With xgboost (dataset creation with 1300 columns takes a while, but setinfo is a blink):

rm(list = ls(all = TRUE)) require(data.table) Loading required package: data.table data.table 1.10.4 The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way Documentation: ?data.table, example(data.table) and browseVignettes("data.table") Release notes, videos and slides: http://r-datatable.com require(xgboost) Loading required package: xgboost

data <- matrix(1, 4000000, 20) label <- rep(0, 4000000) print(typeof(label)) [1] "double" print(head(label)) [1] 0 0 0 0 0 0 train <- xgb.DMatrix(data = data) tic = proc.time()[3] setinfo(train, 'label', label) [1] TRUE cat(round((proc.time()[3] - tic) / 60, 2), ' mins', '\n') 0 mins

data <- matrix(1, 4000000, 100) label <- rep(0, 4000000) train <- xgb.DMatrix(data = data) tic = proc.time()[3] setinfo(train, 'label', label) [1] TRUE cat(round((proc.time()[3] - tic) / 60, 2), ' mins', '\n') 0 mins

data <- matrix(1, 4000000, 1300) label <- rep(0, 4000000) train <- xgb.DMatrix(data = data) tic = proc.time()[3] setinfo(train, 'label', label) [1] TRUE cat(round((proc.time()[3] - tic) / 60, 2), ' mins', '\n') 0 mins

Laurae2 commented 7 years ago

I can't reproduce it.

I am getting this:

> library(R.utils)
> library(lightgbm)
> data <- matrix(1, 4000000, 20)
> label <- rep(0, 4000000)
> train <- lgb.Dataset(data = data, label = label)
> Laurae::timer_func_print(setinfo(train, 'label', label))
The function ran in 2.001 milliseconds.
[1] 2.000977
> library(R.utils)
> library(lightgbm)
> data <- matrix(1, 4000000, 100)
> label <- rep(0, 4000000)
> label2 <- rep(1, 4000000)
> train <- lgb.Dataset(data = data, label = label)
> Laurae::timer_func_print(setinfo(train, 'label', label2))
The function ran in 3.004 milliseconds.
[1] 3.00415

From a fully fresh session:

> library(R.utils)
Loading required package: R.oo
Loading required package: R.methodsS3
R.methodsS3 v1.7.1 (2016-02-15) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.21.0 (2016-10-30) successfully loaded. See ?R.oo for help.

Attaching package: ‘R.oo’

The following objects are masked from ‘package:methods’:

    getClasses, getMethods

The following objects are masked from ‘package:base’:

    attach, detach, gc, load, save

R.utils v2.5.0 (2016-11-07) successfully loaded. See ?R.utils for help.

Attaching package: ‘R.utils’

The following object is masked from ‘package:RevoMods’:

    timestamp

The following object is masked from ‘package:utils’:

    timestamp

The following objects are masked from ‘package:base’:

    cat, commandArgs, getOption, inherits, isOpen, parse, warnings

> library(lightgbm)
Loading required package: R6
Warning message:
package ‘R6’ was built under R version 3.3.3 
> data <- matrix(1, 4000000, 100)
> label <- rep(0, 4000000)
> label2 <- rep(1, 4000000)
> train <- lgb.Dataset(data = data)
> Laurae::timer_func_print(setinfo(train, 'label', label2))
The function ran in 1.001 milliseconds.
[1] 1.000977

Using a laptop here.

Did I miss something from your script?

BlindApe commented 7 years ago

The problem is setinfo after lgb.Dataset. Doing data and label at same time isn't problem:

data <- matrix(1, 4000000, 20) label <- rep(0, 4000000) print(typeof(label)) [1] "double" print(head(label)) [1] 0 0 0 0 0 0 train <- lgb.Dataset(data = data, label = label) tic = proc.time()[3] cat(round((proc.time()[3] - tic) / 60, 2), ' mins', '\n') 0 mins

Laurae2 commented 7 years ago

@BlindApe My last example was done without doing data and label at the same time. It is still taking about 1 millisecond.

However, when I replace all 1s by 0s, I get the issue you mentioned (but it takes only 10 seconds here, not 5 minutes).

@guolinke any reason why it is slower when using 0s instead of 1s? It seems to duplicate the data also, which duplicates the memory used.

BlindApe commented 7 years ago

This is in other server, running different R version under Windows:

sessionInfo() R version 3.4.0 (2017-04-21) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1 Matrix products: default locale: [1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C LC_TIME=Spanish_Spain.1252
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] lightgbm_0.1 R6_2.2.1 data.table_1.10.4 loaded via a namespace (and not attached): [1] compiler_3.4.0 magrittr_1.5

rm(list = ls(all = TRUE)) require(data.table) Loading required package: data.table data.table 1.10.4 The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way Documentation: ?data.table, example(data.table) and browseVignettes("data.table") Release notes, videos and slides: http://r-datatable.com require(lightgbm) Loading required package: lightgbm Loading required package: R6

data <- matrix(1, 4000000, 20) label <- rep(0, 4000000) print(typeof(label)) [1] "double" print(head(label)) [1] 0 0 0 0 0 0 train <- lgb.Dataset(data = data) tic = proc.time()[3] setinfo(train, 'label', label)

Public: construct: function () create_valid: function (data, info = list(), ...) dim: function () finalize: function () get_colnames: function () getinfo: function (name) initialize: function (data, params = list(), reference = NULL, colnames = NULL, save_binary: function (fname) set_categorical_feature: function (categorical_feature) set_colnames: function (colnames) set_reference: function (reference) setinfo: function (name, info) slice: function (idxset, ...) update_params: function (params) Private: categorical_feature: NULL colnames: NULL free_raw_data: TRUE get_handle: function () handle: NULL info: list params: list predictor: NULL raw_data: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... reference: NULL set_predictor: function (predictor) used_indices: NULL cat(round((proc.time()[3] - tic) / 60, 2), ' mins', '\n') 1.79 mins
Laurae2 commented 7 years ago

Can you change label <- rep(0, 4000000) with label <- rep(1, 4000000) and report the time?

BlindApe commented 7 years ago

I was doing:

data <- matrix(1, 4000000, 20) label <- rep(1, 4000000) print(typeof(label)) [1] "double" print(head(label)) [1] 1 1 1 1 1 1 train <- lgb.Dataset(data = data) tic = proc.time()[3] setinfo(train, 'label', label)

Public: construct: function () create_valid: function (data, info = list(), ...) dim: function () finalize: function () get_colnames: function () getinfo: function (name) initialize: function (data, params = list(), reference = NULL, colnames = NULL, save_binary: function (fname) set_categorical_feature: function (categorical_feature) set_colnames: function (colnames) set_reference: function (reference) setinfo: function (name, info) slice: function (idxset, ...) update_params: function (params) Private: categorical_feature: NULL colnames: NULL free_raw_data: TRUE get_handle: function () handle: NULL info: list params: list predictor: NULL raw_data: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... reference: NULL set_predictor: function (predictor) used_indices: NULL cat(round((proc.time()[3] - tic) / 60, 2), ' mins', '\n') 1.83 mins
guolinke commented 7 years ago

@BlindApe can you try to replace setinfo(train, 'label', label) with train$setinfo(`label`, label) (Doing data and label at the same time will call this) ? @Laurae2 I check the code for setinfo: https://github.com/Microsoft/LightGBM/blob/master/R-package/R/lgb.Dataset.R#L413-L451 I think the most time consuming part is the data conversion: https://github.com/Microsoft/LightGBM/blob/master/R-package/R/lgb.Dataset.R#L423-L428 .
The LGBM_DatasetSetField_R won't be called here since we use lazy init.

guolinke commented 7 years ago

@BlindApe can you also test the time cost of label <- as.numeric(label) in your machine ?

BlindApe commented 7 years ago

@guolinke

Seems the problem isn't in as.numeric(). Replacing the function with directly invoke the method has the same behavior.

In linux server:

rm(list = ls(all = TRUE)) require(lightgbm)

data <- matrix(1, 4000000, 20) label <- rep(0, 4000000) tic = proc.time()[3] label <- as.numeric(label) cat(proc.time()[3] - tic, 'secs', '\n') 0 secs print(typeof(label)) [1] "double" print(head(label)) [1] 0 0 0 0 0 0 train <- lgb.Dataset(data = data) tic = proc.time()[3] train$setinfo('label', label) cat(proc.time()[3] - tic, 'secs', '\n') 57.443 secs

In Windows server:

rm(list = ls(all = TRUE)) require(lightgbm)

data <- matrix(1, 4000000, 20) label <- rep(0, 4000000) tic = proc.time()[3] label <- as.numeric(label) cat(proc.time()[3] - tic, 'secs', '\n') 0.05 secs print(typeof(label)) [1] "double" print(head(label)) [1] 0 0 0 0 0 0 train <- lgb.Dataset(data = data) tic = proc.time()[3] train$setinfo('label', label) cat(proc.time()[3] - tic, 'secs', '\n') 93.71 secs

BlindApe commented 7 years ago

Time is proportional to rows and columns, and the behavior is the same both in Windows and Linux

data <- matrix(1, 4000, 20) label <- rep(0, 4000) tic = proc.time()[3] label <- as.numeric(label) cat(proc.time()[3] - tic, 'secs', '\n') 0.002 secs print(typeof(label)) [1] "double" print(head(label)) [1] 0 0 0 0 0 0 train <- lgb.Dataset(data = data) tic = proc.time()[3] train$setinfo('label', label) cat(proc.time()[3] - tic, 'secs', '\n') 0.124 secs

data <- matrix(1, 4000000, 100) label <- rep(0, 4000000) tic = proc.time()[3] label <- as.numeric(label) cat(proc.time()[3] - tic, 'secs', '\n') 0.001 secs print(typeof(label)) [1] "double" print(head(label)) [1] 0 0 0 0 0 0 train <- lgb.Dataset(data = data) tic = proc.time()[3] train$setinfo('label', label) cat(proc.time()[3] - tic, 'secs', '\n') 287.985 secs

Laurae2 commented 7 years ago

81.27s:

require(lightgbm)
data <- matrix(1, 4000000, 20)
label <- rep(0, 4000000)
tic = proc.time()[3]
label <- as.numeric(label)
cat(proc.time()[3] - tic, 'secs', '\n')
print(typeof(label))
print(head(label))
train <- lgb.Dataset(data = data)
tic = proc.time()[3]
train$setinfo('label', label)
cat(proc.time()[3] - tic, 'secs', '\n') // 81.27 secs 

but running this is really strange as it does exactly the same thing but faster!:

library(profvis)

profvis({
  library(lightgbm)
  data <- matrix(1, 4000000, 20)
  label <- rep(1, 4000000)
  train <- lgb.Dataset(data = data)
  train$setinfo("label", label)
})

str(train$getinfo("label")) // CORRECT ANSWER: num [1:4000000] 1 1 1 1 1 1 1 1 1 1 ...

image

The behavior is really strange when out of a function and running in the global environment.

BlindApe commented 7 years ago

I think the message I obtained first time I tried with 1152 columns could give a tip:

Error in paste(as.character(obj), collapse = " ") : long vectors not supported yet: memory.c:1652

It's like internally all data were pasted in some way...

guolinke commented 7 years ago

@Laurae2 @BlindApe maybe the reason is the memory relocation of R6 Class (Or R?).
When set this : https://github.com/Microsoft/LightGBM/blob/master/R-package/R/lgb.Dataset.R#L431, R6 may copy the data to a new memory address. Thus, when data is large, it will cost more time.

And it calls paste(as.character(obj), collapse = " "). in memory.c, this seems is for some memory operations as well.

Laurae2 commented 7 years ago

I ran this:

library(lightgbm)
data <- matrix(1, 4000000, 20)
label <- rep(1, 4000000)
train <- lgb.Dataset(data = data)

debugonce(train$setinfo)
train$setinfo("label", label)

It seems there is a problem with return(self). It takes forever from that point.

This post shows an issue with assignments on R6 objects and type checking leading to 30x slower assignments on small stuff: https://cran.r-project.org/web/packages/R6/vignettes/Performance.html

BlindApe commented 7 years ago

@Laurae2 I've reproduced the behaviour doing inside a naive function:

rm(list = ls(all = TRUE)) require(lightgbm) Loading required package: lightgbm Loading required package: R6 nothing <- function() { data <- matrix(1, 4000000, 100) label <- rep(0, 4000000) tic = proc.time()[3] label <- as.numeric(label) cat(proc.time()[3] - tic, 'secs', '\n') print(typeof(label)) print(head(label)) train <- lgb.Dataset(data = data) tic = proc.time()[3] train$setinfo('label', label) cat(proc.time()[3] - tic, 'secs', '\n') }

nothing() 0 secs [1] "double" [1] 0 0 0 0 0 0 0 secs

So a temporary work around is call it inside a function.

guolinke commented 7 years ago

@Laurae2 What is the state of this issue now ?

Laurae2 commented 7 years ago

@guolinke Currently can't seem to fix this, I managed to delay the issue but this is not a good workaround.

Laurae2 commented 7 years ago

@BlindApe Your/Our workaround works until we print the train model. You can test this code to see the issue:

lgb.unloader(wipe = TRUE)
require(lightgbm)
nothing <- function() {
  data <- matrix(1, 4000000, 20)
  label <- rep(0, 4000000)
  tic = proc.time()[3]
  label <- as.numeric(label)
  cat(proc.time()[3] - tic, 'secs', '\n')
  print(typeof(label))
  print(head(label))
  train <- lgb.Dataset(data = data)
  tic = proc.time()[3]
  train$setinfo('label', label)
  cat(proc.time()[3] - tic, 'secs', '\n')
  return(train)
}
train <- nothing() # FAST
train # SLOW
guolinke commented 7 years ago

@Laurae2 So R console will print the return object, and the print will have some serialization, which causes the slow-down ?

Laurae2 commented 7 years ago

@guolinke The cause is in the screenshot:

image

I have no idea what this as.character is actually doing here. But it is directly related to print and it explains why the duration scales with the size of the matrix(ces).

In RStudio, doing this reproduces exactly the issue:

library(profvis)

profvis({
  library(lightgbm)
  data <- matrix(1, 4000000, 20)
  label <- rep(1, 4000000)
  train <- lgb.Dataset(data = data)
  train$setinfo("label", label)
  print(train)
})
guolinke commented 7 years ago

I think this https://stat.ethz.ch/R-manual/R-devel/library/base/html/invisible.html may can solve this issue.

Laurae2 commented 7 years ago

@guolinke the issue is still here when returning invisibly. I start to think it is the fault of the environment / R6 object, and not of LightGBM. The print function is 100% unrelated to LightGBM, even though LightGBM seems to create the issue.

guolinke commented 7 years ago

@Laurae2 after using invisible, the problem only show up when manually print the R6 object.

guolinke commented 7 years ago

@BlindApe can you try the latest code and see what happen ?

Laurae2 commented 7 years ago

@guolinke It defers the issue when printing. We may add a warning on docs about it.

BlindApe commented 7 years ago

@guolinke @Laurae2

I confirm the problem is still there when printing the object (running the Laurae2 example code). I can live using invisible and without printing the objects. Adding a warning on docs would be nice.

Laurae2 commented 7 years ago

@BlindApe Warning added in https://github.com/Microsoft/LightGBM/pull/555

guolinke commented 6 years ago

close since it is R6`s bug.