Closed BlindApe closed 6 years ago
I think 4214186
isn't long. And this issue is strange.
Did you run it immediately after installing LightGBM ? you can try to restart R and run it again.
The problem is reproducible after restarting the machine. With a few columns it works, so I think the problem could be with 4212186 * 1152 that is greater than 2^32. I've noticed lightgbm::setinfo in R takes a long long time with this big dataset, and finally crash. I've tried the same dataset with xgboost and setinfo is a blink.
@BlindApe
SetInfo
have nothing to do with number of columns. It only use one column.
And paste(as.character(obj), collapse = " ")
isn't called by lightgbm.
Did you check the type of level1.response
?
can you try following code?
d.level1 <- lgb.Dataset(data = data.matrix(level1))
label <- level1.response
print(typeof(label))
print(label)
lgb.setinfo(d.level1, "label", label)
BTW, SetInfo
only copy ( copy the reference ) the passed array. So, It should be very fast.
@BlindApe label must be a vector like in xgboost.
@guolinke , @Laurae2 , thank for your support.
As you see this code takes 1 minute with a matrix of 20 columns, nearly 5 minutes with 100 columns and an eternity (I'm still waiting a crash) with 1300 columns. lgb.setinfo isn't visible so I need to use setinfo(). The type of label is 'double'.
This is the sessionInfo(): R version 3.3.1 (2016-06-21) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 15.10
locale:
[1] LC_CTYPE=es_ES.UTF-8 LC_NUMERIC=C LC_TIME=es_ES.UTF-8 LC_COLLATE=es_ES.UTF-8 LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=es_ES.UTF-8
[7] LC_PAPER=es_ES.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] lightgbm_0.1 R6_2.2.1 data.table_1.10.4
loaded via a namespace (and not attached): [1] magrittr_1.5 tools_3.3.1
It's a dual xeon server with 2 x E5-2690 v2 and 256 Gb RAM.
Code:
data <- matrix(1, 4000000, 20) label <- rep(0, 4000000) print(typeof(label)) [1] "double" print(head(label)) [1] 0 0 0 0 0 0 train <- lgb.Dataset(data = data) tic = proc.time()[3] setinfo(train, 'label', label)
Public: construct: function () create_valid: function (data, info = list(), ...) dim: function () finalize: function () get_colnames: function () getinfo: function (name) initialize: function (data, params = list(), reference = NULL, colnames = NULL, save_binary: function (fname) set_categorical_feature: function (categorical_feature) set_colnames: function (colnames) setinfo: function (name, info) set_reference: function (reference) slice: function (idxset, ...) update_params: function (params) Private: categorical_feature: NULL colnames: NULL free_raw_data: TRUE get_handle: function () handle: NULL info: list params: list predictor: NULL raw_data: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... reference: NULL set_predictor: function (predictor) used_indices: NULL cat(round((proc.time()[3] - tic) / 60, 2), ' mins', '\n') 0.96 mins data <- matrix(1, 4000000, 100) label <- rep(0, 4000000) train <- lgb.Dataset(data = data) tic = proc.time()[3] setinfo(train, 'label', label) Public: construct: function () create_valid: function (data, info = list(), ...) dim: function () finalize: function () get_colnames: function () getinfo: function (name) initialize: function (data, params = list(), reference = NULL, colnames = NULL, save_binary: function (fname) set_categorical_feature: function (categorical_feature) set_colnames: function (colnames) setinfo: function (name, info) set_reference: function (reference) slice: function (idxset, ...) update_params: function (params) Private: categorical_feature: NULL colnames: NULL free_raw_data: TRUE get_handle: function () handle: NULL info: list params: list predictor: NULL raw_data: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... reference: NULL set_predictor: function (predictor) used_indices: NULL cat(round((proc.time()[3] - tic) / 60, 2), ' mins', '\n') 4.79 mins data <- matrix(1, 4000000, 1300) label <- rep(0, 4000000) train <- lgb.Dataset(data = data) tic = proc.time()[3] setinfo(train, 'label', label)
With xgboost (dataset creation with 1300 columns takes a while, but setinfo is a blink):
rm(list = ls(all = TRUE)) require(data.table) Loading required package: data.table data.table 1.10.4 The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way Documentation: ?data.table, example(data.table) and browseVignettes("data.table") Release notes, videos and slides: http://r-datatable.com require(xgboost) Loading required package: xgboost
data <- matrix(1, 4000000, 20) label <- rep(0, 4000000) print(typeof(label)) [1] "double" print(head(label)) [1] 0 0 0 0 0 0 train <- xgb.DMatrix(data = data) tic = proc.time()[3] setinfo(train, 'label', label) [1] TRUE cat(round((proc.time()[3] - tic) / 60, 2), ' mins', '\n') 0 mins
data <- matrix(1, 4000000, 100) label <- rep(0, 4000000) train <- xgb.DMatrix(data = data) tic = proc.time()[3] setinfo(train, 'label', label) [1] TRUE cat(round((proc.time()[3] - tic) / 60, 2), ' mins', '\n') 0 mins
data <- matrix(1, 4000000, 1300) label <- rep(0, 4000000) train <- xgb.DMatrix(data = data) tic = proc.time()[3] setinfo(train, 'label', label) [1] TRUE cat(round((proc.time()[3] - tic) / 60, 2), ' mins', '\n') 0 mins
I can't reproduce it.
I am getting this:
> library(R.utils)
> library(lightgbm)
> data <- matrix(1, 4000000, 20)
> label <- rep(0, 4000000)
> train <- lgb.Dataset(data = data, label = label)
> Laurae::timer_func_print(setinfo(train, 'label', label))
The function ran in 2.001 milliseconds.
[1] 2.000977
> library(R.utils)
> library(lightgbm)
> data <- matrix(1, 4000000, 100)
> label <- rep(0, 4000000)
> label2 <- rep(1, 4000000)
> train <- lgb.Dataset(data = data, label = label)
> Laurae::timer_func_print(setinfo(train, 'label', label2))
The function ran in 3.004 milliseconds.
[1] 3.00415
From a fully fresh session:
> library(R.utils)
Loading required package: R.oo
Loading required package: R.methodsS3
R.methodsS3 v1.7.1 (2016-02-15) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.21.0 (2016-10-30) successfully loaded. See ?R.oo for help.
Attaching package: ‘R.oo’
The following objects are masked from ‘package:methods’:
getClasses, getMethods
The following objects are masked from ‘package:base’:
attach, detach, gc, load, save
R.utils v2.5.0 (2016-11-07) successfully loaded. See ?R.utils for help.
Attaching package: ‘R.utils’
The following object is masked from ‘package:RevoMods’:
timestamp
The following object is masked from ‘package:utils’:
timestamp
The following objects are masked from ‘package:base’:
cat, commandArgs, getOption, inherits, isOpen, parse, warnings
> library(lightgbm)
Loading required package: R6
Warning message:
package ‘R6’ was built under R version 3.3.3
> data <- matrix(1, 4000000, 100)
> label <- rep(0, 4000000)
> label2 <- rep(1, 4000000)
> train <- lgb.Dataset(data = data)
> Laurae::timer_func_print(setinfo(train, 'label', label2))
The function ran in 1.001 milliseconds.
[1] 1.000977
Using a laptop here.
Did I miss something from your script?
The problem is setinfo after lgb.Dataset. Doing data and label at same time isn't problem:
data <- matrix(1, 4000000, 20) label <- rep(0, 4000000) print(typeof(label)) [1] "double" print(head(label)) [1] 0 0 0 0 0 0 train <- lgb.Dataset(data = data, label = label) tic = proc.time()[3] cat(round((proc.time()[3] - tic) / 60, 2), ' mins', '\n') 0 mins
@BlindApe My last example was done without doing data and label at the same time. It is still taking about 1 millisecond.
However, when I replace all 1s by 0s, I get the issue you mentioned (but it takes only 10 seconds here, not 5 minutes).
@guolinke any reason why it is slower when using 0s instead of 1s? It seems to duplicate the data also, which duplicates the memory used.
This is in other server, running different R version under Windows:
sessionInfo() R version 3.4.0 (2017-04-21) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1 Matrix products: default locale: [1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C LC_TIME=Spanish_Spain.1252
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] lightgbm_0.1 R6_2.2.1 data.table_1.10.4 loaded via a namespace (and not attached): [1] compiler_3.4.0 magrittr_1.5rm(list = ls(all = TRUE)) require(data.table) Loading required package: data.table data.table 1.10.4 The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way Documentation: ?data.table, example(data.table) and browseVignettes("data.table") Release notes, videos and slides: http://r-datatable.com require(lightgbm) Loading required package: lightgbm Loading required package: R6
data <- matrix(1, 4000000, 20) label <- rep(0, 4000000) print(typeof(label)) [1] "double" print(head(label)) [1] 0 0 0 0 0 0 train <- lgb.Dataset(data = data) tic = proc.time()[3] setinfo(train, 'label', label)
Public: construct: function () create_valid: function (data, info = list(), ...) dim: function () finalize: function () get_colnames: function () getinfo: function (name) initialize: function (data, params = list(), reference = NULL, colnames = NULL, save_binary: function (fname) set_categorical_feature: function (categorical_feature) set_colnames: function (colnames) set_reference: function (reference) setinfo: function (name, info) slice: function (idxset, ...) update_params: function (params) Private: categorical_feature: NULL colnames: NULL free_raw_data: TRUE get_handle: function () handle: NULL info: list params: list predictor: NULL raw_data: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... reference: NULL set_predictor: function (predictor) used_indices: NULL cat(round((proc.time()[3] - tic) / 60, 2), ' mins', '\n') 1.79 mins
Can you change label <- rep(0, 4000000)
with label <- rep(1, 4000000)
and report the time?
I was doing:
data <- matrix(1, 4000000, 20) label <- rep(1, 4000000) print(typeof(label)) [1] "double" print(head(label)) [1] 1 1 1 1 1 1 train <- lgb.Dataset(data = data) tic = proc.time()[3] setinfo(train, 'label', label)
Public: construct: function () create_valid: function (data, info = list(), ...) dim: function () finalize: function () get_colnames: function () getinfo: function (name) initialize: function (data, params = list(), reference = NULL, colnames = NULL, save_binary: function (fname) set_categorical_feature: function (categorical_feature) set_colnames: function (colnames) set_reference: function (reference) setinfo: function (name, info) slice: function (idxset, ...) update_params: function (params) Private: categorical_feature: NULL colnames: NULL free_raw_data: TRUE get_handle: function () handle: NULL info: list params: list predictor: NULL raw_data: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... reference: NULL set_predictor: function (predictor) used_indices: NULL cat(round((proc.time()[3] - tic) / 60, 2), ' mins', '\n') 1.83 mins
@BlindApe
can you try to replace setinfo(train, 'label', label)
with train$setinfo(`label`, label)
(Doing data and label at the same time will call this) ?
@Laurae2 I check the code for setinfo: https://github.com/Microsoft/LightGBM/blob/master/R-package/R/lgb.Dataset.R#L413-L451
I think the most time consuming part is the data conversion: https://github.com/Microsoft/LightGBM/blob/master/R-package/R/lgb.Dataset.R#L423-L428 .
The LGBM_DatasetSetField_R
won't be called here since we use lazy init.
@BlindApe
can you also test the time cost of label <- as.numeric(label)
in your machine ?
@guolinke
Seems the problem isn't in as.numeric(). Replacing the function with directly invoke the method has the same behavior.
In linux server:
rm(list = ls(all = TRUE)) require(lightgbm)
data <- matrix(1, 4000000, 20) label <- rep(0, 4000000) tic = proc.time()[3] label <- as.numeric(label) cat(proc.time()[3] - tic, 'secs', '\n') 0 secs print(typeof(label)) [1] "double" print(head(label)) [1] 0 0 0 0 0 0 train <- lgb.Dataset(data = data) tic = proc.time()[3] train$setinfo('label', label) cat(proc.time()[3] - tic, 'secs', '\n') 57.443 secs
In Windows server:
rm(list = ls(all = TRUE)) require(lightgbm)
data <- matrix(1, 4000000, 20) label <- rep(0, 4000000) tic = proc.time()[3] label <- as.numeric(label) cat(proc.time()[3] - tic, 'secs', '\n') 0.05 secs print(typeof(label)) [1] "double" print(head(label)) [1] 0 0 0 0 0 0 train <- lgb.Dataset(data = data) tic = proc.time()[3] train$setinfo('label', label) cat(proc.time()[3] - tic, 'secs', '\n') 93.71 secs
Time is proportional to rows and columns, and the behavior is the same both in Windows and Linux
data <- matrix(1, 4000, 20) label <- rep(0, 4000) tic = proc.time()[3] label <- as.numeric(label) cat(proc.time()[3] - tic, 'secs', '\n') 0.002 secs print(typeof(label)) [1] "double" print(head(label)) [1] 0 0 0 0 0 0 train <- lgb.Dataset(data = data) tic = proc.time()[3] train$setinfo('label', label) cat(proc.time()[3] - tic, 'secs', '\n') 0.124 secs
data <- matrix(1, 4000000, 100) label <- rep(0, 4000000) tic = proc.time()[3] label <- as.numeric(label) cat(proc.time()[3] - tic, 'secs', '\n') 0.001 secs print(typeof(label)) [1] "double" print(head(label)) [1] 0 0 0 0 0 0 train <- lgb.Dataset(data = data) tic = proc.time()[3] train$setinfo('label', label) cat(proc.time()[3] - tic, 'secs', '\n') 287.985 secs
81.27s:
require(lightgbm)
data <- matrix(1, 4000000, 20)
label <- rep(0, 4000000)
tic = proc.time()[3]
label <- as.numeric(label)
cat(proc.time()[3] - tic, 'secs', '\n')
print(typeof(label))
print(head(label))
train <- lgb.Dataset(data = data)
tic = proc.time()[3]
train$setinfo('label', label)
cat(proc.time()[3] - tic, 'secs', '\n') // 81.27 secs
but running this is really strange as it does exactly the same thing but faster!:
library(profvis)
profvis({
library(lightgbm)
data <- matrix(1, 4000000, 20)
label <- rep(1, 4000000)
train <- lgb.Dataset(data = data)
train$setinfo("label", label)
})
str(train$getinfo("label")) // CORRECT ANSWER: num [1:4000000] 1 1 1 1 1 1 1 1 1 1 ...
The behavior is really strange when out of a function and running in the global environment.
I think the message I obtained first time I tried with 1152 columns could give a tip:
Error in paste(as.character(obj), collapse = " ") : long vectors not supported yet: memory.c:1652
It's like internally all data were pasted in some way...
@Laurae2 @BlindApe
maybe the reason is the memory relocation of R6 Class (Or R?).
When set this : https://github.com/Microsoft/LightGBM/blob/master/R-package/R/lgb.Dataset.R#L431, R6 may copy the data to a new memory address. Thus, when data
is large, it will cost more time.
And it calls paste(as.character(obj), collapse = " ")
. in memory.c
, this seems is for some memory operations as well.
I ran this:
library(lightgbm)
data <- matrix(1, 4000000, 20)
label <- rep(1, 4000000)
train <- lgb.Dataset(data = data)
debugonce(train$setinfo)
train$setinfo("label", label)
It seems there is a problem with return(self)
. It takes forever from that point.
This post shows an issue with assignments on R6 objects and type checking leading to 30x slower assignments on small stuff: https://cran.r-project.org/web/packages/R6/vignettes/Performance.html
@Laurae2 I've reproduced the behaviour doing inside a naive function:
rm(list = ls(all = TRUE)) require(lightgbm) Loading required package: lightgbm Loading required package: R6 nothing <- function() { data <- matrix(1, 4000000, 100) label <- rep(0, 4000000) tic = proc.time()[3] label <- as.numeric(label) cat(proc.time()[3] - tic, 'secs', '\n') print(typeof(label)) print(head(label)) train <- lgb.Dataset(data = data) tic = proc.time()[3] train$setinfo('label', label) cat(proc.time()[3] - tic, 'secs', '\n') }
nothing() 0 secs [1] "double" [1] 0 0 0 0 0 0 0 secs
So a temporary work around is call it inside a function.
@Laurae2 What is the state of this issue now ?
@guolinke Currently can't seem to fix this, I managed to delay the issue but this is not a good workaround.
@BlindApe Your/Our workaround works until we print the train
model. You can test this code to see the issue:
lgb.unloader(wipe = TRUE)
require(lightgbm)
nothing <- function() {
data <- matrix(1, 4000000, 20)
label <- rep(0, 4000000)
tic = proc.time()[3]
label <- as.numeric(label)
cat(proc.time()[3] - tic, 'secs', '\n')
print(typeof(label))
print(head(label))
train <- lgb.Dataset(data = data)
tic = proc.time()[3]
train$setinfo('label', label)
cat(proc.time()[3] - tic, 'secs', '\n')
return(train)
}
train <- nothing() # FAST
train # SLOW
@Laurae2 So R console will print the return object, and the print will have some serialization, which causes the slow-down ?
@guolinke The cause is in the screenshot:
I have no idea what this as.character
is actually doing here. But it is directly related to print
and it explains why the duration scales with the size of the matrix(ces).
In RStudio, doing this reproduces exactly the issue:
library(profvis)
profvis({
library(lightgbm)
data <- matrix(1, 4000000, 20)
label <- rep(1, 4000000)
train <- lgb.Dataset(data = data)
train$setinfo("label", label)
print(train)
})
I think this https://stat.ethz.ch/R-manual/R-devel/library/base/html/invisible.html may can solve this issue.
@guolinke the issue is still here when returning invisibly. I start to think it is the fault of the environment / R6 object, and not of LightGBM. The print function is 100% unrelated to LightGBM, even though LightGBM seems to create the issue.
@Laurae2 after using invisible
, the problem only show up when manually print the R6 object.
@BlindApe can you try the latest code and see what happen ?
@guolinke It defers the issue when printing. We may add a warning on docs about it.
@guolinke @Laurae2
I confirm the problem is still there when printing the object (running the Laurae2 example code). I can live using invisible and without printing the objects. Adding a warning on docs would be nice.
@BlindApe Warning added in https://github.com/Microsoft/LightGBM/pull/555
close since it is R6`s bug.
I'll give a detailed enviroment sheet and a reproducible example, but meanwhile this is the problem:
level1 has 4214186 rows and 1152 columns.