Open ThomasWolf0701 opened 4 years ago
From my understanding, when the mlr3 data.table backend is used, your code applies some preprocessing steps: it transforms the data.table into a data.frame, converts that into a numerical format usable by lightgbm, and then passes it through as.matrix() to the lgb.Dataset function. Is that right?
If the user uses mlr3 with a DataBackendMatrix, this matrix could be passed directly to the lgb.Dataset function without as.matrix(). The sparsity would then even be preserved, using the canonical mlr3 way.
I'm not sure it is worth setting dgCMatrix as the default format. Maybe additional parameterization is required? And one more if-else statement in https://github.com/mlr3learners/mlr3learners.lightgbm/blob/development/R/backend_preprocessing.R, where all preprocessing steps should be moved.
I am currently working on it! @statist-bhfz, good idea to move everything to backend_preprocessing; however, I need to figure out how best to do it, since we currently seem to need the as.matrix() function for passing data.tables to lgb.Dataset.
@kapsner as.matrix() is not mandatory, lgb.Dataset() also supports dgCMatrix objects: https://lightgbm.readthedocs.io/en/latest/R/reference/lgb.Dataset.html
mlr3 has a DataBackendMatrix backend which stores the data in sparse format, but this format is Matrix::sparseMatrix(). I think the most convenient solution is to use data.table as the only backend for the lightgbm learner and allow switching between matrix and dgCMatrix in the learner's parameter list.
Some time ago I wrote a simple lightgbm wrapper for my tiny ML framework, and some parts of the code look very similar to your current implementation.
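The switch between matrix and dgCMatrix suggested above could look roughly like the following. This is only a minimal sketch: `to_lgb_input` and its `format` argument are hypothetical names for illustration, not part of mlr3learners.lightgbm, and it assumes all feature columns are already numeric. The lgb.Dataset() call is guarded so that the conversion itself can be tried with only the Matrix package installed.

```r
library(Matrix)

# Hypothetical helper (not part of mlr3learners.lightgbm): convert numeric
# feature columns to either a dense matrix or a sparse dgCMatrix.
to_lgb_input <- function(data, feature_names, format = c("matrix", "dgCMatrix")) {
  format <- match.arg(format)
  # as.data.frame() lets this work for data.frame and data.table inputs alike
  dense <- as.matrix(as.data.frame(data)[, feature_names, drop = FALSE])
  stopifnot(is.numeric(dense))
  if (format == "matrix") {
    return(dense)
  }
  # Coercion chain that yields a dgCMatrix without relying on direct
  # as(x, "dgCMatrix") coercion, which newer Matrix versions discourage
  as(as(as(dense, "dMatrix"), "generalMatrix"), "CsparseMatrix")
}

# Toy data, made up for this example
features <- data.frame(x1 = c(0, 0, 3, 0), x2 = c(1, 0, 0, 0), x3 = c(0, 0, 0, 2))
label <- c(0, 1, 0, 1)

x_dense <- to_lgb_input(features, c("x1", "x2", "x3"), format = "matrix")
x_sparse <- to_lgb_input(features, c("x1", "x2", "x3"), format = "dgCMatrix")

# Only build the dataset if lightgbm is installed
if (requireNamespace("lightgbm", quietly = TRUE)) {
  dtrain <- lightgbm::lgb.Dataset(data = x_sparse, label = label, free_raw_data = FALSE)
}
```

With something like this, the learner could keep data.table as its only backend and expose the format as a single parameter, defaulting to the dense matrix so that existing behaviour does not change.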
Indeed, that's correct.
The problem is that this
data = task$data(
  cols = task$feature_names,
  data_format = "Matrix"
)
does not work with data.table backends (there is no internal transformation) and I need to figure out a different solution for allowing both data backends.
(https://github.com/mlr3learners/mlr3learners.lightgbm/blob/master/R/LearnerClassifLightGBM.R#L651)
I could be wrong, but it's necessary to specify the matrix backend instead of data.table during task construction to get it to work:
data = task$data(
  cols = task$feature_names,
  data_format = "Matrix"
)
Possibly DataBackendMatrix support is not the most requested option? I don't see any advantages for https://mlr3.mlr-org.com/reference/DataBackendMatrix.html compared to staying with the data.table backend followed by a (sparse) matrix transformation inside the learner.
DataBackend for Matrix. Data is split into a (numerical) sparse part and an optional dense part. These parts are automatically merged to a sparse format during $data(). Note that merging both parts potentially comes with a data loss, as all dense columns are converted to numeric columns.
Potential data loss is quite a serious contraindication!
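The kind of loss the documentation warns about can be illustrated with a base R analogy. This is not mlr3's internal code, only a sketch of what forcing mixed columns into one numeric matrix does; the data frame is made up for the example.

```r
# A data frame with one factor column and one numeric column
df <- data.frame(
  colour = factor(c("red", "blue", "red")),
  size = c(1.5, 0, 2)
)

# data.matrix() forces every column to numeric; the factor becomes integer codes
m <- data.matrix(df)
print(m)

# The level labels ("blue", "red") cannot be recovered from the matrix alone
stopifnot(is.numeric(m))
```

After the conversion the `colour` column holds only the codes of the factor levels, so anything downstream sees an ordinary numeric feature unless the level mapping is stored separately.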
This is how tidymodels seems to handle this issue, but to my understanding this would not be consistent with how mlr3 was designed. If the user already prepared the data as a numeric matrix the data loss should not occur. For factors it would anyway be the data.table backend.
The current code transforms all data into a matrix with as.matrix():
private$dtrain = lightgbm::lgb.Dataset(
  data = as.matrix(data[, task$feature_names, with = F]),
  label = label,
  free_raw_data = FALSE
)
But both mlr3 and the lightgbm R package support sparse matrices; for mlr3, see https://mlr3.mlr-org.com/reference/DataBackendMatrix.html
It would be great if sparse matrices (dgCMatrix) were supported, maybe via as(data, "sparseMatrix") or similar.
Would be really great if this would be supported.