mlr3learners / mlr3learners.lightgbm

Learners from {lightgbm} for mlr3
GNU Lesser General Public License v2.1

Sparse Matrix Support #10

Open ThomasWolf0701 opened 4 years ago

ThomasWolf0701 commented 4 years ago

The current code transforms all data into a matrix with as.matrix():

    private$dtrain = lightgbm::lgb.Dataset(
      data = as.matrix(data[, task$feature_names, with = F]),
      label = label,
      free_raw_data = FALSE
    )

But both mlr3 and the lightgbm R package support sparse matrices:

https://lightgbm.readthedocs.io/en/latest/R/reference/lgb.Dataset.html

data: a matrix object, a dgCMatrix object or a character representing a filename

and mlr3 https://mlr3.mlr-org.com/reference/DataBackendMatrix.html

It would be really great if sparse matrices (dgCMatrix) were supported, maybe via as(data, "sparseMatrix") or something similar.
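For illustration, here is a minimal sketch of what the request amounts to, assuming the features are already numeric. The toy data and variable names are made up; only Matrix::Matrix() and lightgbm::lgb.Dataset() are real API calls, and the lightgbm step is skipped when the package is not installed.

```r
library(Matrix)

# Toy numeric feature matrix that is mostly zeros (illustrative data only).
x_dense <- matrix(0, nrow = 6, ncol = 4,
                  dimnames = list(NULL, paste0("f", 1:4)))
x_dense[cbind(1:6, c(1, 2, 3, 4, 1, 2))] <- 1:6
label <- c(0, 1, 0, 1, 0, 1)

# Convert to sparse storage instead of keeping the dense matrix.
# For a non-square numeric matrix this yields a dgCMatrix.
x_sparse <- Matrix::Matrix(x_dense, sparse = TRUE)
class(x_sparse)

# lgb.Dataset() accepts the dgCMatrix directly, so as.matrix() is not needed.
if (requireNamespace("lightgbm", quietly = TRUE)) {
  dtrain <- lightgbm::lgb.Dataset(
    data = x_sparse,
    label = label,
    free_raw_data = FALSE
  )
}
```

Only the six non-zero entries are stored in x_sparse, which is the memory saving the dense as.matrix() path gives up.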

ThomasWolf0701 commented 4 years ago

From my understanding, when the mlr3 data.table-based backend is used, your code runs some preprocessing steps that turn the data.table into a data.frame, convert it into a numerical format usable by lightgbm, and then pass it through as.matrix() to the lgb.Dataset function. Is that right?

If the user uses mlr3 with a DataBackendMatrix, this matrix could be passed directly to the lgb.Dataset function without as.matrix(). The sparsity would then be preserved, and it would follow the canonical mlr3 way.

statist-bhfz commented 4 years ago

I'm not sure it is worth making dgCMatrix the default format. Maybe an additional parameter is required? It would also need one more if-else statement in https://github.com/mlr3learners/mlr3learners.lightgbm/blob/development/R/backend_preprocessing.R, where all preprocessing steps should be moved.

kapsner commented 4 years ago

I am currently working on it! @statist-bhfz, good idea to move everything into backend_preprocessing. However, I need to figure out how best to do it, since we currently seem to need the as.matrix() function for passing data.tables to lgb.Dataset.

statist-bhfz commented 4 years ago

@kapsner as.matrix() is not mandatory, lgb.Dataset() also supports dgCMatrix objects: https://lightgbm.readthedocs.io/en/latest/R/reference/lgb.Dataset.html

mlr3 has the DataBackendMatrix backend, which stores the data in sparse format, but that format is the one produced by Matrix::sparseMatrix(). I think the most convenient solution is to use data.table as the only backend for the lightgbm learner and to allow switching between matrix and dgCMatrix in the learner's parameter list.

Some time ago I wrote a simple lightgbm wrapper for my tiny ML framework, and some parts of the code look very similar to your current implementation.
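The switch between a dense matrix and a sparse one proposed above could look roughly like the sketch below. The helper name features_to_matrix() and its sparse argument are hypothetical and are not part of mlr3learners.lightgbm. The sketch assumes all feature columns are numeric and contain no missing values, and it uses a plain data.frame as a stand-in for the data.table that the learner would receive.

```r
library(Matrix)

# Hypothetical helper: convert an all-numeric feature table into either a
# dense base matrix or a sparse dgCMatrix, depending on a learner-level switch.
features_to_matrix <- function(features, sparse = FALSE) {
  is_num <- vapply(features, is.numeric, logical(1))
  if (!all(is_num)) {
    stop("Non-numeric columns must be encoded first: ",
         paste(names(features)[!is_num], collapse = ", "))
  }
  dense <- as.matrix(features)
  storage.mode(dense) <- "double"
  if (!sparse) {
    return(dense)
  }
  # Assumption of this sketch: no missing values in the features.
  stopifnot(!anyNA(dense))
  # Build the sparse matrix from the positions of the non-zero entries.
  idx <- which(dense != 0, arr.ind = TRUE)
  Matrix::sparseMatrix(
    i = idx[, 1], j = idx[, 2], x = dense[idx],
    dims = dim(dense), dimnames = dimnames(dense)
  )
}

features <- data.frame(a = c(0, 0, 3, 0), b = c(1, 0, 0, 0), c = c(0, 2, 0, 0))
m_dense  <- features_to_matrix(features, sparse = FALSE)
m_sparse <- features_to_matrix(features, sparse = TRUE)
```

Either result can be given to lgb.Dataset() as its data argument, so the choice could be exposed as a single learner parameter while the rest of the preprocessing stays the same.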

kapsner commented 4 years ago

Indeed, that's correct.

The problem is that this

    data = task$data(
      cols = task$feature_names,
      data_format = "Matrix"
    )

does not work with data.table backends (there is no internal conversion), so I need to find a different solution that allows both data backends.

(https://github.com/mlr3learners/mlr3learners.lightgbm/blob/master/R/LearnerClassifLightGBM.R#L651)

statist-bhfz commented 4 years ago

I could be wrong, but I think the matrix backend has to be specified instead of data.table during task construction to get this to work:

    data = task$data(
      cols = task$feature_names,
      data_format = "Matrix"
    )

Possibly DataBackendMatrix support is not the most requested option? I don't see any advantage of DataBackendMatrix (https://mlr3.mlr-org.com/reference/DataBackendMatrix.html) over staying with the data.table backend and doing the (sparse) matrix transformation inside the learner. From its documentation:

DataBackend for Matrix. Data is split into a (numerical) sparse part and an optional dense part. These parts are automatically merged to a sparse format during $data(). Note that merging both parts potentially comes with a data loss, as all dense columns are converted to numeric columns.

Potential data loss is quite a serious contraindication!
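To make the concern concrete, here is a small base-R illustration of what forcing non-numeric columns to numeric can lose. It uses data.matrix() on made-up data purely as an analogy and does not reproduce the internals of DataBackendMatrix.

```r
# A dense part with one numeric and one factor column (illustrative data).
dense_part <- data.frame(
  dose  = c(0.5, 1.0, 2.0),
  group = factor(c("low", "high", "low"))
)

# Forcing every column to numeric keeps only the factor's integer codes.
as_numeric <- data.matrix(dense_part)
as_numeric[, "group"]  # 2 1 2: the labels "low" and "high" are gone
```

The codes follow the alphabetical level order ("high" = 1, "low" = 2), so the numeric column no longer carries the original labels and suggests an ordering that was never intended.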

ThomasWolf0701 commented 4 years ago

This is how tidymodels seems to handle this issue, but to my understanding it would not be consistent with how mlr3 was designed. If the user has already prepared the data as a numeric matrix, the data loss should not occur. For factors, the data.table backend would be used anyway.