microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

After saving the datasets to disk, training with validation on the test set doesn't work (bins mapper issue) #6561

Open jaguerrerod opened 4 months ago

jaguerrerod commented 4 months ago

Description

I have very big train and test datasets (> 500 GB), so I need to construct the lgb.Dataset objects from CSV files. I use the 'two_round = TRUE' parameter to save RAM and leave free_raw_data at its default. After several hours I get the datasets and save them to disk. When I load them in a clean session and construct them with lgb.Dataset.construct(), training with validation doesn't work.

Reproducible example

require(data.table)
require(lightgbm)

set.seed(1)
train <- data.table(target = runif(10000),
                    v1 = sample(1:5, size = 10000, replace = TRUE),
                    v2 = sample(1:5, size = 10000, replace = TRUE),
                    v3 = sample(1:5, size = 10000, replace = TRUE))

test <- data.table(target = runif(5000),
                    v1 = sample(1:5, size = 5000, replace = TRUE),
                    v2 = sample(1:5, size = 5000, replace = TRUE),
                    v3 = sample(1:5, size = 5000, replace = TRUE))

# Save to file, as the real datasets are > 500 GB
fwrite(train, file = 'train.csv')
fwrite(test, file = 'test.csv')

# Creating lgb.Dataset
d_train <- lgb.Dataset(data = 'train.csv',
                       free_raw_data = TRUE,
                       params = list(header = TRUE, two_round = TRUE, feature_pre_filter = FALSE))
d_test <- lgb.Dataset.create.valid(dataset = d_train,
                                   data = 'test.csv',
                                   params = list(header = TRUE, two_round = TRUE, feature_pre_filter = FALSE))

# After several hours we save the binary datasets to disk
lgb.Dataset.save(d_train, 'train.ds')
lgb.Dataset.save(d_test, 'test.ds')

# Now in a clean session we load datasets
d_train <- lgb.Dataset('train.ds')
lgb.Dataset.construct(d_train)
d_test <- lgb.Dataset('test.ds')
lgb.Dataset.construct(d_test)

# Training a model
params <- list(objective = 'mse',
               learning_rate = 0.01,
               num_threads = 10,
               max_depth = 6,
               num_leaves = 2^6,
               metric = 'mse',
               num_round = 5)

lgb_model <- lgb.train(data = d_train, params = params, verbose = 1, valids = list('test' = d_test))

Error in valid_data$set_reference(data): set_reference: cannot set reference after freeing raw data, please set ‘free_raw_data = FALSE’ when you construct lgb.Dataset

Environment info

R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C               LC_TIME=es_ES.UTF-8        LC_COLLATE=es_ES.UTF-8    
 [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=es_ES.UTF-8    LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Madrid
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] lightgbm_4.4.0    data.table_1.14.8

loaded via a namespace (and not attached):
[1] compiler_4.4.1    R6_2.5.1          Matrix_1.5-4      parallel_4.4.1    tools_4.4.1       rstudioapi_0.15.0
[7] grid_4.4.1        jsonlite_1.8.7    lattice_0.22-5  

Additional Comments

free_raw_data = FALSE doesn't make sense here, as the whole problem is the size of the datasets. Why not provide a get_bins_mapper() function to extract the bin mappers from a dataset, and a parameter to pass them to the validation dataset? I think this would be better than the current embedded bin-mapper synchronization.
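A rough sketch of the proposed API (hypothetical: neither the getter nor the parameter below exists in lightgbm today; this is only what the extraction/setting could look like):

# Hypothetical getter: extract the bin mappers from the constructed train dataset
mappers <- lgb.Dataset.get.bin.mappers(d_train)

# Hypothetical parameter: build the validation dataset with the same bin mappers,
# without needing access to the raw data of d_train
d_test <- lgb.Dataset(data = 'test.csv',
                      params = list(header = TRUE, two_round = TRUE, bin_mappers = mappers))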

What can I do? I've tried all the options for working with big data, such as loading from disk, using two rounds, setting free_raw_data = TRUE, and saving the datasets to disk, and it still doesn't work due to the bin mapper issue.

An important point: d_train and d_test work fine for training with validation right after constructing them. The problem only appears after saving both to disk with lgb.Dataset.save() and loading them again. It seems lgb.Dataset.save() doesn't save the bin mappers correctly.
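A possible workaround sketch, untested: in the clean session, reload only the saved train dataset and rebuild the validation dataset from the raw test.csv with lgb.Dataset.create.valid(), so the bin mappers come directly from the reloaded train dataset. I don't know whether this actually avoids the set_reference error, and it requires re-reading test.csv in every session:

# Untested sketch: reload the saved train dataset only
d_train <- lgb.Dataset('train.ds')
lgb.Dataset.construct(d_train)

# Rebuild the validation set from the raw CSV against the reloaded train dataset
d_test <- lgb.Dataset.create.valid(dataset = d_train, data = 'test.csv',
                                   params = list(header = TRUE, two_round = TRUE, feature_pre_filter = FALSE))
lgb.Dataset.construct(d_test)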

jaguerrerod commented 4 months ago

I tried a workaround: use a reduced train dataset (I know the bins are the same, as all features are integers with exactly 9 distinct values) and create the holdout dataset from this reduced train set (sketched after the error below), but internally LightGBM noticed the trick:

[LightGBM] [Fatal] Cannot add validation data, since it has different bin mappers with training data
Error in booster$add_valid(data = reduced_valid_sets[[key]], name = key): 
  Cannot add validation data, since it has different bin mappers with training data
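Roughly what the attempted workaround looks like (sketch; 'reduced_train.csv' is a hypothetical small sample of the full training file):

# Build a small train dataset that should produce the same bins
d_train_reduced <- lgb.Dataset(data = 'reduced_train.csv',
                               params = list(header = TRUE, two_round = TRUE, feature_pre_filter = FALSE))

# Create the holdout from the reduced train dataset instead of the full one
d_test <- lgb.Dataset.create.valid(dataset = d_train_reduced, data = 'test.csv',
                                   params = list(header = TRUE, two_round = TRUE, feature_pre_filter = FALSE))

# Training with the full d_train and this d_test fails with the error above:
# LightGBM requires train and validation to share exactly the same bin mappers.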

Please expose the bin mappers through a function (to extract them) and a parameter (to set them) in lgb.Dataset(). Working with big data is impossible because of the bin mappers.