microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[R-package] Weighted Training - Different results (cross entropy) when using .weight column Vs inputting the expanded data (repeated rows = .weight times) #5626

Closed AGLilly closed 1 year ago

AGLilly commented 1 year ago

Problem:

I have a binomial prediction problem. One characteristic of my data is that many rows share exactly the same dependent and independent values. To speed up training I am using weighted learning: I roll up identical rows (grouping on all columns) into a single row and use the count of duplicates as the weight. However, when I compare the results of the two exercises, they do not match, either in cross entropy or in out-of-sample error on a separate dataset.

Data Formats

Weighted Training data format

| weight | entity | outcome | covariates |
|--------|--------|---------|------------|
| 2      | E1     | 0       | EC1        |
| 3      | E2     | 1       | EC2        |

Expanded Data training format

| entity | outcome | covariates |
|--------|---------|------------|
| E1     | 0       | EC1        |
| E1     | 0       | EC1        |
| E2     | 1       | EC2        |
| E2     | 1       | EC2        |
| E2     | 1       | EC2        |

Outcome = 0 -> Failure
Outcome = 1 -> Success

The number of rows in the expanded data equals the sum of the weight column in the weighted training data.
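For reference, the roll-up I describe above looks roughly like this (a sketch with dplyr, using the column names from the tables; `expanded_data` is a placeholder, not my actual object):

```r
library(dplyr)

## Collapse duplicate rows of the expanded data into the weighted format,
## using the number of duplicates as the weight
weighted_data <- expanded_data %>%
  count(entity, outcome, covariates, name = "weight")
```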

###### Expanded data format LGB run ######

```r
Input_data <- ...  # TRAINING DATASET (expanded data format from the excel)

## Create the LGB Dataset for the expanded data format
outcome_name <- "output"
weight_name <- "weight"

Input_data_mat <- Input_data %>% select(-all_of(outcome_name))

# create the lgb.Dataset without the weight variable
lgb_data <-
  lgb.Dataset(
    data = Input_data_mat %>%
      as.matrix(),
    label = Input_data %>%
      pull(!!rlang::sym(outcome_name))
  )

params <- list(objective = "cross_entropy", max_bin = 5000,
               learning_rate = 0.01, max_depth = -1, num_leaves = 31)

## Cross-validate the model with exactly the same params and no weight variable
lgb_mod <- lgb.cv(
  params = params,
  data = lgb_data,
  nrounds = 10000,
  early_stopping_rounds = 5
)

## Compare the best iteration and best score
print(glue::glue("The best iteration is: {lgb_mod$best_iter} & best CE: {lgb_mod$best_score}"))

## Train the lgb model with the cross-validated number of rounds
lgb_trained_model <- lgb.train(params = params,
                               data = lgb_data,
                               nrounds = lgb_mod$best_iter)

## Generate OOS predictions on the test data
test_data <- ...  # TEST DATA for predictions (again in the expanded format)

test_matrix <- test_data %>%
  select(-all_of(outcome_name)) %>%
  as.matrix()

## Make predictions on the test set
predictions <- predict(lgb_trained_model, test_matrix)
```

###### WEIGHTED TRAINING SCHEME ######

```r
Input_data <- ...  # TRAINING DATASET (weighted training data format from the excel)

## Create the LGB Dataset for the weighted data format
outcome_name <- "output"
weight_name <- ".weight"

## Prepare the model matrix for input to lgb.Dataset
Input_data_mat <- Input_data %>%
  select(-all_of(outcome_name),
         -all_of(weight_name))

# create the lgb.Dataset with the weight variable
lgb_data <-
  lgb.Dataset(
    data = Input_data_mat %>%
      as.matrix(),
    label = Input_data %>%
      pull(!!rlang::sym(outcome_name)),
    weight = Input_data %>%
      pull(!!rlang::sym(weight_name))
  )

params <- list(objective = "cross_entropy", max_bin = 5000,
               learning_rate = 0.01, max_depth = -1, num_leaves = 31)

## Cross-validate the model with exactly the same params; the weight is carried by lgb_data
lgb_mod <- lgb.cv(
  params = params,
  data = lgb_data,
  nrounds = 10000,
  early_stopping_rounds = 5
)

## Compare the best iteration and best score
print(glue::glue("The best iteration is: {lgb_mod$best_iter} & best CE: {lgb_mod$best_score}"))

## Train the lgb model based on the cross-validated parameters
lgb_trained_model <- lgb.train(params = params,
                               data = lgb_data,
                               nrounds = lgb_mod$best_iter)

## Generate OOS predictions on the test data
test_data <- ...  # TEST DATA for predictions (again in the weighted format)

## Column names in the test data
outcome_name <- "nbrx_brand"
weight_name <- ".weight"

## Create the prediction matrix: remove the outcome and weight variables
test_data_pred <- test_data %>%
  select(-all_of(outcome_name),
         -all_of(weight_name)) %>%
  as.matrix()

## Make predictions on the test set
predictions <- predict(lgb_trained_model, test_data_pred)

## Put the predictions and weights in the same frame
output <- as.data.frame(cbind(test_data$.weight, predictions))

## Get the total predicted nbrx: weight * predicted probability
output$total <- output$V1 * output$predictions

## Get the total predicted sum
print(glue::glue("Total predicted output: {sum(output$total)}"))
```

When I compare the cross entropy of the two model runs, as well as the out-of-sample predictions, they are very different: the OOS predictions differ by about 50% on average.

Questions

  1. Am I doing it correctly?
  2. Is there an example in R for me to replicate?
mayer79 commented 1 year ago

Can you please add three backticks before and after the code for proper formatting?

You will need to remove all regularization parameters (min_sum_hessian etc.) to have a chance that the results match; not all of them default to 0. Interesting question, though.
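As an illustrative sketch (the parameter names are real LightGBM options, but whether this set is sufficient depends on the data), the data-size-sensitive constraints could be switched off like this:

```r
## Sketch: disable constraints whose effect depends on row counts / row sums,
## so the weighted and expanded runs have a chance of matching
params <- list(
  objective = "cross_entropy",
  learning_rate = 0.01,
  num_leaves = 31,
  min_data_in_leaf = 0,          # default 20 (a row count)
  min_sum_hessian_in_leaf = 0,   # default 1e-3 (a sum over rows)
  lambda_l1 = 0,                 # already 0 by default
  lambda_l2 = 0                  # already 0 by default
)
```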

jameslamb commented 1 year ago

Thanks for using LightGBM. Sorry it took so long for someone to respond to you here.

I've reformatted your question to make the difference between code, output from code, and your own words clearer. If you're not familiar with how to do that in markdown, please see https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax.

jameslamb commented 1 year ago

To @mayer79's point... there are many parameters in LightGBM whose impact on the final model is sensitive to the number of rows in the training data, for example:

Parameters evaluated against a sum of row-wise values (for example, min_sum_hessian_in_leaf)

Parameters evaluated against a count of rows (for example, min_data_in_leaf)

Duplicating rows changes those sums and counts. For example, imagine a dataset with 0 duplicates where you train with min_data_in_leaf = 20 (the default). LightGBM might avoid severe overfitting because it will not add splits that result in a leaf having fewer than 20 cases. Now imagine that you duplicated every row in the dataset 20 times, and retrained without changing the parameters. LightGBM might happily add splits that produced leaves which only matched 20 copies of the same data... effectively memorizing a single specific row in the training data! That'd hurt the generalizability of the trained model.

You can learn about these parameters at https://lightgbm.readthedocs.io/en/latest/Parameters.html.
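As a rough, self-contained sketch (synthetic data, not the poster's dataset), the interaction between row duplication and the default min_data_in_leaf can be seen like this:

```r
library(lightgbm)

set.seed(42)
n <- 500
x <- matrix(rnorm(n * 3), ncol = 3)
y <- as.integer(x[, 1] + rnorm(n) > 0)

params <- list(objective = "binary", learning_rate = 0.1, num_leaves = 31)

## weighted: one copy of each row, weight = 20
dtrain_weighted <- lgb.Dataset(x, label = y, weight = rep(20, n))

## expanded: every row physically repeated 20 times
idx <- rep(seq_len(n), each = 20)
dtrain_expanded <- lgb.Dataset(x[idx, ], label = y[idx])

bst_w <- lgb.train(params, dtrain_weighted, nrounds = 50, verbose = -1L)
bst_e <- lgb.train(params, dtrain_expanded, nrounds = 50, verbose = -1L)

## With the default min_data_in_leaf = 20, the expanded model can grow leaves
## that match only the 20 copies of a single original row, so the two fits
## generally differ even though the underlying information is identical.
summary(predict(bst_w, x) - predict(bst_e, x))
```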

I recommend just proceeding with the approach you described... eliminating identical rows by instead preserving only one row for each unique combination of ([features], target) and using their count (or even better, % of total rows) as a weight. That'll result in less memory usage and faster training, and you should be able to achieve good performance.
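For the "% of total rows" variant, a minimal sketch (hypothetical object and column names):

```r
## Sketch: express each unique row's weight as its share of the original
## expanded data instead of a raw duplicate count
weighted_data$weight <- weighted_data$weight / sum(weighted_data$weight)
```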


NOTE: I did not run your example code, as it's quite large and (crucially) doesn't include the actual training data. If you're able to provide a minimal, reproducible example (see the docs on what that is) showing the performance difference, I'd be happy to investigate further. Please note the word "minimal" there especially... is it really necessary to train for 10,000 rounds, with 5000 bins per feature, to demonstrate the behavior you're asking about? Is it really necessary to use {glue} just for formatting a print() statement, instead of sprintf() or paste0()? If you could cut down the example to something smaller and fully-reproducible, it'd reduce the effort required for us to help you.

mayer79 commented 1 year ago

@jameslamb: Fantastic explanation, thanks!

github-actions[bot] commented 1 year ago

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!

github-actions[bot] commented 4 weeks ago

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this one.