Can you please add three backticks before and after the code for proper formatting?
You will need to remove all regularizations like `min_sum_hessian` etc. to have a chance that the results match. Not all of them default to 0. Interesting question, though.
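For concreteness, a minimal sketch of what relaxing those regularizations could look like in R (the specific parameters and values here are an illustration, not an exhaustive or authoritative list):

```r
# LightGBM regularization-style parameters whose defaults are NOT 0 and
# which depend on row counts or per-leaf hessian sums; relaxing them gives
# the weighted and expanded runs a chance to produce matching results.
params <- list(
  objective = "binary",
  min_data_in_leaf = 1,         # default 20: minimum row count per leaf
  min_sum_hessian_in_leaf = 0,  # default 1e-3: minimum hessian sum per leaf
  min_data_in_bin = 1,          # default 3: minimum row count per histogram bin
  lambda_l1 = 0,                # 0 by default
  lambda_l2 = 0                 # 0 by default
)
```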
Thanks for using LightGBM. Sorry it took so long for someone to respond to you here.
I've reformatted your question to make the difference between code, output from code, and your own words clearer. If you're not familiar with how to do that in markdown, please see https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax.
To @mayer79's point... there are many parameters in LightGBM whose impact on the final model produced is sensitive to the number of rows in the training data, for example:
Parameters evaluated against a sum of row-wise values:

* `min_sum_hessian_in_leaf`
* `min_gain_to_split`

Parameters evaluated against a count of rows:

* `min_data_in_leaf`
* `min_data_in_bin`
* `min_data_per_group` (for categorical features)

Duplicating rows changes those sums and counts. For example, imagine a dataset with 0 duplicates where you train with `min_data_in_leaf = 20` (the default). LightGBM might avoid severe overfitting because it will not add splits that result in a leaf having fewer than 20 cases. Now imagine that you duplicated every row in the dataset 20 times, and retrained without changing the parameters. LightGBM might happily add splits that produced leaves which only matched 20 copies of the same data... effectively memorizing a single specific row in the training data! That'd hurt the generalizability of the trained model.
You can learn about these parameters at https://lightgbm.readthedocs.io/en/latest/Parameters.html.
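To illustrate the point, here is a small sketch on toy data (parameter choices are assumptions, not a recipe) comparing a run where duplicate counts are supplied as weights against a run on the physically duplicated rows. With the row-count and hessian-sum thresholds relaxed so they never bind, the two boosters should produce nearly identical predictions; with the defaults left in place, they may diverge:

```r
library(lightgbm)

# toy binary classification data (hypothetical)
set.seed(1)
X <- matrix(rnorm(100 * 3), ncol = 3)
y <- as.integer(X[, 1] + rnorm(100, sd = 0.5) > 0)

# "expanded" version: every row repeated 5 times
idx   <- rep(seq_len(nrow(X)), each = 5)
X_exp <- X[idx, ]
y_exp <- y[idx]

# thresholds based on row counts / hessian sums relaxed so they never bind
params <- list(
  objective = "binary",
  learning_rate = 0.1,
  min_data_in_leaf = 1,
  min_sum_hessian_in_leaf = 0,
  min_data_in_bin = 1
)

# run 1: one copy of each row, duplicate count supplied as a weight
bst_weighted <- lgb.train(
  params  = params,
  data    = lgb.Dataset(X, label = y, weight = rep(5, nrow(X))),
  nrounds = 20
)

# run 2: physically duplicated rows, no weights
bst_expanded <- lgb.train(
  params  = params,
  data    = lgb.Dataset(X_exp, label = y_exp),
  nrounds = 20
)

# differences should be near 0 (up to floating-point effects)
summary(abs(predict(bst_weighted, X) - predict(bst_expanded, X)))
```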
I recommend just proceeding with the approach you described... eliminating identical rows by instead preserving only one row for each unique combination of ([features], target)
and using their count (or even better, % of total rows) as a weight. That'll result in less memory usage and faster training, and you should be able to achieve good performance.
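As a sketch of that workflow in base R (column names `x1`, `x2`, and `y` are placeholders for your actual features and target, not names from your data):

```r
library(lightgbm)

# hypothetical raw data containing many exact-duplicate (features, target) rows
raw <- data.frame(
  x1 = sample(0:1, 10000, replace = TRUE),
  x2 = sample(1:5, 10000, replace = TRUE),
  y  = sample(0:1, 10000, replace = TRUE)
)

# keep one row per unique (features, target) combination,
# with the number of occurrences as a case weight
agg <- aggregate(n ~ x1 + x2 + y, data = transform(raw, n = 1L), FUN = sum)

dtrain <- lgb.Dataset(
  data   = as.matrix(agg[, c("x1", "x2")]),
  label  = agg$y,
  weight = agg$n            # or agg$n / sum(agg$n) for a share of total rows
)

bst <- lgb.train(
  params  = list(objective = "binary"),
  data    = dtrain,
  nrounds = 50
)
```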
NOTE: I did not run your example code, as it's quite large and (crucially) doesn't include the actual training data. If you're able to provide a minimal, reproducible example (docs on what that is) showing the performance difference, I'd be happy to investigate further. Please note the word "minimal" there especially... is it really necessary to train for 10,000 rounds, with 5000 bins per feature, to demonstrate the behavior you're asking about? Is it really necessary to use `{glue}` just for formatting a `print()` statement, instead of `sprintf()` or `paste0()`? If you could cut down the example to something smaller and fully reproducible, it'd reduce the effort required for us to help you.
@jameslamb: Fantastic explanation, thanks!
This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
Problem:
I have a binomial prediction problem. One characteristic of my problem is that I have many rows of data with the same dependent and independent variable values. I am trying to speed up learning by using weighted learning (by aggregating the identical data rows together (rolling up on all columns) and giving the count of rows as weights). However, when I compare the results of the two exercises, I do not get the same results in terms of cross entropy or out-of-sample error on a different dataset.
Data Formats
Weighted Training data format
Expanded Data training format
The number of rows in the expanded data is equal to the sum of the weight variable in the weighted training data.
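For illustration (hypothetical column names, not the actual data), the relationship between the two formats is:

```r
# weighted format: one row per unique (features, target) combination
weighted <- data.frame(
  x1     = c(0, 0, 1),
  x2     = c(2, 5, 2),
  y      = c(1, 0, 1),
  weight = c(3, 5, 2)
)

# expanded format: each row repeated 'weight' times
expanded <- weighted[rep(seq_len(nrow(weighted)), weighted$weight),
                     c("x1", "x2", "y")]

# number of rows in expanded data == sum of the weight variable
stopifnot(nrow(expanded) == sum(weighted$weight))
```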
When I compare the cross entropy of the two model runs, as well as the out-of-sample predictions, they are very different. OOS predictions differ by 50% on average.
Questions