microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Logging metrics and time information #4693

Open nd7141 opened 2 years ago

nd7141 commented 2 years ago

Summary

When I use CLI distributed training, I would like to write/log metrics and time information to a file during training.

What is the current method to see how my training loss evolves?

Motivation

This is a crucial part of debugging ML models: being able to see how training vs. test loss behaves makes it possible to catch overfitting. Properly saving time information is also crucial for comparing different frameworks, as well as for studying the trade-off between quality and speed. Moreover, time can be broken down into preprocessing time and tree-building time, which is important for profiling a model.

The only way I have found so far is to parse the stdout messages. Did I miss some other way to log/save metrics and time info?
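For reference, the stdout parsing mentioned above can be sketched with the standard library. The log-line format below is an assumption based on typical LightGBM CLI output for an evaluation iteration; verify it against the output of your LightGBM version before relying on it:

```python
import re

# Assumed CLI log format (not a documented contract), e.g.:
#   [LightGBM] [Info] Iteration:10, valid_1 binary_logloss : 0.478908
ITER_RE = re.compile(
    r"Iteration:(?P<iter>\d+), (?P<dataset>\S+) (?P<metric>\S+) : (?P<value>[-+\d.eE]+)"
)

def parse_metrics(log_text):
    """Extract (iteration, dataset, metric, value) rows from CLI stdout."""
    rows = []
    for line in log_text.splitlines():
        m = ITER_RE.search(line)
        if m:
            rows.append((int(m.group("iter")), m.group("dataset"),
                         m.group("metric"), float(m.group("value"))))
    return rows

demo_log = """\
[LightGBM] [Info] Iteration:10, valid_1 binary_logloss : 0.478908
[LightGBM] [Info] Iteration:20, valid_1 binary_logloss : 0.401122
"""
print(parse_metrics(demo_log))
```

Redirecting the CLI output to a file (`lightgbm config=train.conf > train.log`) and running such a parser afterwards gives a structured table, but it remains a brittle workaround compared to a native logging option.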

References

CatBoost example: https://catboost.ai/en/docs/concepts/output-data_training-log

shiyu1994 commented 2 years ago

@nd7141 Thanks for using LightGBM. As far as I know, in the CLI version there's no way to store the intermediate results of all iterations in a structured data format. Gently pinging @StrikerRUS to confirm. But I agree that storing intermediate results in something like a JSON file would be very useful.

nd7141 commented 2 years ago

Thanks @shiyu1994. How hard would it be to implement logging to a file?

Also, I can see that it's possible to log metrics into the evals_result dict (https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html), but is it possible to log time, and in particular the breakdown between building a tree and preprocessing the data?

shiyu1994 commented 2 years ago

@nd7141 We are focusing on several large pull requests at the moment, so we could perhaps schedule the implementation of saving results to a file in the CLI version for next month. Contributions are very welcome.

Yes, with Python we can use `evals_result`, but it does not natively support recording time. A simple workaround is to write a custom evaluation function like this:

```python
import time

start_time = time.time()

def feval_time(preds, data):
    return 'time', time.time() - start_time, True
```

Then specify `feval=feval_time` in `lgb.train`; time is treated as a metric and recorded in the `evals_result` dict.

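For illustration, here is a stdlib-only sketch of how such an `feval` accumulates into the `evals_result` structure (`{dataset: {metric: [values]}}`). The training loop below is a mock standing in for `lgb.train`, not real LightGBM:

```python
import time

start_time = time.time()

def feval_time(preds, data):
    # Custom "metric" as above: elapsed wall-clock time since start.
    return 'time', time.time() - start_time, True

# Mock of what lgb.train does with feval on each evaluation round:
evals_result = {'valid_0': {'time': []}}
for iteration in range(5):
    time.sleep(0.01)                      # stand-in for building one tree
    name, value, _ = feval_time(None, None)
    evals_result['valid_0'][name].append(value)

print(evals_result['valid_0']['time'])    # elapsed time per iteration
```

Because each recorded value is total elapsed time, per-iteration durations can be recovered by differencing consecutive entries.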
nd7141 commented 2 years ago

Thanks @shiyu1994. Can you please point me to the right files to look at to introduce the logging?

StrikerRUS commented 2 years ago

> As far as I know, in the CLI version there's no way to store the intermediate results of all iterations in a structured data format.

Yeah, that's right.

There is a special compilation option, `-DUSE_TIMETAG=ON`, that makes LightGBM print timings:

> Users who want to perform benchmarking can make LightGBM output time costs for different internal routines by adding -DUSE_TIMETAG=ON to CMake flags.

https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html
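For a Linux build, passing that flag looks roughly like the following. The build steps are paraphrased from the Installation Guide; exact commands vary by platform and LightGBM version, so treat this as a sketch:

```shell
# Clone with submodules and configure a timing-instrumented build.
git clone --recursive https://github.com/microsoft/LightGBM
cd LightGBM
cmake -B build -S . -DUSE_TIMETAG=ON
cmake --build build
```

With this build, the CLI prints per-routine time costs at the end of training, which still end up in stdout rather than a structured file.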

shiyu1994 commented 2 years ago

@nd7141 In the CLI version, the metrics are logged here: https://github.com/microsoft/LightGBM/blob/d88b44566e5ec1013b1ea4a669366cebadd77879/src/boosting/gbdt.cpp#L517 and the times are logged here: https://github.com/microsoft/LightGBM/blob/d88b44566e5ec1013b1ea4a669366cebadd77879/src/boosting/gbdt.cpp#L275 We could store this information in an internal data structure of GBDT, and add a new parameter that lets users specify a JSON file to log this information into.
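To make the proposal concrete, a hypothetical schema for such a JSON log might look like the sketch below. The field names are purely illustrative; nothing in LightGBM emits this format today:

```python
import json

# Hypothetical per-iteration training log -- an illustration of what the
# proposed CLI parameter could write, not an existing LightGBM format.
training_log = {
    "iterations": [
        {
            "iteration": 1,
            "metrics": {
                "train": {"binary_logloss": 0.55},
                "valid": {"binary_logloss": 0.57},
            },
            "time": {"tree_construction_sec": 0.031, "total_sec": 0.040},
        },
    ],
}

with open("training_log.json", "w") as f:
    json.dump(training_log, f, indent=2)

# Round-trip to show the file is structured and machine-readable.
with open("training_log.json") as f:
    loaded = json.load(f)
print(loaded["iterations"][0]["metrics"]["valid"]["binary_logloss"])
```

A structure like this would cover both of the original requests: per-iteration train/valid metrics for overfitting analysis, and a time breakdown for profiling.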

Blaizzy commented 2 years ago

Hi @nd7141

Prince Canuma here, a Data Scientist at Neptune.ai

I would like to understand why you would want to log your metrics to a file. Is it a preference? What is your exact use case here?

Cheers,

Blaizzy commented 2 years ago

Hi @nd7141 Just checking in to see if you still need help with this question or if you need anything else.