microsoft / lightgbm-benchmark

Benchmark tools for LightGBM
MIT License

Reduce the number of metrics produced by distributed training #184

Closed jfomhover closed 3 years ago

jfomhover commented 3 years ago

The current implementation of distributed LightGBM produces multiple metrics per node: the main metric curve, plus training time, data loading time, etc.

When running on N nodes, this produces N*6 metrics, quickly reaching the mlflow/azureml limits and becoming a UI nightmare. We're also hitting 439 exceptions in the mlflow calls.
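For illustration, here is a minimal sketch of the kind of per-node logging pattern that leads to this explosion; the timing names, node count, and values are hypothetical, not taken from the benchmark code:

```python
import random
import mlflow

NUM_NODES = 8  # hypothetical cluster size
TIMINGS = ["train_time", "data_load_time", "init_time",
           "feature_time", "boosting_time", "save_time"]  # hypothetical timing names

with mlflow.start_run():
    for rank in range(NUM_NODES):
        for name in TIMINGS:
            # One distinct metric key per node and per timing:
            # NUM_NODES * len(TIMINGS) metrics for a single run.
            mlflow.log_metric(f"node_{rank}/{name}", random.random())
```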

The proposal is to group some of these metrics together (for example, the training times), as in the sketch below.
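A minimal sketch of one way such a grouping could look, assuming the per-node timings are reported with `mlflow.log_metric` and the node rank is used as the metric step (the timing names and values are again hypothetical):

```python
import random
import mlflow

NUM_NODES = 8  # hypothetical cluster size
TIMINGS = ["train_time", "data_load_time"]  # hypothetical timing names

with mlflow.start_run():
    for name in TIMINGS:
        for rank in range(NUM_NODES):
            # One metric series per timing; the node rank becomes the MLflow
            # "step", so N nodes contribute N points to one curve instead of
            # N separate metric keys.
            mlflow.log_metric(name, random.random(), step=rank)
```

With this shape, the number of metric keys stays constant as the cluster grows, which should keep runs well under the mlflow/azureml limits.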