Reduce the number of metrics in lightgbm training to allow for distributed jobs

microsoft / lightgbm-benchmark

Benchmark tools for LightGBM

MIT License


The previous implementation of distributed lightgbm training logged several metrics per node: the main metric curve plus training time, data loading time, etc.

When running on N nodes, this produces N*6 metrics, quickly reaching the mlflow/azureml limits. This PR reduces the number of metrics produced by lightgbm training by:

adding the node prefix only to the training metrics (logged by the callback)
logging all other metrics under the same name on every node, using step to differentiate between nodes.

For N nodes, this produces 5 + N metrics instead.
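The naming scheme above can be illustrated with a minimal sketch. This is not the code from this PR: the function `log_node_metrics`, the `node_{i}.` prefix format, the metric names in `SHARED_METRIC_NAMES`, and the choice of the node index as the step value are all assumptions made for illustration. The logging function is passed in as a parameter, so something like `mlflow.log_metric` could be plugged in.

```python
from typing import Callable, Dict, List

# Hypothetical names: the PR only says there are five non-callback metrics
# per node (training time, data loading time, etc.).
SHARED_METRIC_NAMES = [
    "time_training",
    "time_data_loading",
    "shared_metric_3",
    "shared_metric_4",
    "shared_metric_5",
]


def log_node_metrics(
    log_fn: Callable[..., None],
    node_index: int,
    shared_metrics: Dict[str, float],
    eval_curve: List[float],
    eval_name: str = "metric",
) -> None:
    """Log one node's metrics using the reduced naming scheme."""
    # Shared metrics keep the same name on every node; the step identifies
    # the node (assumption: step == node index).
    for name, value in shared_metrics.items():
        log_fn(name, value, step=node_index)

    # Only the training curve gets a node prefix; here the step is the
    # boosting iteration.
    for iteration, value in enumerate(eval_curve):
        log_fn(f"node_{node_index}.{eval_name}", value, step=iteration)
```

With this layout, each extra node adds exactly one new metric name (its prefixed training curve), which is where the 5 + N count comes from.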

This PR also adds exception handling around the mlflow calls, so the job still completes if mlflow fails due to metric overflow.
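One way to guard the logging calls is sketched below. The helper `safe_log_metric` is hypothetical and not taken from this PR; it wraps any logging callable that accepts `(key, value, step=...)`, such as `mlflow.log_metric`, and turns a failure into a warning instead of a crash.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def safe_log_metric(
    log_fn: Callable[..., None],
    key: str,
    value: float,
    step: Optional[int] = None,
) -> bool:
    """Call log_fn and report success; never let a logging error propagate."""
    try:
        log_fn(key, value, step=step)
        return True
    except Exception as exc:  # broad on purpose: training must not fail here
        logger.warning("Could not log metric %s=%s (step=%s): %s", key, value, step, exc)
        return False
```

Catching a broad `Exception` is a deliberate trade-off in this sketch: a lost metric is treated as less costly than a failed distributed training job.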

Reduce the number of metrics in lightgbm training to allow for distributed jobs #183