microsoft / lightgbm-benchmark

Benchmark tools for LightGBM
MIT License

Reduce the number of metrics in lightgbm training to allow for distributed jobs #183

Closed jfomhover closed 3 years ago

jfomhover commented 3 years ago

The previous implementation of distributed lightgbm produced multiple metrics per node: main metric curve + training time, data loading time, etc.

When running on N nodes, this produces N*6 metrics, which quickly reaches the mlflow/azureml limits. This PR reduces the number of metrics produced by lightgbm training by:

For N nodes, this produces 5 + N metrics instead.
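To make the saving concrete, here is a small sketch of the arithmetic stated above. The function names are hypothetical and only illustrate the two formulas (N*6 before, 5 + N after); they are not part of the repository.

```python
def metrics_before(n_nodes: int) -> int:
    """Metric count under the previous implementation: 6 per node."""
    return n_nodes * 6


def metrics_after(n_nodes: int) -> int:
    """Metric count after this PR: 5 plus one per node."""
    return 5 + n_nodes


if __name__ == "__main__":
    # Compare both counts for a few cluster sizes.
    for n in (1, 4, 16, 64):
        print(f"N={n}: before={metrics_before(n)}, after={metrics_after(n)}")
```

The gap grows linearly with the number of nodes: the old count increases by 6 per added node, the new one by 1.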

Also adds exception handling around mlflow calls so that the job still completes if mlflow fails due to metric overflow.
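One way such a guard could look is sketched below. This is an assumption about the general pattern, not the code in this PR: `safe_log_metric` is a hypothetical helper that takes the logging callable as an argument (for example `mlflow.log_metric`), so that a failure in the tracking backend is reported as a warning instead of aborting training.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def safe_log_metric(
    log_fn: Callable[..., None],
    key: str,
    value: float,
    step: Optional[int] = None,
) -> bool:
    """Call log_fn(key, value, step=step); return False instead of raising.

    Hypothetical helper: any exception from the logging backend
    (e.g. a metric-count limit being exceeded) is caught and logged
    as a warning so the training job can continue.
    """
    try:
        log_fn(key, value, step=step)
        return True
    except Exception as exc:  # deliberately broad: logging must not kill the job
        logger.warning("Could not log metric %r: %s", key, exc)
        return False
```

With mlflow installed, a call might look like `safe_log_metric(mlflow.log_metric, "training_time", 12.3)`. Passing the callable in keeps the helper independent of the tracking library and easy to test with a stub.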

github-actions[bot] commented 3 years ago

Unit Test Results for Build

1 file ±0, 1 suite ±0, duration 8s (-1s)
76 tests ±0: 76 passed ±0, 0 skipped ±0, 0 failed ±0

Results for commit dee4177c. ± Comparison against base commit d359c939.
