Closed jfomhover closed 3 years ago
1 files ±0 1 suites ±0 8s :stopwatch: -1s 76 tests ±0 76 :heavy_check_mark: ±0 0 :zzz: ±0 0 :x: ±0
Results for commit dee4177c. ± Comparison against base commit d359c939.
The previous implementation of distributed lightgbm produced multiple metrics per node: the main metric curve plus training time, data loading time, etc.
When running on N nodes, this produces N*6 metrics, which quickly reaches the mlflow/azureml limits. This PR reduces the number of metrics produced by lightgbm training by:
For N nodes, this produces 5 + N metrics instead.
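To make the difference concrete, here is a small arithmetic sketch comparing the two counts quoted above. The function names are hypothetical and only restate the N*6 and 5 + N formulas from this description; they are not part of the PR's code.

```python
def metrics_before(n_nodes: int, per_node: int = 6) -> int:
    """Metric count with the previous behavior: 6 metrics on every node."""
    return n_nodes * per_node


def metrics_after(n_nodes: int, shared: int = 5) -> int:
    """Metric count after this PR, per the description: 5 + N."""
    return shared + n_nodes


if __name__ == "__main__":
    # Print the two counts side by side for a few cluster sizes.
    for n in (1, 4, 16, 64):
        print(f"N={n}: before={metrics_before(n)}, after={metrics_after(n)}")
```

The count still grows linearly with N, but with a slope of 1 instead of 6.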
Also adds exception handling around mlflow calls so that the job still completes if mlflow fails due to metric overflow.
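A minimal sketch of what guarding a metric call could look like, assuming the logging function has the same shape as `mlflow.log_metric(key, value, step=None)`. The helper name `safe_log_metric` is hypothetical and is not taken from this PR; the logging callable is passed in so the sketch runs without mlflow installed.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def safe_log_metric(
    log_fn: Callable[..., None],
    key: str,
    value: float,
    step: Optional[int] = None,
) -> bool:
    """Call log_fn(key, value, step=step); return False instead of raising.

    log_fn is assumed to behave like mlflow.log_metric. Any exception it
    raises is logged as a warning so that training can continue.
    """
    try:
        log_fn(key, value, step=step)
        return True
    except Exception as exc:  # deliberately broad: logging must not kill the job
        logger.warning("Could not log metric %s=%s: %s", key, value, exc)
        return False
```

With mlflow available, this would be called as `safe_log_metric(mlflow.log_metric, "training_time", 12.3)`. Catching a broad `Exception` is a deliberate trade-off here: a lost metric is treated as less costly than a failed training job.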