The current implementation of distributed lightgbm produces multiple metrics per node: the main metric curve, plus training time, data loading time, etc.
When running on N nodes, this produces N*6 metrics, quickly hitting the mlflow/azureml metric limits and becoming a UI nightmare. We're also hitting 439 exceptions on the mlflow calls.
The proposal is to group some of these metrics together (e.g., the per-node training times), as sketched below.
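As a rough sketch of what the grouping could look like (assuming the per-node timings can be gathered on one node; the metric name `time_training` and the `per_node_training_time` dict are illustrative, not the actual implementation), one option is to log a single metric and use mlflow's `step` argument to index the node rank:

```python
import mlflow

# Hypothetical per-node timings gathered on the head node (e.g. via an MPI gather).
# node_rank -> seconds; values here are purely illustrative.
per_node_training_time = {0: 132.4, 1: 130.9, 2: 135.1}

with mlflow.start_run():
    # One metric for all nodes: the node rank becomes the metric's step,
    # so N nodes contribute N points to a single curve instead of N metrics.
    for node_rank, seconds in per_node_training_time.items():
        mlflow.log_metric("time_training", seconds, step=node_rank)

    # Alternatively, collapse to a couple of aggregate scalars.
    values = list(per_node_training_time.values())
    mlflow.log_metric("time_training_max", max(values))
    mlflow.log_metric("time_training_mean", sum(values) / len(values))
```

Either variant shrinks each timing from N metrics to one (or two), which also reduces the number of mlflow calls that can trip the rate limit.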