Training does not stop at target achievement

mlcommons / algorithmic-efficiency

MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models.

Apache License 2.0

Description

Training does not stop when both the validation and test targets are reached, so the final score is always equal to max_allowed_runtime. The issue is visible both in the wandb logs and in standard output.

Source or Possible Fix

While inspecting submission_runner.py, I spotted a bug. Simplified, this is what train_once looks like now:

goals_reached = ...
while not goals_reached:
    # train

However, goals_reached is never updated after its declaration, so the loop condition cannot become false once training has started! This is what the block should look like:

goals_reached = ...
while not goals_reached:
    # train
    goals_reached = ...
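
To make the difference concrete, here is a minimal toy sketch of the two variants. It is not the actual code from submission_runner.py: the function signature, the step budget (max_steps, standing in for max_allowed_runtime), and the targets_hit_at and update_flag parameters are all hypothetical, and the real stopping condition is elided above.

```python
def train_once(max_steps, targets_hit_at, update_flag):
    """Toy stand-in for a training loop; returns the number of steps taken.

    max_steps plays the role of the runtime budget, targets_hit_at is the
    (hypothetical) step at which both targets are first met, and update_flag
    toggles whether goals_reached is refreshed inside the loop.
    """
    step = 0
    goals_reached = False  # evaluated once, before the loop
    while not goals_reached and step < max_steps:
        step += 1  # stand-in for one training step
        if update_flag:
            # The fix: re-evaluate the stopping condition after each step.
            goals_reached = step >= targets_hit_at
    return step


# Without the update, the loop only ends when the budget is exhausted.
buggy_steps = train_once(max_steps=100, targets_hit_at=10, update_flag=False)
# With the update, the loop ends as soon as the targets are met.
fixed_steps = train_once(max_steps=100, targets_hit_at=10, update_flag=True)
print(buggy_steps, fixed_steps)
```

In this sketch the buggy variant always consumes the full budget, which matches the symptom of the score always being equal to max_allowed_runtime.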

Training does not stop at target achievement #678
