mlcommons / algorithmic-efficiency

MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models.
https://mlcommons.org/en/groups/research-algorithms/
Apache License 2.0
321 stars, 62 forks

Training does not stop at target achievement #678

Closed: Niccolo-Ajroldi closed this issue 6 months ago

Niccolo-Ajroldi commented 6 months ago

Description

Training does not stop when both the validation and test targets are reached, so the final score is always equal to max_allowed_runtime. The issue appears both in the wandb logs and in standard output.

Source or Possible Fix

By inspecting submission_runner.py, I spotted a bug in the code. Simplified, this is what train_once currently looks like:

goals_reached = ...
while not goals_reached:
  # train

However, goals_reached is never updated after it is first assigned, so the loop cannot exit on target achievement. This is what the block should look like:

goals_reached = ...
while not goals_reached:
  # train
  goals_reached = ...
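
For illustration, here is a minimal runnable sketch of the corrected control flow. It is not the actual submission_runner.py code: `run_training_loop`, `train_step`, `evaluate`, the metric keys, and the `max_steps` budget (standing in for max_allowed_runtime) are all hypothetical names, and it assumes higher metric values are better. It only shows that the flag has to be recomputed inside the loop.

```python
def run_training_loop(train_step, evaluate, val_target, test_target, max_steps):
    """Hypothetical stand-in for train_once: stop once both targets are reached."""
    goals_reached = False
    step = 0
    # max_steps is an assumed stand-in for the runtime budget.
    while not goals_reached and step < max_steps:
        train_step(step)
        step += 1
        metrics = evaluate(step)
        # The fix: recompute the flag on every iteration.
        goals_reached = (metrics["validation"] >= val_target
                         and metrics["test"] >= test_target)
    return step, goals_reached


# Toy usage: the test metric lags the validation metric by one step.
steps_run, reached = run_training_loop(
    train_step=lambda step: None,
    evaluate=lambda step: {"validation": step, "test": step - 1},
    val_target=5,
    test_target=5,
    max_steps=100,
)
```

Without the assignment inside the loop, the same toy run would only end when the budget of 100 steps is exhausted.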

priyakasimbeg commented 6 months ago

Thanks for catching this and proposing a fix! I'll merge in a fix since the CLA process might take a few days.

Note that this bug does not affect the 'actual' score, though. The code in scoring/ does check the conditions correctly. Also, the scoring only uses the validation target metric.
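
As a rough sketch of that idea, and not the code in scoring/, a time-to-target computation that looks only at the validation metric could take the shape below. The function name, the record keys, the "higher is better" comparison, and the use of infinity for an unreached target are all assumptions made for this example.

```python
import math


def time_to_validation_target(eval_log, validation_target):
    """Return the accumulated time of the first eval that meets the validation target.

    Hypothetical helper: eval_log is assumed to be a time-ordered list of
    dicts with "time" and "validation" keys; higher is assumed to be better.
    """
    for record in eval_log:
        if record["validation"] >= validation_target:
            return record["time"]
    # Assumed convention: an unreached target yields an infinite time.
    return math.inf


# Made-up eval records, for illustration only.
eval_log = [
    {"time": 100.0, "validation": 0.60},
    {"time": 200.0, "validation": 0.75},
    {"time": 300.0, "validation": 0.80},
]
score = time_to_validation_target(eval_log, validation_target=0.75)
```

Because a computation like this depends only on when the target is first met in the logs, a run that keeps training past that point gets the same result.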