mlcommons / algorithmic-efficiency

MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models.
https://mlcommons.org/en/groups/research-algorithms/
Apache License 2.0

Scoring code for external tuning does not match description in DOCUMENTATION.md #693

Closed: anana10c closed this issue 6 months ago

anana10c commented 6 months ago

Hello,

As we discussed in today's call, our understanding is that external tuning submissions will be scored by taking the median across five studies, where each study's score is the best time to target across its five trials. In particular, a study should qualify as long as at least one of its trials meets the target.

However, it seems that the method get_index_that_reaches_target in scoring/performance_profile.py performs an additional check requiring that at least three of the five trials in each study meet the target. See lines 147-151:

  # If less than 3 trials reach the target, the submission will be scored as
  # missing the target on this workload; return -1. Else, return the eval index
  # of the earliest point the target is reached.
  if len(target_reached) < 3:
    return -1, -1

This should be an easy fix - I think the lines can just be removed :)
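
For reference, here's a minimal sketch of the scoring rule as I understand it from DOCUMENTATION.md (the helper names and numbers below are purely illustrative, not the actual functions in scoring/):

  import numpy as np

  def best_time_in_study(trial_times):
    # trial_times: one time-to-target per trial, np.inf if the trial never
    # reached the validation target. A study qualifies as long as at least
    # one trial hits the target, so the study's time is simply the fastest trial.
    return np.min(trial_times)

  def external_tuning_workload_score(studies):
    # studies: list of per-study trial times (five studies of five trials each).
    # The workload score is the median across studies of each study's best time.
    return np.median([best_time_in_study(s) for s in studies])

  # A study with a single finite time still counts; a study where no trial
  # reaches the target contributes np.inf.
  studies = [
      [np.inf, np.inf, np.inf, np.inf, 1200.0],
      [900.0, np.inf, 950.0, np.inf, np.inf],
      [np.inf, 1100.0, np.inf, np.inf, np.inf],
      [np.inf] * 5,
      [800.0, 850.0, np.inf, np.inf, np.inf],
  ]
  print(external_tuning_workload_score(studies))
  # median of [1200.0, 900.0, 1100.0, inf, 800.0] = 1100.0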

We also noticed that get_summary_df in scoring/score_submissions.py seems to report the time to the best validation accuracy (or the relevant metric) rather than the time to the validation target, though this function is not used in any of the performance-profile computations. Here's the fix I've been using (starting at line 50):

  # Whether any evaluation step of the trial reached the validation target.
  summary_df['target reached'] = workload_df[validation_metric].apply(
      lambda x: target_op(x, validation_target)).apply(np.any)
  # Index of the first evaluation step at which the target is reached.
  target_reached_step_indicator = workload_df[validation_metric].apply(
      lambda x: target_op(x, validation_target))
  workload_df['index target reached'] = target_reached_step_indicator.apply(
      lambda x: np.argmax(x))
  # Accumulated submission time at that step; np.inf if the target was never reached.
  summary_df['submission time'] = workload_df.apply(
      lambda x: x['accumulated_submission_time'][x['index target reached']], axis=1)
  summary_df['score'] = summary_df.apply(
      lambda x: x['submission time'] if x['target reached'] else np.inf, axis=1)

Let me know if this seems correct!

priyakasimbeg commented 6 months ago

Hi, thanks again for flagging this. I'm working on fixing the trials-related bug so that it returns the best trial.

For the summary_df, the time to the best validation metric is just meant to give submitters some raw information (regardless of whether the target was achieved). I can add the columns you drafted to the summary if that would be more useful.

anana10c commented 6 months ago

I see, I had assumed that the summary_df was intended to provide the "official" score for each trial. It would be great if you could just add a comment to mention that!

priyakasimbeg commented 6 months ago

Ah ok, I think I should probably rename the 'score' column to something clearer, like 'submission time to target'.

priyakasimbeg commented 6 months ago

Fixed in https://github.com/mlcommons/algorithmic-efficiency/pull/694

Niccolo-Ajroldi commented 6 months ago

> I see, I had assumed that the summary_df was intended to provide the "official" score for each trial. It would be great if you could just add a comment to mention that!

+1 on this, it's extremely misleading rn!