mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

Update documentation for training to be accurate #425

Closed · TheKanter closed this 2 months ago

TheKanter commented 3 years ago

https://github.com/mlperf/training/blob/master/README.md seems stale.

Should we list the current training benchmarks/datasets/accuracy?

johntran-nv commented 1 year ago

I agree that this README is not useful at the moment. I think most of the useful information is in the training_policies repo, which raises the question: why does that need to be a separate repo? At minimum, we should point people to the training rules and the contributing guidelines. But maybe we could consider merging the repos as well? Does anyone have history/context on why they need to be separate?

TheKanter commented 1 year ago

@petermattson originally set it up so that:

  1. Training rules
  2. Submission rules
  3. Training code

were all separate.

The submission rules are used by other benchmarks (e.g., inference, HPC).

The training rules are used by other benchmarks (e.g., HPC).

So we have an inheritance scheme that makes things convoluted and hard to understand.

Additionally, it is difficult in GitHub to enforce cross-repo checks (e.g., if we wanted a checker that would ensure training code and rules are consistent).
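For illustration, a checker like that would have to fetch both repositories itself, since an ordinary GitHub status check only sees the repo it runs in. A minimal sketch in Python, assuming both repos are checked out side by side (the paths and the rules-doc filename here are assumptions, not the real layout):

```python
#!/usr/bin/env python3
"""Hypothetical cross-repo consistency check: flag any benchmark
directory in the training repo that the rules doc never mentions."""
from pathlib import Path
import sys

TRAINING_REPO = Path("training")                           # assumed checkout location
RULES_DOC = Path("training_policies/training_rules.adoc")  # assumed filename

def main() -> int:
    rules_text = RULES_DOC.read_text(encoding="utf-8")
    # Treat each non-hidden top-level directory as a benchmark.
    benchmarks = [p.name for p in TRAINING_REPO.iterdir()
                  if p.is_dir() and not p.name.startswith(".")]
    missing = [b for b in benchmarks if b not in rules_text]
    for b in missing:
        print(f"benchmark directory '{b}' is not mentioned in the rules doc")
    return 1 if missing else 0

if __name__ == "__main__":
    sys.exit(main())
```

A CI job that clones both repos could run this and fail the check on a nonzero exit code.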

I think it is possible to revisit, but it would impact all benchmarks and require a big refactoring. It could also significantly enhance understandability.

I understand the idea that having a single place to change things is attractive, but that conceptually favors writes (change rules) over reads (understand rules).

petermattson commented 1 year ago

The write benefit is less about reducing work and more about knowledge sharing: find an issue in one place and propagate the fix everywhere.

That said, we're probably over-shared right now.

I believe there's a way to do a document "#include", which could potentially let us have one doc per benchmark that pulls a few well-defined pieces from other places (see the sketch below).

This would let us increase or decrease sharing gradually.
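For what it's worth, the training_policies rules are AsciiDoc, which has a native `include::` directive, but as far as I know GitHub's web renderer does not expand includes, so some pre-build step would likely be needed either way. A minimal sketch of such a step in Python, using an invented `<!-- include: file -->` directive and hypothetical file names:

```python
#!/usr/bin/env python3
"""Hypothetical doc-include expander: assembles one rules doc per
benchmark from shared fragments. The directive syntax is invented."""
import re
from pathlib import Path

INCLUDE_RE = re.compile(r"<!--\s*include:\s*(\S+)\s*-->")

def expand(path: Path) -> str:
    """Recursively replace each include directive with that file's contents."""
    def repl(match: re.Match) -> str:
        return expand(path.parent / match.group(1))
    return INCLUDE_RE.sub(repl, path.read_text(encoding="utf-8"))

if __name__ == "__main__":
    # e.g. bert_rules.md pulls in shared_submission_rules.md and
    # shared_training_rules.md plus its own benchmark-specific text.
    print(expand(Path("bert_rules.md")))
```

Sharing then becomes a per-benchmark choice of which fragments to pull in.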


peladodigital commented 1 year ago

In an effort to clean up the GitHub repo so we can maintain it better going forward, the MLPerf Training working group is closing out issues older than two years, since much has changed in the benchmark suite. If you think this issue is still relevant, please feel free to reopen it. Even better, please come to the working group meeting to discuss your issue.

TheKanter commented 1 year ago

This needs to be fixed. Please have @johntran-nv or @erichan1 put it on the agenda.

hiwotadese commented 2 months ago

Closing this because the README now lists the current benchmarks and datasets. If there is something specific that is not listed in the README, please create a new issue.