Decrease criteo1tb eval bsz

mlcommons / algorithmic-efficiency

MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models.

https://mlcommons.org/en/groups/research-algorithms/

Apache License 2.0

Decrease criteo1tb eval bsz #641

Closed priyakasimbeg closed 6 months ago

priyakasimbeg commented 7 months ago

Criteo1tb runs out of memory (OOMs) during eval for some third-party training algorithms in PyTorch. We're exploring reducing the criteo1tb eval batch size (bsz) on both JAX and PyTorch.

The action items (AIs) for this issue are to:

[x] Investigate whether reducing the eval bsz by 4x significantly impacts run time.
[x] If not, update the eval bsz.
[x] Clarify in documentation that submitters do not have control over the eval bsz.
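
For anyone wanting to reproduce the first check locally, here is a minimal sketch of how the run-time impact of a 4x smaller eval bsz could be measured. It is not the benchmark's actual eval code: `num_eval_batches`, `time_eval`, and `toy_eval_step` are hypothetical helpers, and `NUM_EXAMPLES` and `BASELINE_BSZ` are made-up placeholder values, not the real criteo1tb settings. In practice `toy_eval_step` would be swapped for the workload's real eval step.

```python
import math
import time


def num_eval_batches(num_examples: int, batch_size: int) -> int:
    """Number of batches needed to cover num_examples (last one may be partial)."""
    return math.ceil(num_examples / batch_size)


def time_eval(eval_step, num_examples: int, batch_size: int) -> float:
    """Wall-clock seconds to call eval_step once per batch over the eval set."""
    start = time.perf_counter()
    for i in range(num_eval_batches(num_examples, batch_size)):
        # Size of this batch; the final batch may be smaller than batch_size.
        size = min(batch_size, num_examples - i * batch_size)
        eval_step(size)
    return time.perf_counter() - start


def toy_eval_step(size: int) -> float:
    # Hypothetical stand-in for a real forward pass over `size` examples.
    return sum(x * 0.5 for x in range(size))


if __name__ == "__main__":
    NUM_EXAMPLES = 1_000_000  # placeholder, NOT the real criteo1tb eval set size
    BASELINE_BSZ = 8_192      # placeholder, NOT the real criteo1tb eval bsz
    for bsz in (BASELINE_BSZ, BASELINE_BSZ // 4):
        secs = time_eval(toy_eval_step, NUM_EXAMPLES, bsz)
        batches = num_eval_batches(NUM_EXAMPLES, bsz)
        print(f"bsz={bsz}: {batches} batches, {secs:.3f}s")
```

A 4x smaller bsz means roughly 4x as many eval steps, so the comparison mostly shows whether per-step overhead dominates. Timings from the toy step say nothing about real hardware; they only illustrate the measurement.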