mlcommons / logging

MLPerf™ logging library
https://mlcommons.org/en/groups/best-practices-benchmark-infra
Apache License 2.0

Unet3d rcp fix v3.1 #327

Closed mmarcinkiewicz closed 1 year ago

mmarcinkiewicz commented 1 year ago

UNET3D RCPs don't exhibit the typical convex curve on a plot of epochs to converge vs GBS. Instead, its convergence is a bit chaotic. This is a result of the dataset being very small and of training happening at GBS values that are a large fraction of the dataset (1/3 or even 1/2). Because of that, RCP pruning doesn't work as expected, potentially passing questionable results. Here's the current plot of epochs to converge vs GBS: image

My proposal is to replace epochs to converge with samples to converge (as BERT and LLM currently do). To do that, I simply multiply the epochs to converge by the samples per epoch. The new plot would look like this: image. The behaviour is much more stable, with no visible dips that could cause RCP pruning to malfunction.
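As a rough illustration of the conversion described above, here is a minimal sketch. The names `SAMPLES_PER_EPOCH`, `samples_to_converge`, and `example_epochs` are hypothetical and not taken from the logging library; the value 168 and the per-GBS epoch numbers are made-up placeholders, not real RCP data.

```python
# Assumption: number of training samples seen in one epoch.
# 168 is a placeholder; substitute the real training-set size.
SAMPLES_PER_EPOCH = 168


def samples_to_converge(epochs_to_converge, samples_per_epoch=SAMPLES_PER_EPOCH):
    """Convert an epochs-to-converge figure into samples to converge."""
    return epochs_to_converge * samples_per_epoch


# Made-up epochs-to-converge means keyed by global batch size (GBS).
example_epochs = {32: 1900.0, 56: 2300.0, 84: 2600.0}

# Same points expressed in samples instead of epochs.
example_samples = {
    gbs: samples_to_converge(epochs) for gbs, epochs in example_epochs.items()
}

for gbs, samples in sorted(example_samples.items()):
    print(f"GBS={gbs}: samples to converge = {samples:.0f}")
```

The conversion is a single multiplication per RCP point, so existing epoch-based records could in principle be translated without rerunning any training.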

If the PR is approved, I'll prepare appropriate changes to rules and the reference.

github-actions[bot] commented 1 year ago

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

nv-rborkar commented 1 year ago

Training WG 08/17: Generally accepted as ok in meeting.

pgmpablo157321 commented 1 year ago

@mmarcinkiewicz Is this RCP update meant for training v3.1? In that case could you update your branch and move the changes into the training-3.1.0 folder?

nv-rborkar commented 1 year ago

@pgmpablo157321 changes have been moved to the 3.1.0 folder. Please take a look.