Closed mmarcinkiewicz closed 1 year ago
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅
Training WG 08/17: Generally accepted as ok in meeting.
@mmarcinkiewicz Is this RCP update meant for training v3.1? In that case could you update your branch and move the changes into the training-3.1.0 folder?
@pgmpablo157321 The changes have been moved to the 3.1.0 folder. Please take a look.
UNET3D RCPs don't exhibit the typical convex curve on a plot of epochs to converge vs. GBS. Instead, UNET3D convergence is somewhat chaotic: the dataset is very small, and training happens at GBS values that are a large fraction of the dataset (1/3 or even 1/2). Because of that, RCP pruning doesn't work as expected and can potentially pass questionable results. Here's the current plot of epochs to converge vs. GBS:
My proposal is to replace epochs to converge with samples to converge (as BERT and LLM currently do). To do that, I simply multiply epochs to converge by samples per epoch. The new plot would look like: The behaviour is much more stable, and there are no visible dips that would cause RCP pruning to malfunction.
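The conversion described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names, the `SAMPLES_PER_EPOCH` value of 168 (assumed here to be the UNET3D training set size), and the epoch counts are all assumptions made for the example and are not taken from the RCP files.

```python
from fractions import Fraction

# Assumption: number of training samples in one UNET3D epoch.
# Replace with the value used by the reference implementation.
SAMPLES_PER_EPOCH = 168


def samples_to_converge(epochs_to_converge, samples_per_epoch=SAMPLES_PER_EPOCH):
    """Convert a list of epochs-to-converge values into samples-to-converge."""
    return [epochs * samples_per_epoch for epochs in epochs_to_converge]


def gbs_fraction_of_dataset(gbs, samples_per_epoch=SAMPLES_PER_EPOCH):
    """Return the global batch size as an exact fraction of one epoch."""
    return Fraction(gbs, samples_per_epoch)


if __name__ == "__main__":
    # Hypothetical epoch counts, for illustration only.
    example_epochs = [1500, 1720, 2100]
    print(samples_to_converge(example_epochs))

    # Under the assumed dataset size, GBS 56 and 84 cover 1/3 and 1/2 of an epoch.
    print(gbs_fraction_of_dataset(56), gbs_fraction_of_dataset(84))
```

Because the multiplier is constant, the conversion only rescales the values within a single RCP entry; any change in how the curve looks comes from comparing the converted values across GBS.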
If the PR is approved, I'll prepare appropriate changes to rules and the reference.