mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

The default training script of DLRM v2 does not reach the reported AUC. #634

Closed · Kevin0624 closed this 2 months ago

Kevin0624 commented 1 year ago

Hi Teams,

I have run the default training script with the following changes, based on the results table:

  1. GLOBAL_BATCH_SIZE=16384
  2. WORLD_SIZE=4 (4 A100 40GB GPUs)

P.S. I did not use the unique flags, because I could not find the corresponding arguments in dlrm_main.py.

The training result is a test AUC of 79.86% (target: 80.30%).

Does anyone have any idea?

erichan1 commented 1 year ago

cc @janekl Any thoughts on this?

janekl commented 1 year ago

Hello, first I have two questions that will hopefully help us figure this out more effectively:

  1. Could you share the exact command you tried?
  2. What are "global_batch_size" and "opt_base_learning_rate" in the logs produced?

I expect that (batch size, learning rate) = (16384, 0.004) should work reasonably well and stably. But bear in mind that results may vary from run to run -- as the model is initialized randomly -- so it's best to run it several times.
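If it helps, here is a minimal sketch of repeating the run with a few seeds via torchrun to gauge run-to-run variance. The dlrm_main.py flag names below (`--batch_size`, `--learning_rate`, `--seed`) are assumptions, and whether `--batch_size` means global or per-rank batch size depends on the script, so check `python dlrm_main.py --help` first:

```python
# Sketch only: repeat the run with a few seeds to gauge run-to-run AUC variance.
# The dlrm_main.py flag names below (--batch_size, --learning_rate, --seed) are
# assumptions, and --batch_size may be global or per rank depending on the script;
# verify against `python dlrm_main.py --help` before relying on this.
import subprocess

for seed in (0, 1, 2):
    log_path = f"run_seed{seed}.log"
    with open(log_path, "w") as log:
        subprocess.run(
            [
                "torchrun", "--nproc_per_node=4", "dlrm_main.py",
                "--batch_size", "16384",
                "--learning_rate", "0.004",
                "--seed", str(seed),
            ],
            stdout=log,
            stderr=subprocess.STDOUT,
            check=True,
        )
```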

Also, note that the threshold is 0.80275, not 0.803.

Finally, for MLPerf you should look at the "eval_accuracy" logs for the validation set, not the test set (it is better just not to use the --evaluate_on_training_end flag, to avoid confusion here).
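
For reference, a rough way to pull `global_batch_size`, `opt_base_learning_rate`, and the best validation `eval_accuracy` out of the produced logs, assuming the usual mlperf-logging format where JSON payloads are prefixed with `:::MLLOG` (the log path is hypothetical):

```python
# Rough helper for checking the logged hyperparameters and the best validation
# eval_accuracy against the 0.80275 threshold. Assumes mlperf-logging's usual
# format: JSON payloads prefixed with ":::MLLOG" somewhere in each log line.
import json

def mllog_events(path):
    with open(path) as f:
        for line in f:
            marker = line.find(":::MLLOG")
            if marker != -1:
                yield json.loads(line[marker + len(":::MLLOG"):])

events = list(mllog_events("run_seed0.log"))  # hypothetical log path
for key in ("global_batch_size", "opt_base_learning_rate"):
    values = [e["value"] for e in events if e.get("key") == key]
    print(key, "=", values[0] if values else "not logged")

aucs = [e["value"] for e in events if e.get("key") == "eval_accuracy"]
best = max(aucs) if aucs else float("nan")
print(f"best eval_accuracy = {best:.5f} (threshold 0.80275, reached: {best >= 0.80275})")
```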

kkkparty commented 5 months ago

> I have run the default training script with the following changes, based on the results table: GLOBAL_BATCH_SIZE=16384, WORLD_SIZE=4 (4 A100 40GB GPUs). [...] The training result is a test AUC of 79.86% (target: 80.30%).

Did you load the dense part? I ended up with only the sparse weights and no dense ones. Could you show me how to check the dense weights?
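
One quick way to check is to inspect the checkpoint's state dict. This is a sketch assuming a standard `torch.save` checkpoint; the path `model.ckpt` and the `sparse_arch` prefix follow TorchRec's DLRM naming and may differ in your setup:

```python
# A quick sanity check on whether a checkpoint contains dense weights at all.
# The checkpoint path and the "sparse_arch" prefix follow TorchRec's DLRM naming
# and are assumptions -- adjust both to match how your run saved its model.
import torch

state = torch.load("model.ckpt", map_location="cpu")
state_dict = state.get("model_state_dict", state)  # unwrap if nested in a dict

dense = {n: t for n, t in state_dict.items() if "sparse_arch" not in n}
sparse = {n: t for n, t in state_dict.items() if "sparse_arch" in n}
print(f"{len(dense)} dense tensors, {len(sparse)} sparse tensors")

for name, tensor in list(dense.items())[:10]:
    # An abs-mean of 0 here would suggest the dense part was never loaded.
    print(name, tuple(tensor.shape), f"abs-mean={tensor.float().abs().mean():.4g}")
```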

ShriyaPalsamudram commented 2 months ago

@Kevin0624, has this been resolved? Closing, as it has been more than a year since the last activity.