mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

DLRM Update #449

Closed. mnaumovfb closed this 3 years ago.

mnaumovfb commented 3 years ago

In this update we implement two changes:

The sparse weight update: Added an option to either convert a sparse gradient to dense or coalesce it, depending on its size and the large_grad_threshold parameter. Coalesced or dense updates have better numerical properties than uncoalesced sparse weight updates. This option is controlled by the --mlperf-coalesce-sparse-grads command-line argument. https://github.com/facebookresearch/dlrm/commit/2dd5acff265c218a801a9806dc378ac2073f65fc
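For reference, here is a minimal PyTorch sketch of the idea. The function name, the loop over model parameters, and the exact rule (densify gradients of small tensors, coalesce those of large ones) are assumptions for illustration; only the large_grad_threshold name comes from the description above, and the linked commit is authoritative.

```python
import torch

def coalesce_or_densify_grads(model: torch.nn.Module, large_grad_threshold: int) -> None:
    """Illustrative sketch of --mlperf-coalesce-sparse-grads (not the actual code).

    Sparse gradients from nn.Embedding(sparse=True) can contain duplicate
    indices; summing those duplicates (coalescing), or densifying the
    gradient outright, yields a numerically cleaner update than applying
    the uncoalesced entries one by one.
    """
    for param in model.parameters():
        grad = param.grad
        if grad is None or not grad.is_sparse:
            continue
        if param.numel() < large_grad_threshold:  # assumed size rule
            # Small parameter: converting to dense is cheap and exact.
            param.grad = grad.to_dense()
        else:
            # Large parameter (e.g. an embedding table): keep it sparse,
            # but merge duplicate indices so each row is updated once.
            param.grad = grad.coalesce()
```

A pass like this would run after loss.backward() and before optimizer.step(), so the optimizer only ever sees coalesced or dense gradients.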

The gradient accumulation change: Added an option to accumulate gradients across multiple mini-batches before taking an optimizer step to update the weights, thereby simulating a larger mini-batch size. This option is controlled by the --mlperf-grad-accum-iter command-line argument. https://github.com/facebookresearch/dlrm/commit/1302c71624fa9dbe7f0c75fea719d5e58d33e059
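Again as a sketch only: this option amounts to the standard PyTorch gradient-accumulation pattern. All names below are illustrative except grad_accum_iter, which mirrors the command-line argument; whether the actual implementation scales the loss by the accumulation factor is an assumption here.

```python
import torch

def train_one_epoch(model: torch.nn.Module,
                    loss_fn,
                    optimizer: torch.optim.Optimizer,
                    loader,
                    grad_accum_iter: int) -> None:
    """Illustrative sketch of --mlperf-grad-accum-iter (not the actual code)."""
    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(loader):
        loss = loss_fn(model(inputs), targets)
        # Scaling by grad_accum_iter keeps the accumulated gradient equal to
        # the mean over the simulated, larger mini-batch (assumes loss_fn
        # already averages over the mini-batch).
        (loss / grad_accum_iter).backward()
        # Between steps, .backward() keeps summing into param.grad, so one
        # optimizer step every grad_accum_iter batches behaves like a single
        # step on a mini-batch that is grad_accum_iter times larger.
        if (i + 1) % grad_accum_iter == 0:
            optimizer.step()
            optimizer.zero_grad()
```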

github-actions[bot] commented 3 years ago

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

tgrel commented 3 years ago

Looks good to me.

sub-mod commented 3 years ago

recheck

tgrel commented 3 years ago

Hi @mnaumovfb! I think the master branch in the mlcommons repo has a different version of one of the commits. The commit message is "Updating DLRM (logging, part 2) (#397)". In the mlcommons repo its hash is f7b53362, whereas in your repo the same commit has a different hash: 15b1d23. As far as I can see, they differ only in the Author field. One easy way to handle this is to do a fresh clone from the mlcommons repo, make only that one modification (bump the dlrm submodule), and then force-push to this pull request. Do you think that will work? If so, could you please make that change so that we can merge this?

mnaumovfb commented 3 years ago

Let me try to do it today. I'll keep you posted.