mnaumovfb closed this pull request 3 years ago.
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Looks good to me.
recheck
Hi @mnaumovfb! I think the master branch in the mlcommons repo has a different version of one of the commits. The message for the commit is "Updating DLRM (logging, part 2) (#397)". The hash for the mlcommons repo commit is f7b53362 whereas in your repo this same commit has a different hash: 15b1d23. As far as I see they only differ by the Author field. One easy way to handle this is to do a fresh clone from the mlcommons repo, only make that one modification (bump the dlrm submodule) and then force-push to this Pull Request. Do you think it'll work? If yes, could you please do that modification so that we can merge this?
Let me try to do it today. I'll keep you posted about it.
In this update we implement two changes:
The sparse weight update: added an option to convert the sparse gradient to dense or to coalesce it, depending on its size and the large_grad_threshold parameter. Coalesced or dense updates have better numerical properties than sparse uncoalesced weight updates. This option is controlled by the --mlperf-coalesce-sparse-grads command-line argument. https://github.com/facebookresearch/dlrm/commit/2dd5acff265c218a801a9806dc378ac2073f65fc
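For illustration, a minimal PyTorch sketch of this kind of sparse-gradient handling. The helper name `process_sparse_grad` and the choice of which branch handles small vs. large gradients are assumptions for the example, not the actual DLRM implementation:

```python
import torch

def process_sparse_grad(grad: torch.Tensor, large_grad_threshold: int) -> torch.Tensor:
    """Sketch: densify or coalesce a sparse gradient before the optimizer step.

    Note: the size-based branching below is an assumption for illustration;
    the real DLRM code is linked in the commit above.
    """
    if not grad.is_sparse:
        return grad
    if grad._nnz() < large_grad_threshold:
        # Converting to dense sums duplicate indices implicitly and lets the
        # optimizer apply a plain dense update.
        return grad.to_dense()
    # coalesce() sums values that share an index, which is numerically better
    # behaved than applying the raw uncoalesced sparse update.
    return grad.coalesce()
```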
The gradient accumulation change: added an option to accumulate gradients across multiple mini-batches before making an optimizer step to update the weights, thereby simulating a larger mini-batch size. This option is controlled by the --mlperf-grad-accum-iter command-line argument. https://github.com/facebookresearch/dlrm/commit/1302c71624fa9dbe7f0c75fea719d5e58d33e059
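For context, a minimal PyTorch sketch of gradient accumulation along these lines. The training-loop names (model, optimizer, loss_fn, data_loader) are generic placeholders, and grad_accum_iter simply mirrors the idea behind the --mlperf-grad-accum-iter flag:

```python
import torch

def train_one_epoch(model, optimizer, loss_fn, data_loader, grad_accum_iter: int):
    """Sketch: accumulate gradients over grad_accum_iter mini-batches per step."""
    optimizer.zero_grad()
    for step, (x, y) in enumerate(data_loader):
        # Scale the loss so the accumulated gradient matches one large batch.
        loss = loss_fn(model(x), y) / grad_accum_iter
        loss.backward()  # gradients accumulate in .grad across iterations
        if (step + 1) % grad_accum_iter == 0:
            optimizer.step()       # one weight update for grad_accum_iter mini-batches
            optimizer.zero_grad()  # start accumulating the next "virtual" batch
```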