mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

Does DLRM_v2 support H100? #635

Closed xyyintel closed 3 months ago

xyyintel commented 1 year ago

Does DLRM_v2 support H100? If it is supported, what environment did you use? I have tried CUDA 11.8 with PyTorch 1.14.0 or PyTorch 2.1, torchrec 0.3.2 or 0.4.0, and fbgemm_gpu 0.3.2 or 0.4.1. However, none of the above environments works.

erichan1 commented 1 year ago

I don't think we ever got to test this on H100. cc @janekl if you've tried it on H100.

janekl commented 1 year ago

Right, the development and testing involved only A100.

To get this working, at a minimum you would need CUDA 12 and a build of FBGEMM compiled for the Hopper architecture (SM90). But I have never tried this myself.
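A source build along these lines might work, though this is only a rough sketch and not a tested recipe: the repository URL is the real FBGEMM repo, but the CUDA toolkit path, version pins, and build flow are assumptions that may differ across fbgemm_gpu releases.

```shell
# Hypothetical build sketch: compiling fbgemm_gpu for Hopper (SM90).
# Paths and versions below are illustrative, not a verified configuration.
git clone --recursive https://github.com/pytorch/FBGEMM.git
cd FBGEMM/fbgemm_gpu

# Assumes a CUDA 12.x toolkit and a PyTorch build that matches it.
export CUDA_HOME=/usr/local/cuda-12.1

# Restrict compilation to Hopper (compute capability 9.0) so nvcc
# emits SM90 kernels instead of whatever architectures are defaulted.
export TORCH_CUDA_ARCH_LIST="9.0"

pip install -r requirements.txt
python setup.py install
```

After installing, importing `fbgemm_gpu` in Python and running a small torchrec embedding lookup on the H100 would be a quick smoke test.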

ShriyaPalsamudram commented 3 months ago

Closing, as the reference was not tested on H100s. Note that there were multiple H100 DLRMv2 submissions in the MLPerf Training v4.0 round, as shown in the results table.

Training v4.0 implementations are in this repo