mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

Does DLRM_v2 support H100? #635

Open xyyintel opened 1 year ago

xyyintel commented 1 year ago

Does DLRM_v2 support H100? If so, which environment did you use? I have tried CUDA 11.8 with PyTorch 1.14.0 or 2.1, torchrec 0.3.2 or 0.4.0, and fbgemm_gpu 0.3.2 or 0.4.1, but none of these combinations works.

erichan1 commented 1 year ago

I don't think we ever got to test this on H100. cc @janekl in case you've tried it on H100.

janekl commented 1 year ago

Right, the development and testing involved only A100.

To run on H100, you would at least need CUDA 12 and to compile FBGEMM for the Hopper architecture (SM90). But I have never tried this myself.
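
As a rough way to check the first half of that requirement, the sketch below compares the GPU's compute capability with the architectures the installed PyTorch build was compiled for. It is only an illustration: the helper names (`arch_tag`, `build_supports_device`, `cuda_major`) are made up for this example, it ignores PTX forward compatibility (`compute_XY` entries), and it says nothing about whether the installed fbgemm_gpu binary includes SM90 kernels, which would still need to be verified separately.

```python
# Hypothetical helpers, not part of the DLRM_v2 reference code.

def arch_tag(capability):
    """Turn a (major, minor) compute capability into a tag like 'sm_90'."""
    major, minor = capability
    return f"sm_{major}{minor}"


def build_supports_device(capability, compiled_archs):
    """True if the build lists native kernels for this exact architecture."""
    return arch_tag(capability) in compiled_archs


def cuda_major(version):
    """Major CUDA version from a string like '12.1', or None if unknown."""
    return int(version.split(".")[0]) if version else None


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is None or not torch.cuda.is_available():
        print("PyTorch with a visible CUDA device is needed for this check.")
    else:
        cap = torch.cuda.get_device_capability(0)   # (9, 0) on Hopper GPUs
        archs = torch.cuda.get_arch_list()          # e.g. ['sm_80', 'sm_90']
        print("device arch:        ", arch_tag(cap))
        print("PyTorch CUDA major: ", cuda_major(torch.version.cuda))
        print("compiled archs:     ", archs)
        print("native kernels:     ", build_supports_device(cap, archs))
```

If the last line prints `False` on an H100, the PyTorch build itself has no native SM90 kernels, so rebuilding FBGEMM alone would not be enough.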