terU3760 closed 2 years ago
Hi @terU3760, this is related to props, i.e. the device properties/capabilities of the CUDA device you are running the benchmark on.
Basically, it's expected to be 8 (major CUDA device version) and 0 (minor CUDA device version), i.e. compute capability 8.0.
If I remember correctly, the V100 is 7.0 while the A100 is 8.0.
I wrote a small tool to get device props; I will share it later.
Also be aware that it's very likely you'll have to change the batch size to fit into the V100 memory (V100 = 16 GB; V100S = 32 GB).
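For anyone who wants to check their own hardware in the meantime, here is a minimal sketch of such a device-props check. It is not the tool mentioned above, just an illustration using PyTorch's `torch.cuda.get_device_properties`. The helper names (`meets_requirement`, `list_cuda_devices`) are made up for this example, and treating "8.0 or newer" as sufficient is an assumption; the benchmark may check for exactly 8.0.

```python
from typing import Dict, List, Tuple

# Assumption taken from this thread: the benchmark expects compute capability 8.0.
REQUIRED_CAPABILITY = (8, 0)


def meets_requirement(
    capability: Tuple[int, int],
    required: Tuple[int, int] = REQUIRED_CAPABILITY,
) -> bool:
    """Return True if (major, minor) is at least the required capability.

    Tuples compare element by element, so (7, 0) < (8, 0) <= (8, 6).
    """
    return tuple(capability) >= tuple(required)


def list_cuda_devices() -> List[Dict]:
    """Collect name, compute capability and memory for each visible CUDA device.

    Returns an empty list if PyTorch is not installed or no CUDA device is visible.
    """
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    devices = []
    for index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(index)
        devices.append(
            {
                "index": index,
                "name": props.name,
                "capability": (props.major, props.minor),
                "memory_gib": props.total_memory / 1024**3,
            }
        )
    return devices


if __name__ == "__main__":
    found = list_cuda_devices()
    if not found:
        print("No CUDA device visible (or PyTorch is not installed).")
    for dev in found:
        major, minor = dev["capability"]
        status = "OK" if meets_requirement(dev["capability"]) else "below 8.0"
        print(
            f"GPU {dev['index']}: {dev['name']}, "
            f"capability {major}.{minor} ({status}), "
            f"{dev['memory_gib']:.1f} GiB"
        )
```

The reported memory also gives a rough idea of how far the batch size may need to be reduced on a smaller card.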
Hi @jqueguiner. Thanks a lot! I have resolved it myself.
Out of curiosity, how did you solve it?
@jqueguiner Spend more money on both sides!
--verbose ?
@jqueguiner As I said: spend more money on both sides. After we replaced the V100 with an A100, the error never occurred again, which solved the problem.
Yes, so moving to the A100 gives you CUDA compute capability 8.0 ;-) Thanks a lot for the reply!
Hi all. I tried to run the mlcommons training_results_v1.0 PyTorch BERT model on multiple V100 GPUs, but it failed. I modified the run_test.sh script as follows:
and ran it, but it reports the following error:
What could be the cause?