tud-zih-energy / FIRESTARTER

FIRESTARTER: A Processor Stress Test Utility. This repository contains the source code generator. Our releases (including the generated source code and precompiled binaries) are available at https://tu-dresden.de/zih/firestarter/.
GNU General Public License v3.0

Running on multiple A100s #36

Closed dasantonym closed 2 months ago

dasantonym commented 2 years ago

I tried to run the prebuilt CUDA 11 binary on an Ubuntu 20.04 system with CUDA 11.6 and eight A100 GPUs.

When it gets to the GPUs, the following error occurs:

  graphics processor characteristics:
Error: CUBLAS error at /home/runner/work/FIRESTARTER/FIRESTARTER/src/firestarter/Cuda/Cuda.cpp:311: error code = 1 (CUBLAS_STATUS_NOT_INITIALIZED), device index: 7
Error: CUBLAS error at /home/runner/work/FIRESTARTER/FIRESTARTER/src/firestarter/Cuda/Cuda.cpp:311: error code = 1 (CUBLAS_STATUS_NOT_INITIALIZED), device index: 2
Error: CUBLAS error at /home/runner/work/FIRESTARTER/FIRESTARTER/src/firestarter/Cuda/Cuda.cpp:311: error code = 1 (CUBLAS_STATUS_NOT_INITIALIZED), device index: 3
Error: CUBLAS error at /home/runner/work/FIRESTARTER/FIRESTARTER/src/firestarter/Cuda/Cuda.cpp:311: error code = 1 (CUBLAS_STATUS_NOT_INITIALIZED), device index: 4
Error: CUBLAS error at /home/runner/work/FIRESTARTER/FIRESTARTER/src/firestarter/Cuda/Cuda.cpp:311: error code = 1 (CUBLAS_STATUS_NOT_INITIALIZED), device index: 5
Error: CUBLAS error at /home/runner/work/FIRESTARTER/FIRESTARTER/src/firestarter/Cuda/Cuda.cpp:311: error code = 1 (CUBLAS_STATUS_NOT_INITIALIZED), device index: 6
Segmentation fault (core dumped)

There seems to be no error for device 1. nvidia-smi reports all eight GPUs normally.
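To make it easier to see which devices appear in the log, here is a minimal Python sketch that extracts the device indexes from the error lines quoted above. The helper names `failing_devices` and `silent_devices` are hypothetical, and the sketch assumes that the reported "device index" values are zero-based and cover 0 to 7 on an eight-GPU system. Note that a missing error line does not prove a device initialized correctly, since the process ended with a segmentation fault.

```python
import re

# Error lines copied verbatim from the report above.
LOG = """\
Error: CUBLAS error at /home/runner/work/FIRESTARTER/FIRESTARTER/src/firestarter/Cuda/Cuda.cpp:311: error code = 1 (CUBLAS_STATUS_NOT_INITIALIZED), device index: 7
Error: CUBLAS error at /home/runner/work/FIRESTARTER/FIRESTARTER/src/firestarter/Cuda/Cuda.cpp:311: error code = 1 (CUBLAS_STATUS_NOT_INITIALIZED), device index: 2
Error: CUBLAS error at /home/runner/work/FIRESTARTER/FIRESTARTER/src/firestarter/Cuda/Cuda.cpp:311: error code = 1 (CUBLAS_STATUS_NOT_INITIALIZED), device index: 3
Error: CUBLAS error at /home/runner/work/FIRESTARTER/FIRESTARTER/src/firestarter/Cuda/Cuda.cpp:311: error code = 1 (CUBLAS_STATUS_NOT_INITIALIZED), device index: 4
Error: CUBLAS error at /home/runner/work/FIRESTARTER/FIRESTARTER/src/firestarter/Cuda/Cuda.cpp:311: error code = 1 (CUBLAS_STATUS_NOT_INITIALIZED), device index: 5
Error: CUBLAS error at /home/runner/work/FIRESTARTER/FIRESTARTER/src/firestarter/Cuda/Cuda.cpp:311: error code = 1 (CUBLAS_STATUS_NOT_INITIALIZED), device index: 6
"""

# Matches the status name and the device index at the end of each error line.
ERROR_LINE = re.compile(r"\((CUBLAS_STATUS_\w+)\), device index: (\d+)")


def failing_devices(log_text):
    """Map each device index found in the log to the status names reported for it."""
    failures = {}
    for status, index in ERROR_LINE.findall(log_text):
        failures.setdefault(int(index), set()).add(status)
    return failures


def silent_devices(log_text, device_count):
    """Return the indexes in range(device_count) that have no error line."""
    return sorted(set(range(device_count)) - set(failing_devices(log_text)))


# Assumption: eight devices, indexed 0 to 7.
print(sorted(failing_devices(LOG)))  # [2, 3, 4, 5, 6, 7]
print(silent_devices(LOG, 8))        # [0, 1]
```

Under that zero-based assumption, two indexes (0 and 1) have no error line in the quoted output, not only device 1.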

dasantonym commented 2 years ago

After further review, I don't see how this would be related to FIRESTARTER. It is more likely something in our setup, sorry!

dasantonym commented 2 years ago

Actually, it was not the setup. gpu_burn worked fine and everything looks normal.

rschoene commented 2 years ago

I currently can't reproduce it on our A100 system with CUDA 11.1 installed. We will have a more detailed look at the issue. Sorry for the inconvenience.

rschoene commented 2 years ago

I have now also tried it with CUDA 11.6 on our A100 partition and still cannot reproduce your problem. Could you post the output of the following command?

ldd ./FIRESTARTER_CUDA_11.0

Thanks.
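If the full `ldd` output is long, the relevant part is probably the CUDA-related entries and anything that fails to resolve. The following Python sketch filters for those lines. The function names and the keyword list are assumptions made for illustration, and it is also an assumption that the prebuilt binary links these libraries dynamically; if it links them statically, they will not appear in the `ldd` output at all.

```python
import subprocess

# Assumed substrings of interest: cuBLAS, the CUDA driver/runtime libraries
# ("libcuda" also matches "libcudart"), and unresolved dependencies.
KEYWORDS = ("cublas", "libcuda", "not found")


def interesting_ldd_lines(ldd_output):
    """Keep only the ldd output lines that mention one of KEYWORDS."""
    return [
        line.strip()
        for line in ldd_output.splitlines()
        if any(keyword in line.lower() for keyword in KEYWORDS)
    ]


def cuda_libraries(binary="./FIRESTARTER_CUDA_11.0"):
    """Run ldd on the binary and return the filtered lines."""
    result = subprocess.run(
        ["ldd", binary], capture_output=True, text=True, check=False
    )
    return interesting_ldd_lines(result.stdout)
```

Calling `cuda_libraries()` from the directory that contains the binary returns the filtered lines as a list, which can then be pasted into a reply.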

rschoene commented 4 months ago

@dasantonym Could you check whether this persists? If we do not hear back, we will close the issue.

rschoene commented 2 months ago

Closed due to lack of feedback. This may have been fixed by a later update.