avive closed this issue 3 years ago
Most likely this system has no support for CUDA 11. The build for Ubuntu 18.04 and CUDA 10.2 works. A segmentation fault occurs at program exit and needs further investigation.
Try running nvidia-smi: it returns the driver CUDA version, and I believe it is 11 (latest Nvidia driver). The same system worked fine with earlier versions of the lib. These are latest-generation Nvidia GPUs and they should support CUDA 11. This is the output:
Mon May 24 16:06:40 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 40C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:00:05.0 Off | 0 |
| N/A 43C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
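A note on reading the output above: the "CUDA Version" in the nvidia-smi header is the highest CUDA version the installed driver supports, not the version of the toolkit a library was built against. A minimal sketch for pulling that value out of the header in a script follows; the function name `cuda_version_from_smi` is hypothetical and it only parses the text format shown above.

```shell
# Hypothetical helper: read nvidia-smi output on stdin and print the
# "CUDA Version" value from its header line (first match only).
cuda_version_from_smi() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi | cuda_version_from_smi
fi
```

If the library is built against a newer CUDA toolkit than the version printed here, CUDA devices may not be usable at runtime, which would be consistent with only the CPU provider being listed.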
Looks like the 10GB boot-disk was running out of free space. Increasing disk size.
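For anyone hitting the same thing, a quick way to check free space before reinstalling drivers is sketched below. The helper name `free_kib` and the 1 GiB threshold are assumptions chosen for illustration, not values from this thread.

```shell
# Hypothetical helper: print the free space, in KiB, on the filesystem
# that holds the given path (POSIX df output, "Available" column).
free_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Warn when less than 1 GiB is free on / (arbitrary example threshold).
min_kib=$((1024 * 1024))
if [ "$(free_kib /)" -lt "$min_kib" ]; then
    echo "low disk space on /" >&2
fi
```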
I'm tidying up the packages on this box and will try again.
I reinstalled driver version 460 (recommended for this GPU). nvidia-smi no longer reports errors, and it shows CUDA version 11.2.
Here's what I get (no CUDA providers):
avive@gpu-post-miner-1:~$ cd latest
avive@gpu-post-miner-1:~/latest$ ls
libgpu-setup.so test_app
avive@gpu-post-miner-1:~/latest$ ls -la
total 14160
drwxrwxr-x 2 avive avive 4096 May 23 11:05 .
drwxr-xr-x 12 avive avive 4096 May 23 11:05 ..
-rwxrwxr-x 1 avive avive 14117488 May 23 11:04 libgpu-setup.so
-rwxrwxr-x 1 avive avive 364784 May 23 11:04 test_app
avive@gpu-post-miner-1:~/latest$ ./test_app -l
./test_app: symbol lookup error: ./test_app: undefined symbol: spacemesh_api_logging
avive@gpu-post-miner-1:~/latest$ export LD_LIBRARY_PATH=.
avive@gpu-post-miner-1:~/latest$ ./test_app -l
Available POST compute providers:
0: [CPU] CPU
Segmentation fault (core dumped)
avive@gpu-post-miner-1:~/latest$
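In the session above, the first run fails with a symbol lookup error and the second succeeds once LD_LIBRARY_PATH points at the current directory, which suggests the loader was otherwise resolving a different copy of libgpu-setup.so. As a sketch, the same effect can be had for a single invocation without exporting the variable into the whole shell session; the wrapper name `run_with_lib_dir` is hypothetical.

```shell
# Hypothetical wrapper: run a command with the given directory placed
# first in LD_LIBRARY_PATH, for that one invocation only.
run_with_lib_dir() {
    lib_dir=$1
    shift
    env LD_LIBRARY_PATH="$lib_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" "$@"
}

# Example (not run here): run_with_lib_dir . ./test_app -l
```

Keeping the override scoped to one command avoids other programs started from the same shell picking up the local library by accident.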
On the same system, run make test in /home/avive/pos-server. You can see that the older version of the lib used in the Rust test code sees the 2 GPUs just fine. So this is an issue with recent lib builds.
We need the lib working on Ubuntu 20 and not only on 18.
The Ubuntu 18.04 build works fine on Ubuntu 20, except for the program termination problem.
I downgraded the CUDA version to 11.2. Now the cards are detected.
The new library works okay after the change to use the CUDA 11.2 lib for Ubuntu 20. However, there's still a core dump in the test app. Is this a lib bug or a test app code issue? @AndrewAR2
avive@gpu-post-miner-1:~/test$ ./gpu-setup-test -l
Available POST compute providers:
0: [CUDA] Tesla T4
1: [CUDA] Tesla T4
2: [CPU] CPU
Segmentation fault (core dumped)
This error is due to an old version of libgpu-setup.so in /lib that does not match the current API.
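A stale copy like this can be spotted by listing every place the library exists. The sketch below uses a hypothetical helper, `find_lib_copies`, and an assumed set of directories to check; adjust the list to the system's actual library search path.

```shell
# Hypothetical helper: print the path of every copy of a shared library
# found in the given directories, so stale duplicates are easy to spot.
find_lib_copies() {
    lib_name=$1
    shift
    for dir in "$@"; do
        if [ -f "$dir/$lib_name" ]; then
            printf '%s\n' "$dir/$lib_name"
        fi
    done
}

# Check a few common system directories plus the current directory.
find_lib_copies libgpu-setup.so /lib /usr/lib /usr/local/lib .
```

If more than one path is printed, removing or replacing the outdated copy keeps the test app from loading a library that does not match its API.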
We confirmed this is an issue with the latest lib on Cuda / Ubuntu 20.04 systems.
@AndrewAR2 is this issue fixed in v0.1.17?
Yes!
List the available providers with the test app from these Linux release artifacts: https://github.com/spacemeshos/gpu-post/actions/runs/863441063 on Ubuntu 20.04 with one or more Nvidia GPUs supporting CUDA.
Result: the lib can't see Nvidia GPUs that should be available.
Expected: be able to use the GPUs as providers.
Nvidia Driver Version: 460.73.01, CUDA Version: 11.2
Previous versions of the lib and the test app work on the same system.