spacemeshos / gpu-post

Spacemesh proof of space time gpu optimized setup
GNU General Public License v3.0

Testapp core-dump when listing providers on ubuntu 20.04 (CUDA) #40

Closed avive closed 3 years ago

avive commented 3 years ago

List available providers with the test app from these Linux release artifacts: https://github.com/spacemeshos/gpu-post/actions/runs/863441063 on Ubuntu 20.04 with one or more Nvidia GPUs supporting CUDA.
Result: the lib can't see the Nvidia GPUs that should be available.
Expected: be able to use the GPUs as providers.
Nvidia Driver Version: 460.73.01
CUDA Version: 11.2
Previous versions of the lib and the test app work on the same system.

~/latest$ echo $LD_LIBRARY_PATH
.
ls -la
total 14160
-rwxrwxr-x  1 avive avive 14117488 May 23 11:04 libgpu-setup.so
-rwxrwxr-x  1 avive avive   364784 May 23 11:04 test_app
./test_app -l
Available POST compute providers:
  0: [CPU] CPU
Segmentation fault (core dumped)

AndrewAR2 commented 3 years ago

Most likely this system does not support CUDA 11. The build for Ubuntu 18.04 and CUDA 10.2 works. The segmentation fault occurs at program exit and needs further investigation.

avive commented 3 years ago

Try running nvidia-smi. It returns the driver's CUDA version, and I believe it is 11 with the latest Nvidia driver. The same system worked fine with earlier versions of the lib. These are latest-generation Nvidia GPUs and they should support CUDA 11. This is the output:

Mon May 24 16:06:40 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:00:05.0 Off |                    0 |
| N/A   43C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

avive commented 3 years ago

Looks like the 10GB boot-disk was running out of free space. Increasing disk size.

avive commented 3 years ago

I'm tidying up the packages on this box and will try again.

avive commented 3 years ago

I reinstalled driver version 460 (recommended for this GPU). nvidia-smi no longer reports errors and shows CUDA version 11.2.

Here's what I get, still no CUDA providers:

avive@gpu-post-miner-1:~$ cd latest
avive@gpu-post-miner-1:~/latest$ ls
libgpu-setup.so  test_app
avive@gpu-post-miner-1:~/latest$ ls -la
total 14160
drwxrwxr-x  2 avive avive     4096 May 23 11:05 .
drwxr-xr-x 12 avive avive     4096 May 23 11:05 ..
-rwxrwxr-x  1 avive avive 14117488 May 23 11:04 libgpu-setup.so
-rwxrwxr-x  1 avive avive   364784 May 23 11:04 test_app
avive@gpu-post-miner-1:~/latest$ ./test_app -l
./test_app: symbol lookup error: ./test_app: undefined symbol: spacemesh_api_logging
avive@gpu-post-miner-1:~/latest$ export LD_LIBRARY_PATH=.
avive@gpu-post-miner-1:~/latest$ ./test_app -l
Available POST compute providers:
  0: [CPU] CPU
Segmentation fault (core dumped)
avive@gpu-post-miner-1:~/latest$ 
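
A `symbol lookup error` like the one in the transcript above typically means the loader did find some libgpu-setup.so, just not one that exports the symbol test_app needs. One way to check a given copy is to list its exported dynamic symbols with nm (from binutils). This is only a sketch: the helper name exports_symbol is hypothetical and not part of gpu-post.

```shell
#!/bin/sh
# Hypothetical helper (not part of gpu-post): succeeds if the shared
# library at $1 defines the dynamic symbol named $2.
exports_symbol() {
    nm -D --defined-only "$1" 2>/dev/null |
        awk -v sym="$2" '$NF == sym { found = 1 } END { exit !found }'
}

# Example: check the copy in the current directory.
if exports_symbol ./libgpu-setup.so spacemesh_api_logging; then
    echo "./libgpu-setup.so exports spacemesh_api_logging"
else
    echo "./libgpu-setup.so is missing spacemesh_api_logging (or is not readable)"
fi
```

Running the same check against any other copy of the library on the system shows whether that copy predates the symbol.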
avive commented 3 years ago

On the same system, run make test in /home/avive/pos-server. You can see that the older version of the lib used in the Rust test code sees the two GPUs just fine. So this is an issue with recent lib builds.

avive commented 3 years ago

We need the lib working on Ubuntu 20.04, not only on 18.04.

AndrewAR2 commented 3 years ago

The Ubuntu 18.04 build works fine on Ubuntu 20.04 except for the program termination problem.

AndrewAR2 commented 3 years ago

I downgraded the CUDA version to 11.2. Now the cards are detected.

avive commented 3 years ago

The new library works okay after the change to use the CUDA 11.2 lib for Ubuntu 20.04; however, there's still a core dump in the test app. Is this a lib bug or a test app code issue? @AndrewAR2

avive@gpu-post-miner-1:~/test$ ./gpu-setup-test -l
Available POST compute providers:
  0: [CUDA] Tesla T4
  1: [CUDA] Tesla T4
  2: [CPU] CPU
Segmentation fault (core dumped)

AndrewAR2 commented 3 years ago

This error is due to an old version of libgpu-setup.so in /lib that does not match the current API.
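
One way to spot a stale copy like this is to list every libgpu-setup.so in the directories the loader might search and compare them. A minimal sketch, assuming a typical Linux layout; the helper name find_lib_copies and the directory list are illustrative assumptions, not part of gpu-post:

```shell
#!/bin/sh
# Hypothetical helper (not part of gpu-post): print every copy of the
# library named $1 found in the remaining arguments, in the order given,
# so a stale copy shadowing the new build is easy to spot.
find_lib_copies() {
    lib_name=$1
    shift
    for dir in "$@"; do
        if [ -e "$dir/$lib_name" ]; then
            printf '%s\n' "$dir/$lib_name"
        fi
    done
}

# Example: the directory list is an assumption; adjust it for the system.
find_lib_copies libgpu-setup.so . /lib /usr/lib /usr/local/lib
```

If more than one path is printed, `ldd ./test_app` shows which copy the loader actually resolves for the current LD_LIBRARY_PATH.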

avive commented 3 years ago

We confirmed this is an issue with the latest lib on CUDA / Ubuntu 20.04 systems.

avive commented 3 years ago

@AndrewAR2 is this issue fixed in v0.1.17?

AndrewAR2 commented 3 years ago

Yes!