keivanB opened this issue 4 years ago
Do you have several GPUs? Maybe you will need to add custom options (which are forwarded to the FAHClient binary).
Do you get the nvidia-smi output from inside the container as well?
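For example, a minimal sketch of passing custom options (the image name here is a placeholder for whatever you built or pulled; everything after it is passed through to FAHClient):

```
# Sketch: arguments after the image name are forwarded to the FAHClient binary.
# "foldingathome/fah-gpu" is a placeholder image name.
docker run -d --gpus all foldingathome/fah-gpu --user=Anonymous --team=0 --gpu=true
```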
Yep, I have multiple GPUs, and I can get the nvidia-smi output inside the container as well.
```
15:14:48: GPUs: 4
15:14:48: GPU 0: Bus:10 Slot:0 Func:0 AMD:4 Cedar PRO [Radeon HD 5450]
15:14:48: GPU 1: Bus:11 Slot:0 Func:0 NVIDIA:3 GK110 [Tesla K40m]
15:14:48: GPU 2: Bus:65 Slot:0 Func:0 NVIDIA:3 GK110 [Tesla K40m]
15:14:48: GPU 3: Bus:66 Slot:0 Func:0 NVIDIA:3 GK110 [Tesla K40m]
15:14:48: CUDA Device 0: Platform:0 Device:0 Bus:11 Slot:0 Compute:3.5 Driver:10.2
15:14:48: CUDA Device 1: Platform:0 Device:1 Bus:65 Slot:0 Compute:3.5 Driver:10.2
15:14:48: CUDA Device 2: Platform:0 Device:2 Bus:66 Slot:0 Compute:3.5 Driver:10.2
15:14:48: OpenCL Device 0: Platform:0 Device:0 Bus:11 Slot:0 Compute:1.2 Driver:440.59
15:14:48: OpenCL Device 1: Platform:0 Device:1 Bus:65 Slot:0 Compute:1.2 Driver:440.59
15:14:48: OpenCL Device 2: Platform:0 Device:2 Bus:66 Slot:0 Compute:1.2 Driver:440.59
```
I am testing this image in order to scale it to Chameleon Cloud servers. We have some capacity and are trying to help. The output above is from the local server I am running the tests on, but we have multiple Tesla P100 GPU nodes available, so it would be the same situation: at least two GPUs per node. I would really appreciate your help getting this up and running so I can scale it a little bit.
From inside the Docker image:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 440.59       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K40m          Off  | 00000000:0B:00.0 Off |                    0 |
| N/A   42C    P0    66W / 235W |     96MiB / 11441MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40m          Off  | 00000000:41:00.0 Off |                    0 |
| N/A   27C    P8    19W / 235W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K40m          Off  | 00000000:42:00.0 Off |                    0 |
| N/A   31C    P8    21W / 235W |     11MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
```
Can you try to run foldingathome directly on your system, without containers? If so, does it pick up all GPUs?
Another idea would be to use several containers, and let each one only work on one GPU. So you would start the first one with parameter "--opencl-index 0", the second with "--opencl-index 1" etc.
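A sketch of that approach, assuming all GPUs stay exposed to each container (the image name is again a placeholder):

```
# Sketch: three containers share the GPUs, but each FAHClient instance is
# pinned to a different OpenCL device via --opencl-index.
# "foldingathome/fah-gpu" is a placeholder image name.
docker run -d --gpus all --name fah0 foldingathome/fah-gpu --opencl-index 0
docker run -d --gpus all --name fah1 foldingathome/fah-gpu --opencl-index 1
docker run -d --gpus all --name fah2 foldingathome/fah-gpu --opencl-index 2
```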
Finally, maybe that Radeon card is disturbing it. I guess it's not usable from inside the container? That might need some other options or a customised config.xml. For testing, you can always edit (or docker cp) the config.xml file and adapt the "slots", then restart the container.
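A rough sketch of that edit cycle ("fah" and the in-container path are assumptions; adjust them to your container name and image layout):

```
# Sketch: copy the config out, adapt the <slot> entries, copy it back, restart.
docker cp fah:/fah/config.xml .
# ...edit config.xml, e.g. give the GPU slot an explicit <opencl-index v='0'/>...
docker cp config.xml fah:/fah/config.xml
docker restart fah
```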
Sorry, my personal experience with foldingathome is very limited! After a brief try on a home PC in the '00s, I switched to BOINC and fully open-source projects. It's only now that I have come back to it, and the first thing I did was put it in a container.
I am testing the image on a device with an NVIDIA P100. The system has the NVIDIA driver installed and I get the nvidia-smi output, but Docker seems to fail to work properly with the GPU:
```
16:48:19:ERROR:WU02:FS02:Failed to start core: OpenCL device matching slot 2 not found, try setting 'opencl-index' manually
```
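One way to narrow this down is to list the OpenCL devices the container actually sees (a sketch; "fah" is a placeholder container name, and clinfo may need to be installed in the image first):

```
# Sketch: list OpenCL platforms/devices visible from inside the container.
docker exec -it fah clinfo -l
```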