nanovms / ops

ops - build and run nanos unikernels
https://ops.city
MIT License

GPU integration with nanos #1621

Open zeroecco opened 1 month ago

zeroecco commented 1 month ago

Hello!

I am attempting to get a working unikernel on my workstation (following this blog post: https://nanovms.com/dev/tutorials/gpu-accelerated-computing-nanos-unikernels), but I am running into a number of hurdles that I thought I should document and ask for assistance on:

Here is the current output:

ops run -c ops.config main
running local instance
booting /root/.ops/images/main ...
Invalid GPU type 'nvidia-tesla-t4'
cat ops.config
{
  "RunConfig": {
    "GPUs": 1,
    "GPUType": "nvidia-tesla-t4"
  },
  "Klibs": ["gpu_nvidia"],
  "Dirs": ["nvidia"]
}
eyberg commented 1 month ago

that article was written a while ago

are you trying to run this locally or in the cloud? If local, there is additional work required to use it locally: https://github.com/nanovms/ops/pull/1528. The older article you linked was specific to GCP (we have an outstanding task to document the on-prem setup: https://github.com/nanovms/ops-documentation/issues/430).

francescolavra commented 1 month ago
* First thing I came across: I cannot build the klib on main. I worked around this by checking out nanos 0.1.50.

There has indeed been a recent change (in https://github.com/nanovms/nanos/pull/2011) to the nanos interrupt API, and the nvidia klib hasn't been updated yet to adapt to that change. If you want, you can check out the kernel version prior to that PR and build the klib against that. Also, please note that in order to build the klib you have to build nanos itself first.
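(As a rough sketch, assuming the usual make-based builds: check out the nanos commit preceding that PR, run make in the nanos tree, and then build the klib in the gpu-nvidia tree against that kernel source.)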

* Second thing: The gpu repo was updated to support nvidia driver version 535, but there are two bin files (gsp_ga10x.bin and gsp_tu10x.bin). I copied both, but I'm not sure that was the right choice.

Copying both is fine: the driver will pick the right one depending on which GPU type it detects.

* Fourth thing, and where I am currently stuck: ops bombs out immediately saying invalid GPU type. Not sure where to look to figure out what I am doing wrong. Any debugging steps I should take from here?

The only "GPUType" you can set in the config when running locally is "pci-passthrough" (but you can just omit the "GPUType" option altogether, since pci-passthrough is the default setting). This will detect the GPU(s) connected to the PCI bus of your machine, and should work with any supported Nvidia GPU type.
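For example, a minimal local config along these lines (a sketch derived from the config above, with the "GPUType" entry omitted so the pci-passthrough default applies):

{
  "RunConfig": {
    "GPUs": 1
  },
  "Klibs": ["gpu_nvidia"],
  "Dirs": ["nvidia"]
}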

rinor commented 1 month ago

last time I checked the build, the only change I made was:

diff --git a/kernel-open/nvidia/nv-msi.c b/kernel-open/nvidia/nv-msi.c
index 020ef53..a0c2be9 100644
--- a/kernel-open/nvidia/nv-msi.c
+++ b/kernel-open/nvidia/nv-msi.c
@@ -55,7 +55,8 @@ void NV_API_CALL nv_init_msi(nv_state_t *nv)
         }
         else
         {
-            msi_format(&address, &data, nv->interrupt_line);
+            u32 target_cpu = irq_get_target_cpu(irange(0, 0));
+            msi_format(&address, &data, nv->interrupt_line, target_cpu);
             pci_cfgwrite(dev, cp + 4, 4, address);    /* address low */
             pci_cfgwrite(dev, cp + 8, 4, 0);          /* address high */
             pci_cfgwrite(dev, cp + 12, 4, data);      /* data */

can't confirm that it is correct, just that it builds fine with the latest nanos.

francescolavra commented 1 month ago

Yes, that is a correct change. Thanks

zeroecco commented 1 month ago

thanks for all this feedback! I will try it and let you know ASAP

francescolavra commented 1 month ago

https://github.com/nanovms/gpu-nvidia/pull/5 has been merged in our gpu-nvidia repository, so the klib now builds successfully against the master branch of nanos.

zeroecco commented 1 month ago

closer:

root@north:~/r0uk# ops run -c ops.config main
running local instance
booting /root/.ops/images/main ...
en1: assigned 10.0.2.15
NVRM _sysCreateOs: RM Access Sys Cap creation failed: 0x56
NVRM cpuidInfoAMD: Unrecognized AMD processor in cpuidInfoAMD
NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  535.113.01  Release Build  (root@north)  Mon May 13 02:04:21 AM UTC 2024
Loaded the UVM driver, major device number 0.
2024/05/15 17:50:16 Listening...on 8080
en1: assigned FE80::30A6:AEFF:FE3E:B03D
^Cqemu-system-x86_64: terminating on signal 2
signal: killed
root@north:~/r0uk#
francescolavra commented 1 month ago

As written in the tutorial, the line "Loaded the UVM driver, major device number 0" indicates that the GPU klib was loaded successfully, and the GPU attached to your instance is available for your application to use. Are you facing any issues?

zeroecco commented 1 month ago

not anymore on the nightly, thanks for your guidance

0x5459 commented 3 weeks ago

I am getting the following error (GeForce RTX 3080):

en1: assigned 10.0.2.15
NVRM _sysCreateOs: RM Access Sys Cap creation failed: 0x56
NVRM: failed to register character device.
klib automatic load failed (4)
francescolavra commented 3 weeks ago

The above error means the klib failed to create the /dev/nvidiactl file, which is used by the userspace nvidia drivers to interface with the GPU. @0x5459 is there anything already at that path in the image you are using? How are you starting the Nanos instance? If you are using Ops, can you share your command line and your json configuration file?

0x5459 commented 3 weeks ago

I suspect that the inconsistency between my CUDA version and driver version is causing the issue. My program is compiled with CUDA 11. Now, I am trying to install CUDA 12. I will reply here with any updates.

My config:

{
  "RebootOnExit": true,
  "ManifestPassthrough": {
    "readonly_rootfs": "true"
  },
  "Env": {
    "RUST_BACKTRACE": "1",
    "RUST_LOG": "debug",
  },
  "Program": "c2-test",
  "KlibDir": "/root/code/gpu-nvidia/kernel-open/_out/Nanos_x86_64",
  "Klibs": ["gpu_nvidia"],
  "Dirs": ["nvidia"],
  "Mounts": {
    "/root/dataset": "/dataset"
  },
  "RunConfig": {
    "CPUs": 32,
    "Memory": "64g",
    "GPUs": 1
  }
}
0x5459 commented 3 weeks ago

I have tried to compile my program with CUDA 12.2 but still get the same error. Could you give me some help? @francescolavra

francescolavra commented 3 weeks ago

The problem is that your root filesystem is being configured as read-only (via the "readonly_rootfs": "true" option in your config). This prevents the klib from creating the /dev/nvidiactl file, and that causes the "failed to register character device" error.
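For example, the earlier config with the read-only option removed (a sketch; everything else can stay as it was):

{
  "RebootOnExit": true,
  "Env": {
    "RUST_BACKTRACE": "1",
    "RUST_LOG": "debug"
  },
  "Program": "c2-test",
  "KlibDir": "/root/code/gpu-nvidia/kernel-open/_out/Nanos_x86_64",
  "Klibs": ["gpu_nvidia"],
  "Dirs": ["nvidia"],
  "Mounts": {
    "/root/dataset": "/dataset"
  },
  "RunConfig": {
    "CPUs": 32,
    "Memory": "64g",
    "GPUs": 1
  }
}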

0x5459 commented 3 weeks ago

@francescolavra Hi, I have a new issue.

I built the deviceQuery program from cuda-samples using the following config:

{
  "Program": "deviceQuery",
  "KlibDir": "/root/code/gpu-nvidia/kernel-open/_out/Nanos_x86_64",
  "Klibs": ["gpu_nvidia"],
  "Dirs": ["nvidia"],
  "RunConfig": {
    "GPUs": 1
  }
}

But I got the error below:

$ ops instance logs test

en1: assigned 10.0.2.15
NVRM _sysCreateOs: RM Access Sys Cap creation failed: 0x56
NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  535.113.01  Release Build  (root@ipfs)  Tue Jun  4 01:42:09 PM CST 2024
Loaded the UVM driver, major device number 0.
deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 304
-> OS call failed or operation not supported on this OS
Result = FAIL
$ nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

Could you please provide guidance on how to resolve this issue? Thank you very much for your time and help.

francescolavra commented 3 weeks ago

The above error from cudaGetDeviceCount() may be due to missing or mismatching CUDA libraries in your image. You can get some clues as to what it's failing on by enabling tracing in the kernel, i.e. adding the --trace option to your ops run command line. The trace output will likely show the cause of the failure. Also, to verify that you have all CUDA libraries set up correctly in your host, you could run the deviceQuery program directly in the host (assuming you are on Linux and are using a GPU attached to your host) and see if it can query your GPU correctly.
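For example, something like: ops run deviceQuery -c config.json --trace (with config.json being whatever your config file is named). The trace output is verbose, so redirecting it to a file makes it easier to inspect.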

0x5459 commented 3 weeks ago

Sorry, I have tried enabling tracing, but I am still unable to determine the cause of the issue. :disappointed: Trace log: https://github.com/0x5459/gpu_integration_with_nanos/blob/main/nanos_trace.log

I have created a repository to store all of my test files. Could you please provide guidance on how to resolve this issue when you are free? @francescolavra

Also, to verify that you have all CUDA libraries set up correctly in your host, you could run the deviceQuery program directly in the host (assuming you are on Linux and are using a GPU attached to your host) and see if it can query your GPU correctly.

I ran the deviceQuery program on the host. It works.

francescolavra commented 3 weeks ago

Thanks for providing details on your test environment. I see that you are using the Nvidia driver version 550.54.15; to avoid compatibility issues, you should use the same driver version as the version from which the Nanos klib is derived, which is 535.113.01 and can be downloaded at https://www.nvidia.com/download/driverResults.aspx/211711/en-us/. More specifically, the /lib/x86_64-linux-gnu/libcuda.so.1 file you put in the Nanos image should be the same as the libcuda.so.535.113.01 file you can find in the Nvidia Linux driver package.
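For example (a sketch, assuming the standard .run driver package layout): extract the package with sh NVIDIA-Linux-x86_64-535.113.01.run --extract-only, then copy the libcuda.so.535.113.01 file it contains into the directory tree you bundle into the image, as /lib/x86_64-linux-gnu/libcuda.so.1.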

leeyiding commented 1 week ago

Hello, I also failed to run deviceQuery compiled under the 535.113.01 driver and CUDA 12.0. The log is as follows:

ops run deviceQuery -c config.json -n

running local instance
booting /root/.ops/images/deviceQuery ...
en1: assigned 10.0.2.15
NVRM _sysCreateOs: RM Access Sys Cap creation failed: 0x56
NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  535.113.01  Release Build  (circleci@02d850dae0db)  Fri Jun 21 02:11:26 AM UTC 2024
Loaded the UVM driver, major device number 0.
deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL

For detailed trace logs, see https://github.com/leeyiding/nanos_cuda_deviceQuery/blob/main/trace.log

francescolavra commented 6 days ago

@leeyiding I suggest you first try running the pre-built binaries from the CUDA demo suite (in the CUDA v12.2 toolkit you can find them in the cuda_demo_suite/extras/demo_suite/ folder), among which there is the deviceQuery program. The current version of the Nanos GPU klib has been tested successfully with the pre-built CUDA v12.2 deviceQuery binary (ensure you have the latest source of the klib, as there has been a recent fix in https://github.com/nanovms/gpu-nvidia/pull/7). Example output from deviceQuery when run on a GCP instance equipped with a Tesla T4 GPU:

en1: assigned 10.240.0.106
NVRM _sysCreateOs: RM Access Sys Cap creation failed: 0x56
NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  535.113.01  Release Build  (francesco@debian)  Fri 21 Jun 2024 08:04:52 PM CEST
Loaded the UVM driver, major device number 0.
device-query Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

en1: assigned FE80::4001:AFF:FEF0:6A
Detected 1 CUDA Capable device(s)

Device 0: "Tesla T4"
  CUDA Driver Version / Runtime Version          12.2 / 12.2
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 14931 MBytes (15655829504 bytes)
  (40) Multiprocessors, ( 64) CUDA Cores/MP:     2560 CUDA Cores
  GPU Max Clock rate:                            1590 MHz (1.59 GHz)
  Memory Clock rate:                             5001 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 3 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 4
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.2, CUDA Runtime Version = 12.2, NumDevs = 1, Device0 = Tesla T4
Result = PASS
leeyiding commented 5 days ago

Thank you very much. With the pre-built CUDA demo suite binaries and the latest klib, I have successfully run deviceQuery on the nightly version.