sylabs / singularity

SingularityCE is the Community Edition of Singularity, an open source container platform designed to be simple, fast, and secure.
https://sylabs.io/docs/

[feature request] Support Intel GPU #1094

Closed. houyushan closed this issue 5 months ago.

houyushan commented 2 years ago

Hello, I would like to use an Intel GPU (XPU).

My product needs Singularity/Apptainer to work with the Intel XPU (Intel's latest GPU for AI training). Is there an option like --nv to support it at present, or will future versions provide a similar option? If so, when?

dtrudg commented 2 years ago

Hi @houyushan - we'll have to look into this a little more. It appears from looking at relevant Intel container images that the intention is to have libraries in the container, and make the /dev/dri devices available in the container.

This means that you should be able to run without any additional options, unless you are using --contain / --containall, in which case you will have to add -B /dev/dri to ensure the devices are available.
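
For illustration, a minimal sketch of the two cases (the image name here is just a placeholder):

# Default native mode: the device nodes under /dev/dri are reachable,
# so the Intel runtime libraries in the image can find the GPU.
singularity exec oneapi.sif sycl-ls

# With --contain / --containall, bind /dev/dri explicitly so the
# device nodes are visible inside the container.
singularity exec --containall -B /dev/dri oneapi.sif sycl-ls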

houyushan commented 2 years ago

Okay, thank you.

I will continue to research and test on my side.

elezar commented 1 year ago

Just a note: assuming CDI support in OCI mode, a CDI spec generated for the Intel devices would allow them to be injected.

See #813

dtrudg commented 1 year ago

@elezar - yep, thanks. This was a hope at the back of my mind :-)

pzehner commented 8 months ago

Hello, any news on this issue?

Using SingularityCE version 4.0.2 with an Intel GPU Max 1550, I don't have access to the GPU, even though the card is listed in /dev/dri.

dtrudg commented 8 months ago

As mentioned in a comment above, Singularity's OCI mode supports CDI (Container Device Interface) configuration for access to GPUs, which would include Intel GPUs if a CDI configuration is available.

With regard to adding a direct Intel GPU flag for the default native (non-OCI) mode: generally, adding this kind of hardware-specific support to SingularityCE depends on either:

  1. The vendor, or a user, contributing the functionality as a pull request that they will also be able to assist with maintaining.
  2. The vendor, or a 3rd party, providing us (as a project) with access to the relevant GPU hardware on an ongoing basis so that we can develop and maintain the requested functionality.

NVIDIA GPU support comes under (2), as we have had significant contributions from NVIDIA, and it is also trivial to access Tesla GPUs at reasonable cost via public cloud providers.

What we wish to avoid, when adding Intel GPU support, is the situation we find ourselves in with AMD GPUs / ROCm. The lack of access to data center AMD GPUs (capable of running latest ROCm) in the cloud, or by other means, makes maintaining ROCm support difficult / costly.

If you are able to, we would suggest that you indicate to Intel that support integrated into SingularityCE is important to you.

Without access to hardware, the minimum information required for us to add an experimental flag, without commitment that it will be well maintained, would be:

elezar commented 8 months ago

I would strongly recommend following the CDI route here instead of relying on vendor-specific logic in Singularity. If effort is to be spent, I would recommend adding (experimental) CDI support to the native mode of Singularity (see #1395) if support is required there instead.

@kad do you have any visibility on the generation of CDI specification for Intel devices?

kad commented 8 months ago

I don't, but @byako and @tkatila would be good candidates to chime in here.

pzehner commented 8 months ago

I checked the OCI mode and CDI, but I cannot access the GPU out of the box. I guess I should point to a CDI file with --device. The documentation states that the usual lookup directories are /etc/cdi and /var/run/cdi, but neither of them exists. I tried guessing intel.com/gpu=all, but it was obviously incorrect.

It would be nice to have more documentation about this.

byako commented 8 months ago

The CDI specs are generated automatically at the moment only by the kubelet-plugin part of the DRA resource-driver.

If you don't need dynamic creation of the specs, it's possible to create them manually; they are quite simple.

There is a chance, however, that they will need to be fixed after a reboot if you have multiple different GPUs, or if you have an integrated GPU that also gets enabled in DRM, because the DRM device indexes are not persistent across reboots. For instance, /dev/dri/card0 can become card1, and card1 might become card0.

byako commented 8 months ago

Here's an example of a CDI spec (sudo cat /etc/cdi/intel.com-gpu.yaml):

cdiVersion: 0.5.0
containerEdits: {}
devices:
- containerEdits:
    deviceNodes:
    - path: /dev/dri/card1
      type: c
    - path: /dev/dri/renderD129
      type: c
  name: 0000:03:00.0-0x56a0
- containerEdits:
    deviceNodes:
    - path: /dev/dri/card0
      type: c
    - path: /dev/dri/renderD128
      type: c
  name: 0000:00:02.0-0x4680
kind: intel.com/gpu

The name field can be somewhat arbitrary, albeit with spelling restrictions. If you just create the /etc/cdi folder and paste the contents of the above snippet into a file inside that folder, it should work, given that your runtime supports CDI.

sudo mkdir /etc/cdi
sudo vim /etc/cdi/mygpus.yaml

then pass --device intel.com/gpu=0000:03:00.0-0x56a0
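
For example, a full run using that spec (the image name here is just a placeholder) would look like:

singularity run --oci --device intel.com/gpu=0000:03:00.0-0x56a0 mycontainer.sif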

pzehner commented 8 months ago

I see. Is there a way to get these configuration files without writing them by hand? When I googled "intel gpu container device interface," I couldn't find anything like that. How is the user supposed to know this?

byako commented 8 months ago

> Hello, any news on this issue?
>
> Using SingularityCE version 4.0.2 with an Intel GPU Max 1550, I don't have access to the GPU, even though the card is listed in /dev/dri.

Could you please add more details about this case: what was the command line you used with what options?

pzehner commented 8 months ago

My bad, I missed one of your answers. Hmm, I'm not sure I understand this line:

> The CDI specs are generated automatically at the moment only by the kubelet-plugin part of the DRA resource-driver.

Should I install Kubernetes as well? Noob question here.

> Could you please add more details about this case: what was the command line you used with what options?

In my case, I have a machine with four Intel GPU Max 1550 cards, and I want to run code within an Intel oneAPI image. For the demonstration, I just use sycl-ls to list the SYCL-compatible devices (note that I'm not using the manual CDI file yet):

$ singularity run --oci docker://intel/oneapi-basekit:2024.0.1-devel-ubuntu20.04 sycl-ls   
Getting image source signatures
Copying blob 521f275cc58b done   | 
Copying blob 565c40052dc3 done   | 
Copying blob afcec6bc5983 done   | 
Copying blob 93b1720de081 done   | 
Copying blob bcd9c7c8e2dd done   | 
Copying blob 3c86603e9f04 done   | 
Copying blob 45a1c23aa4e7 done   | 
Copying config ba41f6c638 done   | 
Writing manifest to image destination
INFO:    Converting OCI image to OCI-SIF format
INFO:    Squashing image to single layer
INFO:    Writing OCI-SIF image
INFO:    Cleaning up.
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Xeon(R) CPU Max 9460 OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]

As you can see, only the CPU is detected. This is what I should see:

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Xeon(R) CPU Max 9460 OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [23.22.26516.34]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [23.22.26516.34]
[opencl:gpu:4] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [23.22.26516.34]
[opencl:gpu:5] Intel(R) OpenCL Graphics, Intel(R) Data Center GPU Max 1550 OpenCL 3.0 NEO  [23.22.26516.34]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26516]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26516]
[ext_oneapi_level_zero:gpu:2] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26516]
[ext_oneapi_level_zero:gpu:3] Intel(R) Level-Zero, Intel(R) Data Center GPU Max 1550 1.3 [1.3.26516]
byako commented 8 months ago

There is no need to install Kubernetes; I meant that automated generation of the CDI specs is at the moment available only in K8s.

Once you have created the /etc/cdi dir and saved the yaml file into it, the devices described in that yaml can be used by singularity.

You have to use the --device parameter in the command as I mentioned above; that will tell singularity to use the device that it finds in the CDI spec. See https://docs.sylabs.io/guides/latest/user-guide/oci_runtime.html#sec-cdi.

The yaml file I quoted above is just an example. Check the DRM index of the GPU, for instance with ls -al /dev/dri/by-path/, and see which /dev/dri/cardX is linked to the Max 1550. You can see which PCI device the Max 1550 is by running lspci | grep Display. When you know which /dev/dri/cardX is the Max 1550, use that in /etc/cdi/mygpus.yaml. The renderD node is not needed for the Max 1550, only cardX.
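
For example, an illustrative (trimmed) listing showing how the mapping works; the PCI address and card index here are made up:

$ lspci | grep -i display
29:00.0 Display controller: Intel Corporation Data Center GPU Max 1550
$ ls -l /dev/dri/by-path/
pci-0000:29:00.0-card -> ../card1
pci-0000:29:00.0-render -> ../renderD128

Here /dev/dri/card1 is the Max 1550, so that is the path to put in /etc/cdi/mygpus.yaml.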

We'll work on finding a way to generate CDI specs, or at least on documenting it.

pzehner commented 8 months ago

Ok, I see. I think it would be nice to have a better way to generate these CDI specs. The logic from the Kubernetes plugin could be extracted.

If I'm not mistaken, you can deduce them completely from the structure in /dev/dri, right?
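
For what it's worth, a rough sketch of what such a generator could look like under that assumption (untested; it emits an entry for every DRM card node found under /dev/dri/by-path, without filtering by vendor, and uses the PCI address as the device name):

#!/bin/bash
# Build a CDI spec from the /dev/dri/by-path symlinks and print it to stdout.
set -euo pipefail
shopt -s nullglob

echo "cdiVersion: 0.5.0"
echo "kind: intel.com/gpu"
echo "containerEdits: {}"
echo "devices:"
for link in /dev/dri/by-path/pci-*-card; do
  name=$(basename "$link")      # e.g. pci-0000:29:00.0-card
  name=${name#pci-}             # strip the leading "pci-"
  name=${name%-card}            # strip the trailing "-card" -> 0000:29:00.0
  node=$(readlink -f "$link")   # e.g. /dev/dri/card1
  echo "- name: $name"
  echo "  containerEdits:"
  echo "    deviceNodes:"
  echo "    - path: $node"
  echo "      type: c"
done

Redirecting the output to a file under /etc/cdi would then make each device addressable with --device intel.com/gpu=<pci-address>.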

pzehner commented 8 months ago

So, I tried with the example CDI spec file adapted to my hardware, but the GPU is still not visible from within the container:

$ singularity run --oci --device intel.com/gpu=0000:29:00.0 docker://intel/oneapi-basekit:2024.0.1-devel-ubuntu20.04 sycl-ls 
INFO:    Using cached OCI-SIF image
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Xeon(R) CPU Max 9460 OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]

The CDI spec I used looks like:

cdiVersion: 0.5.0
containerEdits: {}
devices:
- containerEdits:
    deviceNodes:
    - path: /dev/dri/card1
      type: c
    - path: /dev/dri/renderD128
      type: c
  name: 0000:29:00.0
...
kind: intel.com/gpu
tkatila commented 8 months ago

@pzehner can you check whether /dev/dri/ has card and renderD devices? If they are there, it might be an access rights issue with the actual devices.

pzehner commented 8 months ago

Yes, I have the correct devices listed in /dev/dri, and I can access them outside of the container.

tkatila commented 8 months ago

Roger. I downloaded the same image and tried it within docker; sycl-ls didn't list GPUs for me either. I'll try to understand what is going on with it.

tkatila commented 8 months ago

I don't exactly know why sycl-ls doesn't detect the GPUs. What I did notice is that the 2024.0.1-devel-ubuntu22.04 version does detect them. Comparing the images didn't reveal anything obvious, nor could I make the 20.04 variant functional by installing packages.

I'd use the 22.04 variant as a workaround, if that suits you.

pzehner commented 7 months ago

I think using an up-to-date image is acceptable.

dtrudg commented 5 months ago

Closing this issue. CDI support is available in --oci mode, and appears to work with the correct image.

Support for Intel GPUs in native mode would come via #1395; however, this is not firmly on the development roadmap at this time.