moby / swarmkit

A toolkit for orchestrating distributed systems at any scale. It includes primitives for node discovery, raft-based consensus, task scheduling and more.

Add support for devices with "service create" #1244

Open flx42 opened 8 years ago

flx42 commented 8 years ago

Initially reported: https://github.com/docker/docker/issues/24865, but I realized it actually belongs here. Feel free to close the other one if you want. Content of the original issue copied below.

Related: #1030

Currently, it's not possible to add devices with docker service create; there is no equivalent of docker run --device=/dev/foo.

I'm one of the authors of nvidia-docker, along with @3XX0, and we need to add device files (the GPUs) and volumes to the starting containers in order to enable GPU apps as services. See the discussion here: https://github.com/docker/docker/issues/23917#issuecomment-233670078 (summarized below).

We figured out how to add a volume provided by a volume plugin:

$ docker service create --mount type=volume,source=nvidia_driver_367.35,target=/usr/local/nvidia,volume-driver=nvidia-docker [...]

But there is no solution for devices. @cpuguy83 and @justincormack suggested using --mount type=bind, but it doesn't seem to work; it's probably like doing a mknod without the proper device cgroup whitelisting.

$ docker service create --mount type=bind,source=/dev/nvidiactl,target=/dev/nvidiactl ubuntu:14.04 sh -c 'echo foo > /dev/nvidiactl'
$ docker logs stupefied_kilby.1.2445ld28x6ooo0rjns26ezsfg
sh: 1: cannot create /dev/nvidiactl: Operation not permitted

It's probably equivalent to this:

$ docker run -ti ubuntu:14.04                      
root@76d4bb08b07c:/# mknod -m 666 /dev/nvidiactl c 195 255
root@76d4bb08b07c:/# echo foo > /dev/nvidiactl
bash: /dev/nvidiactl: Operation not permitted

Whereas the following works (the invalid-argument error is expected; the point is that there is no permission error):

$ docker run -ti --device /dev/nvidiactl ubuntu:14.04
root@ea53a1b96226:/# echo foo > /dev/nvidiactl
bash: echo: write error: Invalid argument
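
For reference: on a cgroup v1 host, --device roughly amounts to creating the device node and whitelisting its major:minor numbers in the container's device cgroup, which a plain bind mount does not do. A minimal sketch of that idea, assuming the cgroupfs driver and a placeholder container ID:

# inside the container: create the node (not sufficient by itself, as shown above)
$ mknod -m 666 /dev/nvidiactl c 195 255
# on the host: allow char device 195:255 in the container's device cgroup (cgroup v1)
$ echo 'c 195:255 rwm' | sudo tee /sys/fs/cgroup/devices/docker/<container-id>/devices.allow
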
thandal commented 3 years ago

In the vein of not-too-ugly workarounds, see also https://docs.nuvla.io/nuvla/advanced-usage/compose-options.html

allfro commented 3 years ago

No offence @thandal, but that's a really ugly workaround 😂. It unnecessarily exposes the Docker socket to a container. I'm not comfortable doing that 😬

cjdcordeiro commented 3 years ago

I'm a bit biased here 😛, but exposing the Docker socket is actually safe in many cases, and in fact many mainstream, de facto tools use it (cAdvisor, Traefik, etc.). The thing is that it can be dangerous, so it has acquired a bad reputation over time, even though people don't really understand how it can be dangerous. In the example posted by @thandal, I'd agree that it's not the preferred solution, but when it comes to security, it all boils down to the nature of the container you are deploying. The one I wrote in the Nuvla docs complies with the following:

So in this regard, it is safe.

Now, obviously, if you're building a web application, in a multi-tenant infrastructure, then yes, I'd agree with @allfro and you should avoid exposing it.

allfro commented 3 years ago

We NEED device mapping for swarms. I'd hate to switch over to Kubernetes for something as trivial as mapping common devices such as /dev/tun across a cluster. We beg you Docker!

cpuguy83 commented 3 years ago

Maybe stop begging someone else to write features you need? That is why there is exactly one person working on this repo... in their spare time.

allfro commented 3 years ago

@cpuguy83 isn't swarmkit developed by Docker Inc., and also commercially sold as part of Docker EE?

cpuguy83 commented 3 years ago

@allfro No. Docker sold off the EE stuff to Mirantis... but even before then Swarmkit had very little support.

mdegans commented 2 years ago

The docs say to use:

deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]

But unsurprisingly this is broken in swarm.

Time to move to k8s.

mdegans commented 2 years ago

Maybe stop begging someone else to write features you need? That is why there is exactly one person working on this repo... in their spare time.

Maybe they can maintain their stuff? How long has GPU support been broken in swarm?

prologic commented 2 years ago

I know this is a 6 year old issue, but is there actually an open PR for this that just needs a bit of attention? Maybe I could help finish the code required to support this? 🤔

Stefan592 commented 2 years ago

This solution does not work on Debian 11 Bullseye. Is there a new workaround for it?

https://github.com/docker/swarmkit/issues/1244#issuecomment-285935430

MohammedNoureldin commented 2 years ago

Hey @allfro, have you found a solution? I need exactly the same thing as you (a tun device). Did you switch to another solution, or have you figured out a workaround?

radeksh commented 2 years ago

How can I help to finish that feature?

pjalusic commented 2 years ago

I really like the workaround from @BretFisher in https://github.com/moby/swarmkit/issues/1244#issuecomment-394343097, and here is how I adapted it for nodes that require a device:

Putting it all together, your services will have to change from this:

services:
  my-service-starter:
    image: docker
    command: 'docker run --name <name> --device /dev/bus/usb -e TOKEN=1234 -p 5000:5000 <image>'
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    deploy:
      placement:
        constraints:
          - node.labels.device_required == true

to this:

services:
  my-service-handler:
    image: docker
    command: 'docker-compose -f /docker-compose.yml up'
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /home/ubuntu/docker-compose.yml:/docker-compose.yml
    deploy:
      placement:
        constraints:
          - node.labels.device_required == true

networks:
  default:
    name: my_network
    driver: overlay
    attachable: true

(on the manager) and

services:
  <name>:
    image: <image>
    restart: always
    container_name: <name>
    devices:
      - /dev/bus/usb
    environment:
      - TOKEN=1234
    ports:
      - 5000:5000

networks:
  default:
    name: my_network
    external: true

(/home/ubuntu/docker-compose.yml on nodes that require a device)
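
A hypothetical deployment of the above (names are placeholders): label the node that has the device, place the second compose file on that node, then deploy the outer stack from a manager:

$ docker node update --label-add device_required=true <worker-node>
# copy the second file to /home/ubuntu/docker-compose.yml on that node, then:
$ docker stack deploy -c stack.yml <stack-name>   # stack.yml = the first file above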

bighb69738 commented 2 years ago

Hi @pjalusic ,

services:
  <name>:
    image: <image>
    restart: always
    container_name: <name>
    devices:
      - /dev/bus/usb
    environment:
      - TOKEN=1234
    ports:
      - 5000:5000

networks:
  default:
    name: my_network
    external: true

But on the worker node, I need to depend on another service from the manager node. Could you give me an example of the worker node's docker-compose.yml with a "depends_on" entry added?

allfro commented 1 year ago

I developed a plugin in the end that allows me to map devices to containers: https://github.com/allfro/device-volume-driver. Hope it helps others. Unfortunately, it only works on systems that use cgroup v1 (e.g. Alpine). I'm looking for some help adding cgroup v2 support to the plugin. It works really well, and I've used it to containerize X11 desktops that require access to FUSE and the VMware graphics devices.

cc: @MohammedNoureldin

zikaeroh commented 1 year ago

After planning to redo my home server setup with swarm (so I can have multiple nodes), I discovered that this wasn't supported, and I needed it for VAAPI.

After looking through things, it seemed to me like this was a plumbing (and developer-hour) problem. Basing things on a previous PR series that added ulimit support to swarm, here is a chain of PRs that adds devices in the most boring way: just plumbing them through the API as-is, with no special management. Just what Docker already supports outside of swarm.

I'm sure I've missed something, and I don't quite know how to get everything building together to test this (I typically run things from my package manager's installed docker), but maybe someone is willing to try the above out.

vadd98 commented 1 year ago

Hi, I'm trying the workaround in https://github.com/moby/swarmkit/issues/1244#issuecomment-1178706059 and it indeed works, but when I remove the stack the handler is successfully removed, while the privileged container from docker-compose.yml continues running and has to be killed manually using docker kill.

Any idea on what could be the issue?

coltonbh commented 1 year ago

The docs say to use:

deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]

But unsurprisingly this is broken in swarm.

Time to move to k8s.

These docs are referencing the API for docker compose, not swarm (services or stacks). The correct (and functioning) API for a stack is:

services:
  my-gpu-service:
    ...
    deploy:
      resources:
        reservations:
          generic_resources:
            - discrete_resource_spec:
                kind: "NVIDIA-GPU"
                value: 2

This works if you've registered your GPUs in the /etc/docker/daemon.json file.
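
For reference, registering the GPUs means advertising them as node generic resources in the daemon configuration; a sketch with a placeholder GPU UUID (the usual write-ups also uncomment the swarm-resource setting in /etc/nvidia-container-runtime/config.toml):

$ cat /etc/docker/daemon.json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia",
  "node-generic-resources": [
    "NVIDIA-GPU=GPU-<uuid>"
  ]
}
$ sudo systemctl restart docker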

For anyone looking for device support for NVIDIA GPUs with Swarm, I did a quick write-up here summarizing two solutions. My write-up was heavily inspired by the original gist I found on the subject here.

zeppmg commented 1 year ago

Hello, thanks for the tip. However, in my case it doesn't work. After investigating step by step, I've realized that I don't have a /sys/fs/cgroup/devices folder on any of my swarm nodes. Does anyone have an idea where this could come from?

sudo ls /sys/fs/cgroup
cgroup.controllers      cgroup.stat             cpuset.cpus.effective  dev-mqueue.mount  io.pressure       memory.stat                    sys-kernel-debug.mount
cgroup.max.depth        cgroup.subtree_control  cpuset.mems.effective  init.scope        io.stat           -.mount                        sys-kernel-tracing.mount
cgroup.max.descendants  cgroup.threads          cpu.stat               io.cost.model     memory.numa_stat  sys-fs-fuse-connections.mount  system.slice
cgroup.procs            cpu.pressure            dev-hugepages.mount    io.cost.qos       memory.pressure   sys-kernel-config.mount        user.slice

reisholmes commented 1 year ago

Hello, thanks for the tip. However, in my case it doesn't work. After investigating step by step, I've realized that I don't have a /sys/fs/cgroup/devices folder on any of my swarm nodes. Does anyone have an idea where this could come from?

sudo ls /sys/fs/cgroup
cgroup.controllers      cgroup.stat             cpuset.cpus.effective  dev-mqueue.mount  io.pressure       memory.stat                    sys-kernel-debug.mount
cgroup.max.depth        cgroup.subtree_control  cpuset.mems.effective  init.scope        io.stat           -.mount                        sys-kernel-tracing.mount
cgroup.max.descendants  cgroup.threads          cpu.stat               io.cost.model     memory.numa_stat  sys-fs-fuse-connections.mount  system.slice
cgroup.procs            cpu.pressure            dev-hugepages.mount    io.cost.qos       memory.pressure   sys-kernel-config.mount        user.slice

I'm also in this same situation. I was using this solution to pass the iGPU through to Plex on a Docker Swarm host for hardware transcoding: https://pastebin.com/XY7GP18T

I had some new hardware that required running the latest version of Ubuntu to be recognised, but that release uses cgroup v2. At the moment I reverted back to cgroup v1 to get this working again, adapting these instructions: https://sleeplessbeastie.eu/2021/09/10/how-to-enable-control-group-v2/ Key bit (the value 0 keeps the legacy v1 hierarchy):

$ sudo sed -i -e 's/^GRUB_CMDLINE_LINUX=""/GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=0"/' /etc/default/grub
$ sudo update-grub
$ sudo reboot
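
A quick way to check which mode a node is in: the filesystem type of /sys/fs/cgroup is cgroup2fs under the unified (v2) hierarchy and tmpfs under the legacy/hybrid (v1) layout.

$ stat -fc %T /sys/fs/cgroup   # prints "cgroup2fs" on cgroup v2, "tmpfs" on cgroup v1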

I will experiment with moving to cgroup v2 combined with generic-resource advertising of the iGPU to the service as soon as I have time, via these two hints as outlined by @coltonbh:

https://gist.github.com/coltonbh/374c415517dbeb4a6aa92f462b9eb287 https://docs.docker.com/compose/gpu-support/#enabling-gpu-access-to-service-containers

If anyone has any idea how to correctly advertise a Quick Sync device on a cgroup v2 host using Docker Swarm, it would be highly appreciated. Alternatively, I guess I could migrate to Kubernetes ;)

jvrobert commented 1 year ago

I'm getting strong "the perfect is the enemy of the good" vibes from this issue. Strongly in favor of just passing through the device options and letting the buyer beware.

allfro commented 1 year ago

I've written this hack and tried it with Plex, and it seems to work: https://github.com/allfro/device-mapping-manager. Essentially it runs a privileged container that listens for Docker create events and inspects the mount points. If a mount is within the /dev folder, it walks the mount path for character and block devices and applies the necessary device cgroup rules to make the devices available. This doesn't work with FUSE yet because the default AppArmor profile blocks mounts (ugh!), but it does work with graphics cards and other devices that don't require operations blocked by Docker's AppArmor profile. It is inspired by the previous comments.
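
A very rough sketch of that idea (assuming a cgroup v1 host with the cgroupfs driver, so a container's device cgroup lives under /sys/fs/cgroup/devices/docker/<id>/, and root access to the Docker socket; all names are illustrative):

#!/bin/sh
# react to container starts and whitelist any /dev/* device nodes they bind-mount
docker events --filter event=start --format '{{.ID}}' | while read -r cid; do
  docker inspect --format '{{range .Mounts}}{{.Source}}{{"\n"}}{{end}}' "$cid" |
    grep '^/dev/' | while read -r src; do
      [ -c "$src" ] || [ -b "$src" ] || continue        # character/block devices only
      type=c; [ -b "$src" ] && type=b
      major=$((0x$(stat -c '%t' "$src")))               # stat reports hex major/minor
      minor=$((0x$(stat -c '%T' "$src")))
      echo "$type $major:$minor rwm" > "/sys/fs/cgroup/devices/docker/$cid/devices.allow"
    done
done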

allfro commented 1 year ago

If anyone has any idea how to correctly advertise a quicksync driver to a cgroup v2 using dockerswarm it would be highly appreciated. Alternatively, I guess I could migrate to kubernetes ;)

@reisholmes check this out: https://github.com/allfro/device-mapping-manager