moby / swarmkit

A toolkit for orchestrating distributed systems at any scale. It includes primitives for node discovery, raft-based consensus, task scheduling and more.
Apache License 2.0

Add support for devices with "service create" #1244

Open flx42 opened 7 years ago

flx42 commented 7 years ago

Initially reported: https://github.com/docker/docker/issues/24865, but I realized it actually belongs here. Feel free to close the other one if you want. Content of the original issue copied below.

Related: #1030

Currently, it's not possible to add devices with docker service create; there is no equivalent of docker run --device=/dev/foo.

I'm an author of nvidia-docker with @3XX0 and we need to add device files (the GPUs) and volumes to the starting containers in order to enable GPU apps as services. See the discussion here: https://github.com/docker/docker/issues/23917#issuecomment-233670078 (summarized below).

We figured out how to add a volume provided by a volume plugin:

$ docker service create --mount type=volume,source=nvidia_driver_367.35,target=/usr/local/nvidia,volume-driver=nvidia-docker [...]

But there is no solution for devices. @cpuguy83 and @justincormack suggested using --mount type=bind, but it doesn't seem to work; it's probably like doing a mknod without the proper device cgroup whitelisting.

$ docker service create --mount type=bind,source=/dev/nvidiactl,target=/dev/nvidiactl ubuntu:14.04 sh -c 'echo foo > /dev/nvidiactl'
$ docker logs stupefied_kilby.1.2445ld28x6ooo0rjns26ezsfg
sh: 1: cannot create /dev/nvidiactl: Operation not permitted

It's probably equivalent to this:

$ docker run -ti ubuntu:14.04                      
root@76d4bb08b07c:/# mknod -m 666 /dev/nvidiactl c 195 255
root@76d4bb08b07c:/# echo foo > /dev/nvidiactl
bash: /dev/nvidiactl: Operation not permitted

Whereas the following works (invalid arg is normal, but no permission error):

$ docker run -ti --device /dev/nvidiactl ubuntu:14.04
root@ea53a1b96226:/# echo foo > /dev/nvidiactl
bash: echo: write error: Invalid argument
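
To spell out the whitelisting part: under cgroup v1, a container may only open a device if its device cgroup holds an entry of the form "type major:minor access". A minimal shell sketch that formats such an entry for the 195/255 numbers from the mknod example above; the function name device_cgroup_rule is hypothetical, and it only prints the entry without touching any cgroup.

```shell
#!/bin/sh
# Hypothetical helper: format a cgroup v1 device whitelist entry.
# Arguments: device type (c or b), major, minor, access (defaults to rwm).
device_cgroup_rule() {
    printf '%s %s:%s %s\n' "$1" "$2" "$3" "${4:-rwm}"
}

# Entry matching the mknod example (character device 195:255).
device_cgroup_rule c 195 255
```

A bind mount or a plain mknod creates the node but adds no such entry, which would explain the "Operation not permitted" above.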

stevvooe commented 7 years ago

@flx42 For the container runtime, devices require special handling (a mknod syscall), so mounts won't work. We'll probably have to add some sort of support for this. (cc @crosbymichael)

Ideally, we'd like to be able to schedule over devices, as well.

cpuguy83 commented 7 years ago

@stevvooe Already have device support in the runtime, just not exposed in swarm.

flx42 commented 7 years ago

Ideally, we'd like to be able to schedule over devices, as well.

This question was raised here: https://github.com/docker/docker/issues/24750 But the discussion was redirected here: https://github.com/docker/docker/issues/23917, in order to have a single discussion thread.

flx42 commented 7 years ago

@stevvooe I quickly hacked a solution, it's not too difficult: https://github.com/flx42/swarmkit/commit/a82b9fb2b1f3387baa1e4d4447ba9af4f3e05f16 This is not a PR yet, would you be interested if I do one? Or are the swarmkit features frozen right now before 1.12? The next step would be to also modify the engine API.

flx42 commented 7 years ago

Forgot to mention that I can now run GPU containers by mimicking what nvidia-docker does:

./bin/swarmctl service create --device /dev/nvidia-uvm --device /dev/nvidiactl --device /dev/nvidia0 --bind /var/lib/nvidia-docker/volumes/nvidia_driver/367.35:/usr/local/nvidia --image nvidia/digits:4.0 --name digits

stevvooe commented 7 years ago

@flx42 I took a quick peek and the PR looks like a decent start. I am not sure about representing these as cluster-level resources for container startup. From an orchestration perspective, we have to match these up with announced resources at the node level, which might be okay. It might be better on ContainerSpec, but I'm not sure yet.

Go ahead and file as a [WIP] PR.

flx42 commented 7 years ago

@stevvooe Yeah, that's the biggest discussion point for sure.

In engine-api, devices are resources: https://github.com/docker/engine-api/blob/master/types/container/host_config.go#L249

But in swarmkit, resources are so far "fungible" objects like CPU shares and memory, with a base value and a limit. A device doesn't really fit that definition. For GPU apps we have devices that must be shared (/dev/nvidiactl) and devices that could be exclusively acquired (like /dev/nvidia0).

I decided to initially put devices into resources because there is already a function in swarmkit that creates an engine-api Resource object from a swarm Resource object: https://github.com/docker/swarmkit/blob/master/agent/exec/container/container.go#L301-L324 This method would also need to access the container spec.

I will file a PR soon to continue the discussion.

stevvooe commented 7 years ago

@flx42 Great!

We really aren't planning on following the same resource model from HostConfig for SwarmKit. In this case, we are instructing the container to mount these devices, which is specific to a container runtime. Other runtimes may not have a container or devices. Thus, I would err on ContainerSpec.

Now, I would like to see scheduling of fungible GPUs, but that might be a wholly separate flow, keeping the initial support narrow. Such services would require manual constraint and device assignment, but you still achieve the goal.
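
The manual constraint half of this can be expressed with node labels and placement constraints. A minimal sketch, assuming a hypothetical label name (gpu), node name and image; the helper print_placement_cmds is made up for illustration and only prints the commands rather than running them. It covers placement only and does not give the task access to any device.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the commands that pin a service
# to nodes carrying a given label.
# Usage: print_placement_cmds NODE LABEL IMAGE
print_placement_cmds() {
    printf 'docker node update --label-add %s=true %s\n' "$2" "$1"
    printf 'docker service create --constraint node.labels.%s==true %s\n' "$2" "$3"
}

# Example with placeholder names.
print_placement_cmds gpu-node-1 gpu nvidia/digits:4.0
```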

Let's discuss this in the context of the PR.

aluzzardi commented 7 years ago

Thanks @flx42 - I think GPU is definitely something we want to support medium term.

/cc @mgoelzer

flx42 commented 7 years ago

Thanks @aluzzardi, PR created, it's quite basic.

mlhales commented 7 years ago

The --device option is really important for my use case too. I am trying to use swarm to manage 50 Raspberry Pis to do computer vision, but I need to be able to access /dev/video0 to capture images. Without this option, I'm stuck, and have to manage them without swarm, which is painful.

stevvooe commented 7 years ago

@mlhales We need someone who is willing to work out the issues with --device in a clustered environment and support that solution, rather than just a drive-by PR. If you or a colleague want to take this on, that would be great, but this isn't as simple as adding --device.

StefanScherer commented 7 years ago

Using --device=/dev/gpiomem would be great on a RPi swarm to access GPIO on each node without privileged mode.

nazar-pc commented 7 years ago

Using --device=/dev/fuse would be great for mounting FUSE, which isn't currently possible.

StefanScherer commented 7 years ago

We found an easier way for Blinkt! LED strip to use sysfs. Now we can run Blinkt! in docker swarm mode without privileges.

mathiasimmer commented 7 years ago

@StefanScherer is it a proper alternative for using e.g. --device=/dev/mem to access GPIO on a RPi ? Would love to see an example if you would care to share :)

StefanScherer commented 7 years ago

@mathiasimmer For the use case with the Blinkt! LED strip there are only eight RGB LEDs, so using sysfs is not time critical for these few LEDs. If you want to drive hundreds of them you still need faster GPIO access to have a higher clock rate. But for Blinkt! we have forked the Node.js module and adjusted it in this branch: https://github.com/sealsystems/node-blinkt/tree/sysfs. A sample application can be found as well, along with how to use this forked module as a dependency in your own package.json.

aluzzardi commented 7 years ago

/cc @cyli

stevvooe commented 7 years ago

@aluzzardi I think we should resurrect the --device patch. I don't think there is anything in the pipeline that is sophisticated enough to handle proper, cluster-level dynamic resource allocation. Looking back at this issue, there isn't necessarily a model that will work well in all cases (mostly because no one here can seem to enumerate them).

We can always add logic in the scheduler to prevent device contention in the future.

cyli commented 7 years ago

Attempt to add devices to the container spec and plugin spec here: https://github.com/docker/swarmkit/pull/1964

I've no objection to the --device flag - cc @diogomonica ?

diogomonica commented 7 years ago

--device allows any service to escalate privileges. Why would we add this w/out profiles on services?

cyli commented 7 years ago

@diogomonica I thought profiles mainly covered capabilities, etc?

diogomonica commented 7 years ago

@cyli well, if we believe "devices" are easy enough to understand for easy user acceptance then we might not need them, but we should look critically at adding anything that allows escalation of privileges of a container to the cmd-line before we have a good way of informing everything the service will need from a security perspective to the user.

brubbel commented 7 years ago

Also following this. Very interested in access to character devices (/dev/bus/usb/...) in a docker swarm. To help some others until this is supported by docker, a workaround for swarm + usb:

  1. On the (Linux) host(s), create a udev rule which creates a symlink to your device (in my case an FTDI device), e.g. /etc/udev/rules.d/99-libftdi.rules:

    SUBSYSTEMS=="usb", ATTRS{idVendor}=="xxxx", ATTRS{idProduct}=="xxxx", GROUP="dialout", MODE="0666", SYMLINK+="my_ftdi", RUN+="/usr/bin/setupdockerusb.sh"

     Then reload the udev rules:

    sudo udevadm control --reload-rules

     When the USB device is connected, the udev manager will create a symlink /dev/my_ftdi -> /dev/bus/usb/xxx/xxx and execute /usr/bin/setupdockerusb.sh.

  2. The /usr/bin/setupdockerusb.sh (ref) This script sets the character device permissions on (the first) container with given image name.

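    #!/bin/bash
    # Resolve the udev symlink to the real device node.
    USBDEV=$(readlink -f /dev/my_ftdi)
    # stat prints the minor (%T) and major (%t) device numbers in hex.
    read -r minor major < <(stat -c '%T %t' "$USBDEV")
    if [[ -z $minor || -z $major ]]; then
        echo 'Device not found'
        exit 1
    fi
    # Convert the hex values to decimal for the cgroup entry.
    dminor=$((0x${minor}))
    dmajor=$((0x${major}))
    # First running container created from the given image.
    CID=$(docker ps --no-trunc -q --filter ancestor=my/imagename | head -1)
    if [[ -z $CID ]]; then
        echo 'CID not found'
        exit 1
    fi
    echo 'Setting permissions'
    echo "c $dmajor:$dminor rwm" > "/sys/fs/cgroup/devices/docker/$CID/devices.allow"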

  3. Create the swarm service with the following options: docker service create [...] --mount type=bind,source=/dev/bus/usb,target=/dev/bus/usb [...]

  4. Event listener (systemd service): Waits for a container to be started and sets permissions. Run with root permissions on host.

    #!/bin/bash
    # Re-apply the device permissions whenever a container starts.
    docker events --filter 'event=start' | \
    while read -r line; do
        /usr/bin/setupdockerusb.sh
    done
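
The listener in step 4 reruns the script on every start event, and the script always picks the first container for the image. One possible refinement, sketched under assumptions: let docker events print the ID of the started container and hand it to the handler. The function name for_each_started_container is hypothetical, and setupdockerusb.sh would have to be changed to take the container ID as its first argument, which the version above does not do.

```shell
#!/bin/sh
# Hypothetical helper: read container IDs from stdin (one per line)
# and run HANDLER with each ID as its only argument.
# Usage: for_each_started_container HANDLER
for_each_started_container() {
    handler=$1
    while read -r cid; do
        # Skip empty lines.
        [ -n "$cid" ] || continue
        "$handler" "$cid"
    done
}

# Intended use on the host (not executed here; assumes the script
# accepts a container ID as $1):
#   docker events --filter 'event=start' --filter 'type=container' \
#       --format '{{.Actor.ID}}' |
#       for_each_started_container /usr/bin/setupdockerusb.sh
```

Like the original script, this relies on the cgroup v1 devices.allow file and would not carry over as-is to hosts using cgroup v2, which has no such file.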

mort1k commented 7 years ago

It would be great to add --device to swarm services.

sudharkrish commented 6 years ago

@flx42, can you let us know whether your patch is available for the latest docker swarm, i.e. whether someone has ported it to the docker swarm API 1.24+, where swarmkit is integrated within the docker daemon?

flx42 commented 6 years ago

@sudharkrish No, it isn't ported AFAIK.

eyJhb commented 6 years ago

@flx42 what is the current state of this? :)

Cinderhaze commented 6 years ago

Wondering about the current state as well. I was trying to set up a simple at-home swarm environment (so I could manage it with a simple yaml file and a docker stack deploy) and was disappointed to find --device was missing from swarm mode, keeping me from being able to mount my Raspberry Pi camera via swarm.

vim-zz commented 6 years ago

Adding my use case: my company is deploying IoT sensors, and without an equivalent of --device, swarm mode can't be used.

allingeek commented 6 years ago

The year was 2018 and people had been waiting to use Swarm Mode in the IoT space for two years. The GitHub issue was dusty and neglected. Confidence and hope faded into frustration and despair. Hundreds wept as they logged into their Azure Device Management dashboard. Big tech had won. The cloud vendor handcuffs clicked - laughing - as they closed around our product wrists.

In the distance a fumbling mob of sales people chanted hyperbolic mantras. A focused and wary engineer can pick-out only one common word, though no two would spell it the same way: Kubernetes. The marketing hellscape is leaking out of the large-scale service software space and laid siege to our precious little corner previously too small for slick enterprise sales people.

A weak but proud voice cries out:

No. We won't go there. Not with the insecure platform composition model, configuration nightmare, dependencies on cloud-focused configuration management tooling, and ridiculous disk and memory footprint. We will not build a dependency on a centralized architecture in our distributed environment. We will not bring our own secret vault backend. We will not build our own internal certificate authorities. If we wanted to build our own platform we'd have just done that. We'd rather just hack something and abandon the whole product space. Glue it together with shell scripts and docker run --device /dev/video0 my-video-forwarder. Maybe sprinkle in a few SSH tunnels. It would get the job done and be both more simple and maintainable.

What they really wanted was to use Swarm Mode. To take advantage of its secure foundation, distributed nature, and simplified encrypted networking model. They'd love to take only this single dependency. Maybe there was hope yet for Swarm Mode in the IoT, but they were not going to wait for it. Not forever.

justincormack commented 6 years ago

Contributions are welcome.

BretFisher commented 6 years ago

I know it's a poor substitute, but could something like running docker run inside a service task work?

services:
  iot-ftw:
    image: docker
    command: "docker run --device xxxx <image>"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    deploy:
      mode: global

or:

https://blog.viktoradam.net/2018/05/14/podlike/
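
A sketch of how the command in the first variant could be assembled by a small wrapper script, assuming placeholder image and device names; the helper device_run_args is hypothetical and only builds the argument string, it does not call docker itself.

```shell
#!/bin/sh
# Hypothetical helper: print the docker arguments for running IMAGE
# with every listed device attached.
# Usage: device_run_args IMAGE DEVICE...
device_run_args() {
    image=$1
    shift
    args="run --rm"
    for dev in "$@"; do
        args="$args --device $dev"
    done
    printf '%s %s\n' "$args" "$image"
}

# Inside the service container the wrapper could end with (not executed
# here; relies on word splitting, so paths must not contain spaces):
#   exec docker $(device_run_args my/image /dev/video0)
```

Using exec keeps the docker CLI as the wrapper's final process, so a stop signal sent to the task should reach the CLI, which by default (--sig-proxy) forwards signals to the inner container.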

DieterReuter commented 6 years ago

@BretFisher Sure, that's how I'm addressing this "missing feature" in Swarm. It's a workaround or someone would call it a hack, but it works great. Can even run easily as a Global Service. 😇

nazar-pc commented 6 years ago

Running a container that has full access to the Docker instance is at the very least dangerous.

BretFisher commented 6 years ago

Normally I would agree for app services. However, the only thing in that container with daemon access is the docker image (in my 1st example). The iot app wouldn't have anything elevated. I don't see an increased risk here. You're trusting docker to be safe with docker.

allingeek commented 6 years ago

Just to be clear, I think a workaround like @BretFisher mentioned is a perfectly acceptable temporary solution. Especially so for global service use-cases where all worker nodes have the same requisite device attached. @nazar-pc I'm less concerned about the privilege here as the service itself is not processing user-provided input. But maybe that is naive.

@justincormack I think maybe some of the frustration is the lack of clear direction. There was at least some discussion happening in this thread right up until the end of February 2017. But that left the direction ambiguous. Pair that with the time and research investment it takes to contribute to any Docker project at this depth, and whatever effort people put in on this would almost certainly have to be funded. Perhaps I can convince one of my clients to make the investment. I'm just surprised that Docker hasn't funded this issue yet (especially considering the potential security impact).

@diogomonica Can you elaborate on what you think should be put in place re: your last comment? We're not actually preventing escalation risk by not offering devices in services. People aren't skipping the feature, they're just not using services or they're using a workaround. I'm not advocating for blind addition of the feature without consideration for security. But I think we need some informed vision to get started.

dperny commented 6 years ago

There are a lot of desired features from Swarmkit. We don't have the peoplepower right now to chase down very many of them, and it's difficult to say from here what is important. I think the least contribution we'd need to work on it would be a design proposal.

Someone has suggested out-of-band that perhaps we provide a power-user field allowing the specification of any flags that Docker supports (including --device), with the understanding that using these flags means your tasks might do weird things or fail in weird ways unless you know what you're doing. This is an ugly and dirty solution which I am not a fan of, but it may be What We Have To Do in order to just make things work.

A more elegant solution would be building out a system for noting what resources (devices) a node has available, and making scheduling decisions in swarmkit in order to put tasks where resources are available for them. This would be more complicated to build, but would likely end up being easier to use. In fact, some old proto messages that never got implemented (GenericResource) hint that this design was being pursued at some point, but was abandoned, likely due to available human time constraints.
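
As a rough illustration of that idea (not swarmkit's scheduler, and unrelated to the actual GenericResource messages): if each node advertised its devices, placement would start by filtering nodes on the device a task needs. A minimal shell sketch, assuming a made-up inventory format with one line per node, "<node> <device>..."; the function name nodes_with_device and the node names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read an inventory on stdin, one node per line in
# the form "<node> <device> [<device>...]", and print the nodes that
# advertise the wanted device.
# Usage: nodes_with_device DEVICE < inventory
nodes_with_device() {
    want=$1
    while read -r node devices; do
        for dev in $devices; do
            if [ "$dev" = "$want" ]; then
                printf '%s\n' "$node"
                break
            fi
        done
    done
}

# Example inventory with placeholder nodes; prints pi1 and pi3.
printf '%s\n' 'pi1 /dev/video0 /dev/gpiomem' 'pi2 /dev/gpiomem' 'pi3 /dev/video0' |
    nodes_with_device /dev/video0
```

A real implementation would also have to track exclusive versus shared devices, as raised earlier in the thread; this sketch ignores that.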

We've been working on swarmkit a lot, but the velocity has definitely slowed down since its release. Work has been focused on stabilizing the software, and feature development is much slower because there are fewer people working on it right now.

WDYT?

vim-zz commented 6 years ago

I would prefer to have something that allows me to use swarmkit and manage my devices than not; I am managing them at the moment anyway, with or without swarm. Currently I can't use swarm for my company's use case because of the lack of support for --device.

I think that in its current state, swarm is missing a perfect use case in IoT, where I/O from devices is a must. That leaves this use case open to alternatives which are a much worse fit in every other respect, solely because swarm doesn't allow connecting to devices.

Bottom line, your suggestion of having something working for power users is much better than having nothing.

dperny commented 6 years ago

The biggest drawback of power-user flag-pass-through is that those features might interact with swarm in unpredictable ways, and cause really esoteric issues. I'm worried that doing it too haphazardly may increase the support burden.

vim-zz commented 6 years ago

@dperny maybe enable this with some kind of user marking that says i-know-what-i-am-doing like unsafe block or similar?

allingeek commented 6 years ago

It really isn’t Swarm’s job to check that the root user telling it what to do is a “power user.” And I do think modeling devices, advertising those devices and modeling dependencies on devices would be great. You could really do powerful things with this.

cnrmck commented 6 years ago

To me, this is a vital issue for IoT. We can't build the things we need to build without having access to devices. It's further frustrated by the fact that there doesn't exist a way (that I know of) to manage --data-path-addr flags outside of docker swarm. Otherwise, docker-compose could be a simple solution to the issue, at least to manage services deployed on a single device. Right now, it's a catch-22. I can either manage my data path (so that I can send data through my connected cellular device) but not access my devices (like the camera), or vice versa. If anyone has a workaround I would greatly appreciate it. Being able to do all of that through docker swarm would be much better.

Cinderhaze commented 6 years ago

@cnrmck, the workaround above from @bretfisher is one option - https://github.com/docker/swarmkit/issues/1244#issuecomment-394343097

And the option from @brubbel is another... https://github.com/docker/swarmkit/issues/1244#issuecomment-285935430

cnrmck commented 6 years ago

@Cinderhaze Thank you so much for summarizing that. Perhaps your comment should be pinned so that other people can find it. I'll try @BretFisher's solution.

dperny commented 6 years ago

Hey, it's been a while, but y'all should take a look at #2682 that I just opened, which is a proposal for device support in swarm. Tell me what you think.

0xshawn commented 3 years ago

Any changes yet?

I really need --device to map an Intel VPU into a docker container.

allfro commented 3 years ago

This is extremely useful for people developing distributed tunneling solutions like using openvpn on a swarm. Access to /dev/net/tun is easy enough to schedule across a cluster.

dzobbe commented 3 years ago

quote this

TeoTN commented 3 years ago

This would probably also be useful for exposing Bluetooth to an IoT manager running in swarm