hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

idea: eliminate the use of "pause" containers for docker networking #15086

Open shoenig opened 1 year ago

shoenig commented 1 year ago

When a task group is using "bridge" networking mode, Nomad needs a network namespace shared by all tasks in the group. These tasks are not necessarily all using the same task driver (e.g. could be exec + docker, etc.)

Unfortunately, once the Docker task driver gets involved this becomes tricky, because Docker explicitly does not support creating a container in a pre-defined network namespace (https://github.com/moby/moby/issues/7455). So unlike the exec driver, where we use normal Linux tooling to manage and join tasks to a shared network namespace, we are at the mercy of what Docker tooling enables.

The docker network create tool AFAICT has no support for creating a network namespace. Instead, each container in Docker always gets created with its own network namespace, and you can then "link" one container to another afterwards. This is where the pause container comes in: because Docker ties the network namespace to a container (i.e. a running process), Nomad needs a container that will not be stopped/replaced throughout the lifespan of the task group.
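For illustration, the pattern looks roughly like this with the plain Docker CLI (Nomad drives this via the API and uses its configured infra image; the image and container names below are placeholders):

# a pause container whose only job is to own the network namespace
docker run -d --name alloc-pause registry.k8s.io/pause:3.9
# each task container joins the pause container's network namespace
docker run -d --name task-a --network container:alloc-pause ubuntu:latest sleep 300
docker run -d --name task-b --network container:alloc-pause ubuntu:latest sleep 300
# if alloc-pause is killed, task-a and task-b lose their shared namespace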

The problem, of course, is that Bad Things can still happen to that pause container, and that is believed to be the source of most occurrences of https://github.com/hashicorp/nomad/issues/6385. While we can make improvements around cleaning up those orphaned resources, there's a lingering desire to just eliminate this whole class of problems outright.

I'm not sure what that means for users of the docker driver, unless we find a clever solution to start Docker containers in an existing network namespace, one that is created and managed by Nomad like we do for other task drivers. One idea would be to better invest in and promote the podman driver, which can also run Docker containers while being compatible with normal Linux tools and conventions.

apollo13 commented 1 year ago

One rather creative idea, though I have no idea if it works (I haven't tried this for networking yet), is to provide a custom Docker runtime that simply wraps runc and modifies the container config to change the namespace (https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md#namespaces).
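For reference, a network namespace entry in an OCI config.json looks like this per the runtime-spec (the netns path here is a placeholder); when a path is set, runc joins that existing namespace instead of creating a new one:

    "linux": {
        "namespaces": [
            { "type": "pid" },
            { "type": "mount" },
            { "type": "network", "path": "/run/netns/example" }
        ]
    }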

The runtime hooks might allow for something similar: https://github.com/opencontainers/runtime-spec/blob/main/config.md#createruntime-hooks

Sorry, I am out of clever ideas, I can only offer the crazy ideas above ;)

tgross commented 1 year ago

Those createRuntime hooks seem promising:

On Linux, for example, they are called after the container namespaces are created, so they provide an opportunity to customize the container (e.g. the network namespace could be specified in this hook).

The definition of createRuntime hooks is currently underspecified and hooks authors should only expect from the runtime that the mount namespace have been created and the mount operations performed.

apollo13 commented 1 year ago

Another option might be to work towards directly using containerd instead of dockerd. This would probably allow for plenty of things :) That said, the simpler option might be to make the podman plugin a first-class citizen and simply promote that?

apollo13 commented 4 months ago

Crosslinking #19962 for visibility.

apollo13 commented 4 months ago

So I have been looking a bit into this and I think it would actually be feasible. For quick tests there is https://github.com/awslabs/oci-add-hooks which can be used as a wrapper for runc to inject the required hooks.

To add the required information to the config, annotations can be used: https://github.com/opencontainers/runtime-spec/blob/main/config.md#annotations (the Docker CLI has --annotation map, "Add an annotation to the container (passed through to the OCI runtime) (default map[])")
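An annotation passed that way lands in the container's config.json, e.g. (the network_ns key name is just an example, matching the script further down):

    "annotations": {
        "network_ns": "/run/netns/example"
    }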

So assuming nomad would ship a nomad-runc (one might just symlink it to nomad and do the detection in the main binary) on $PATH, then nomad could call the docker plugin with --runtime nomad-runc (or rather its API equivalent). nomad-runc would then alter the OCI config to inject itself as a createRuntime hook like so:

    "createRuntime": [
        {
            "path": "/usr/bin/nomad-runc",
            "args": ["hook"],
            "env":  [ "network_config=XXX"]
        },
    ],

where it would take XXX from the annotations in the config file that nomad initially sent along via the API equivalent of --annotation (this would contain the pre-created network namespace information etc.).

Thoughts on the viability of this @tgross, @shoenig? I'd really love to see something like this in nomad since this is a rather annoying issue.

tgross commented 4 months ago

Hi @apollo13! I don't think injecting the hooks is too bad... it's more like "ok what is the hook supposed to do once it's injected" that is underspecified. For example, that excerpt from the docs specifically says

(e.g. the network namespace could be specified in this hook)

But it leaves unsaid how this would be done. We could certainly create a network namespace here, but how that network namespace ID actually gets propagated to the rest of the Docker container lifecycle is the missing bit we'd need to figure out.

apollo13 commented 4 months ago

Hi @tgross, I think this is described here: https://blog.quarkslab.com/digging-into-runtimes-runc.html, which has an example repo here: https://github.com/mihailkirov/runc-article/blob/main/scripts-hooks/createRuntime/runtimeCreate.sh So basically what one seems to be supposed to do is move the runc process into the correct namespace?

Let me check that against some running containers.

apollo13 commented 4 months ago

I think we can have it even easier, we can write a shim that just alters the namespace in the config.json: https://github.com/opencontainers/runtime-spec/blob/v1.2.0/config-linux.md#namespaces -- I will see if I can get that working with an example docker container :)

apollo13 commented 4 months ago

Okay @tgross I have got something for you. Put this in your /etc/docker/daemon.json:

{
  "runtimes": {
    "nomad": {
      "path": "/usr/local/bin/nomad-runc"
    }
  }
}

With this we can supply a custom runc and don't need to implement a container shim (which is probably much more work).

EDIT: The upside of a container shim would be that we do not need to modify /etc/docker/daemon.json. But that whole thing would be a persistent process which needs RPC style calls etc: https://github.com/containerd/containerd/blob/main/core/runtime/v2/README.md yikes

/usr/local/bin/nomad-runc looks like this:

#!/usr/bin/env python3
import json
import os
import sys

def log(*data):
    with open("/tmp/log", "a") as log_file:
        print(*data, file=log_file)

def modify_bundle(bundle):
    config_file = f"{bundle}/config.json"
    with open(config_file) as f:
        config = json.load(f)
    # passed in via `--annotation` to docker run
    annotations = config["annotations"]
    namespaces = config["linux"]["namespaces"]
    for ns in namespaces:
        # If there is a network namespace, modify it
        # from what I have seen it is always there, even if no path is set
        if ns["type"] == "network":
            ns["path"] = annotations["network_ns"]
            break
    else:
        # Better safe than sorry: if there is no network namespace entry, create one
        namespaces.append({"type": "network", "path": annotations["network_ns"]})
    with open(config_file, "w") as f:
        json.dump(config, f)

def main():
    # hook into `runc create --bundle /the/bundle/dir`
    if "create" in sys.argv and "--bundle":
        bundle = sys.argv[sys.argv.index("--bundle") + 1]
        modify_bundle(bundle)
    # exec back to runc and let it do the heavy lifting
    os.execvp("runc", ["runc"] + sys.argv[1:])

if __name__ == "__main__":
    try:
        main()
    except Exception as e:
        log(e)
        raise

When you now run:

ip netns add test
docker run --name=test --net=host --rm -it --runtime nomad --annotation network_ns=/run/netns/test ubuntu:latest sleep 100

you will see:

lsns -t net|grep test
4026535340 net       1 150650 root             0 /run/netns/test           sleep 100

that the sleep process is properly attached to the correct namespace.

A few details:

Now, is that viable? :D

apollo13 commented 4 months ago

Now that my runc wrapper is working I have further questions:

EDIT: I guess MustInitiateNetwork in the driver config is the key. This way the docker driver will initiate the network. Probably be funny if two drivers want to do this :D

apollo13 commented 4 months ago

I have a working (!) PR at https://github.com/hashicorp/nomad/pull/20017

The only thing that would probably break for now is nvidia and other alternate runtimes; there we'd need a way to delegate to the original runtime (probably doable via annotations as well, so we know what to exec instead of runc, but it might mean reading /etc/docker/daemon.json to get that information).

tgross commented 4 months ago

/usr/local/bin/nomad-runc looks like this:

So basically we're just editing the container's config.json file and then exec'ing into runc? Seems mostly reasonable. But having to modify the Docker configuration sounds like it'd be a place where we'd lose a lot of users. As you noted in the PR we'd need to account for wrapping alternate runtimes as well. (A lot of stuff like this looks easy on the surface until you try to cover all users, and then... :crying_cat_face: )

Can the runtime take arguments? Normally the way we'd like to do this kind of thing is to add a hidden subcommand, similar to what we've done for logmon. That way we don't have to ship a separate binary, worry about the shell PATH, etc. We can call os.Executable() to get the path to the Nomad binary and hand it the hidden subcommand.

More importantly, do we really need to wrap the runtime, instead of adding this config-modifying script to the createRuntime hooks? That seems like it'd be a much less involved approach if we could make it work.

  • If I have a nomad group with multiple docker tasks and the pause container infra image, how does nomad then use that to attach it to the nomad bridge? (can you link me to the relevant code?)
  • If I add a non-docker task (let's say raw_exec or exec types) to the group, which network ns will that task join? The docker managed namespace of the pause container? (again please links to the code :D)

The answer to both questions can be found in network_manager_linux.go#L83-L95 which replaces the default network manager with a drivers.DriverNetworkManager if that MustInitiateNetwork is set.

Probably be funny if two drivers want to do this :D

Right! We return an error!

apollo13 commented 4 months ago

/usr/local/bin/nomad-runc looks like this:

So basically we're just editing the container's config.json file and then exec'ing into runc? Seems mostly reasonable. But having to modify the Docker configuration sounds like it'd be a place where we'd lose a lot of users.

I guess so, but as my code shows in the PR the behavior can be made optional. Users could opt into this (and I guess users losing their network in containers might love to get it stable).

Can the runtime take arguments?

Yes, see https://docs.docker.com/reference/cli/dockerd/#configure-runc-drop-in-replacements
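A runtime entry in daemon.json can also carry arguments via runtimeArgs, so the path could point at the nomad binary itself plus a hidden subcommand (the subcommand name here is made up):

{
  "runtimes": {
    "nomad": {
      "path": "/usr/local/bin/nomad",
      "runtimeArgs": ["docker-runc-shim"]
    }
  }
}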

Normally the way we'd like to do this kind of thing is to add a hidden subcommand, similar to what we've done for logmon. That way we don't have to ship a separate binary, worry about the shell PATH, etc. We can call os.Executable() to get the path to the Nomad binary and hand it the hidden subcommand.

You'd still need to put the full path into daemon.json though if it is not on the path.

More importantly, do we really need to wrap the runtime, instead of adding this config-modifying script to the createRuntime hooks? That seems like it'd be a much less involved approach if we could make it work.

I haven't found a way to modify the createRuntime hooks (which are OCI) without changing the runtime to get called at the proper stages. Whether you then change the config.json to directly inject the namespace or to inject a createRuntime hook is an implementation detail (but you are changing config.json nevertheless, unless I missed something). Wrapping the runtime is not as bad as it sounds though.
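For comparison, the hook-injecting variant would be the same kind of config.json rewrite inside the wrapper, just deferring the namespace work to a callback; a minimal sketch in the style of the script above:

def inject_hook(config):
    # register ourselves as a createRuntime hook instead of rewriting
    # linux.namespaces directly; runc then calls the hook after the
    # container's namespaces have been created
    hooks = config.setdefault("hooks", {})
    hooks.setdefault("createRuntime", []).append({
        "path": "/usr/local/bin/nomad-runc",
        # hook args follow execv semantics, so args[0] is the program name
        "args": ["nomad-runc", "hook"],
    })

Either way config.json gets rewritten first; the hook body then still has to do the actual namespace plumbing, which is the underspecified part.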

which replaces the default network manager with a drivers.DriverNetworkManager if that MustInitiateNetwork is set.

Yeah, figured that out after posting (my PR utilizes this to make it conditional)

Probably be funny if two drivers want to do this :D

Right! We return an error!

:+1:

zackherbert commented 1 month ago

Relevant? moby/moby#47828

apollo13 commented 1 month ago

@zackherbert Kinda :) I mean, in the end the best solution would be if docker "simply" allowed joining an existing namespace. I am not sure though if it actually helps. For it to work in nomad we would need each container to be able to join a different namespace (sure, we could take your alternative number 3 and create a new bridge for every container, but that probably doesn't scale). I still wonder if a networking plugin might do the trick. But as you noted, docs are rather sparse.

tgross commented 1 month ago

For it to work in nomad we would need to have each container to be able to join a different namespace

The goal is to put everything in the same network namespace, isn't it? Networks are configured at the group level.

apollo13 commented 1 month ago

@tgross Yes, but containers from different jobs would need to be in different namespaces, no? Or even within a job different groups would have different namespaces.

tgross commented 1 month ago

Right. Each allocation of a job gets its own network namespace (an allocation is defined by the group block, and much of the time allocations from the same job would end up on different hosts anyways). So if Docker could "simply" support creating containers in an existing network namespace, we'd implement that as follows:

So effectively we'd be removing a lot more code than we'd be adding :grin:

apollo13 commented 1 month ago

So if Docker could "simply" support creating containers in an existing network namespace

Well yes, but that is still a "maybe" even in a distant future :)

FWIW I looked into the network plugin docs and it doesn't look like a networking plugin would help. The join method (https://github.com/moby/moby/blob/master/libnetwork/docs/remote.md#join) reads as if libnetwork wants to move the iface to the namespace itself:

The entries in InterfaceName represent actual OS level interfaces that should be moved by LibNetwork into the sandbox