shoenig opened this issue 1 year ago
One rather creative idea where I have no idea if it works (I didn't do this for networking yet) is to provide a custom docker runtime that simply wraps runc
and modifies the container config to change the namespace (https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md#namespaces).
The runtime hooks might allow for something similar: https://github.com/opencontainers/runtime-spec/blob/main/config.md#createruntime-hooks
Sorry, I am out of clever ideas, I can only offer the crazy ideas above ;)
Those `createRuntime` hooks seem promising:
> On Linux, for example, they are called after the container namespaces are created, so they provide an opportunity to customize the container (e.g. the network namespace could be specified in this hook).
>
> The definition of createRuntime hooks is currently underspecified and hook authors should only expect from the runtime that the mount namespace has been created and the mount operations performed.
Another option might be to work towards directly using containerd instead of dockerd. This would probably allow for plenty of things :) That said the simpler option might be to make the podman plugin a first class citizen and simply promote that?
Crosslinking #19962 for visibility.
So I have been looking a bit into this and I think it would actually be feasible. For quick tests there is https://github.com/awslabs/oci-add-hooks which can be used as a wrapper for runc to inject the required hooks.

To add the required information to the config, annotations can be used: https://github.com/opencontainers/runtime-spec/blob/main/config.md#annotations (the Docker CLI has `--annotation map`: "Add an annotation to the container (passed through to the OCI runtime)").
So assuming nomad would ship a `nomad-runc` (one might just symlink to nomad and do the detection in the main binary) on `$PATH`, then nomad could call the docker plugin via `--runtime nomad-runc` (or rather its API equivalent). Then `nomad-runc` would alter the OCI config to inject itself as a `createRuntime` hook like:
```json
"createRuntime": [
    {
        "path": "/usr/bin/nomad-runc",
        "args": ["hook"],
        "env": ["network_config=XXX"]
    }
],
```
where it would take `XXX` from the annotations in the config file that nomad initially sent along via the API equivalent of `--annotation` (this would contain the precreated network namespace information etc).
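The injection step the wrapper would perform can be sketched roughly like this (the annotation key `network_config`, the hook path, and the function name are just placeholders taken from the example above, not an actual implementation):

```python
import json


def inject_hook(config: dict, hook_path: str = "/usr/bin/nomad-runc") -> dict:
    """Append a createRuntime hook to an OCI config, forwarding the
    network config from the container annotations via an env variable."""
    network_config = config.get("annotations", {}).get("network_config", "")
    hooks = config.setdefault("hooks", {})
    hooks.setdefault("createRuntime", []).append({
        "path": hook_path,
        "args": ["hook"],
        "env": [f"network_config={network_config}"],
    })
    return config


# Example: a minimal config carrying the annotation nomad would set
config = inject_hook({"annotations": {"network_config": "/run/netns/test"}})
print(json.dumps(config["hooks"]["createRuntime"][0], indent=2))
```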
Thoughts on the viability of this @tgross, @shoenig? I'd really love to see something like this in nomad since this is a rather annoying issue.
Hi @apollo13! I don't think injecting the hooks is too bad... it's more like "ok what is the hook supposed to do once it's injected" that is underspecified. For example, that excerpt from the docs specifically says
> (e.g. the network namespace could be specified in this hook)
But it leaves unsaid how this would be done. We could certainly create a network namespace here, but how that network namespace ID actually gets propagated to the rest of the Docker container lifetime is the missing bit we'd need to figure out.
Hi @tgross, I think this is described here: https://blog.quarkslab.com/digging-into-runtimes-runc.html, which has an example repo here: https://github.com/mihailkirov/runc-article/blob/main/scripts-hooks/createRuntime/runtimeCreate.sh So basically what one seems to be supposed to do is move the runc process into the correct namespace?
Let me check that against some running containers.
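If I read the spec right, `createRuntime` hooks receive the container state as JSON on stdin, so a hook can recover the container pid and from there the namespace paths under `/proc/<pid>/ns/`. A minimal sketch of that parsing step (the example state values below are made up):

```python
import io
import json


def read_container_state(stream) -> tuple:
    """createRuntime hooks get the OCI container state as JSON on stdin.
    It includes the container process pid, whose namespaces live under
    /proc/<pid>/ns/, plus the bundle path holding config.json."""
    state = json.load(stream)
    return state["pid"], state["bundle"]


# Example state as a runtime would pass it (fields per the OCI runtime spec)
example = io.StringIO(json.dumps({
    "ociVersion": "1.2.0",
    "id": "test-container",
    "status": "creating",
    "pid": 12345,
    "bundle": "/run/containerd/io.containerd.runtime.v2.task/moby/test",
}))
pid, bundle = read_container_state(example)
# the hook could then operate on /proc/<pid>/ns/net
print(pid, bundle)
```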
I think we can have it even easier, we can write a shim that just alters the namespace in the config.json: https://github.com/opencontainers/runtime-spec/blob/v1.2.0/config-linux.md#namespaces -- I will see if I can get that working with an example docker container :)
Okay @tgross I have got something for you. Put this in your `/etc/docker/daemon.json`:
```json
{
    "runtimes": {
        "nomad": {
            "path": "/usr/local/bin/nomad-runc"
        }
    }
}
```
With this we can supply a custom runc and don't need to implement a container shim (which is probably much more work).
EDIT: The upside of a container shim would be that we do not need to modify `/etc/docker/daemon.json`. But that whole thing would be a persistent process which needs RPC-style calls etc: https://github.com/containerd/containerd/blob/main/core/runtime/v2/README.md yikes
`/usr/local/bin/nomad-runc` looks like this:
```python
#!/usr/bin/env python3
import json
import os
import sys


def log(*data):
    with open("/tmp/log", "a") as log_file:
        print(*data, file=log_file)


def modify_bundle(bundle):
    config_file = f"{bundle}/config.json"
    with open(config_file) as f:
        config = json.load(f)
    # passed in via `--annotation` to docker run
    annotations = config["annotations"]
    namespaces = config["linux"]["namespaces"]
    for ns in namespaces:
        # If there is a network namespace, modify it.
        # From what I have seen it is always there, even if no path is set.
        if ns["type"] == "network":
            ns["path"] = annotations["network_ns"]
            break
    else:
        # Better safe than sorry: if there is no network namespace, create one.
        namespaces.append({"type": "network", "path": annotations["network_ns"]})
    with open(config_file, "w") as f:
        json.dump(config, f)


def main():
    # hook into `runc create --bundle /the/bundle/dir`
    if "create" in sys.argv and "--bundle" in sys.argv:
        bundle = sys.argv[sys.argv.index("--bundle") + 1]
        modify_bundle(bundle)
    # exec back to runc and let it do the heavy lifting
    os.execvp("runc", ["runc"] + sys.argv[1:])


if __name__ == "__main__":
    try:
        main()
    except Exception as e:
        log(e)
        raise
```
When you now run:

```shell
ip netns add test
docker run --name=test --net=host --rm -it --runtime nomad --annotation network_ns=/run/netns/test ubuntu:latest sleep 100
```

you will see:

```shell
$ lsns -t net | grep test
4026535340 net 1 150650 root 0 /run/netns/test sleep 100
```

that the sleep process is properly attached to the correct namespace.
A few details:

- `--net=host` because then docker does the least work (everything else involves a prestart hook from docker that does network magic we don't want). With host networking we simply get `{"type": "network", "path": "/var/run/docker/netns/default"}` in the namespaces, which we then override with our namespace.
- `--annotation network_ns=/run/netns/test` to tell `runc` via annotations which namespace I want; nomad should be able to pass this in when launching a container.

Now, is that viable? :D
Now that my runc wrapper is working I have further questions:
EDIT: I guess `MustInitiateNetwork` in the driver config is the key. This way the docker driver will initiate the network. Probably be funny if two drivers want to do this :D
I have a working (!) PR at https://github.com/hashicorp/nomad/pull/20017
The only thing that seems to probably break for now is nvidia and other runtimes, there we probably need a way to relay to the original runtime (probably doable via annotations as well so we know what to exec instead of runc, but it might mean reading /etc/docker/daemon.json to get that information)
> `/usr/local/bin/nomad-runc` looks like this:
So basically we're just editing the container's config.json file and then exec'ing into `runc`? Seems mostly reasonable. But having to modify the Docker configuration sounds like it'd be a place where we'd lose a lot of users. As you noted in the PR we'd need to account for wrapping alternate runtimes as well. (A lot of stuff like this looks easy on the surface until you try to cover all users, and then... :crying_cat_face: )
Can the runtime take arguments? Normally the way we'd like to do this kind of thing is to add a hidden subcommand, similar to what we've done for `logmon`. That way we don't have to ship a separate binary, worry about the shell PATH, etc. We can call `os.Executable()`
More importantly, do we really need to wrap the runtime, instead of adding this config-modifying script to the `createRuntime` hooks? That seems like it'd be a much less involved approach if we could make it work.
> - If I have a nomad group with multiple docker tasks and the pause container infra image, how does nomad then use that to attach it to the nomad bridge? (can you link me to the relevant code?)
> - If I add a non-docker task (let's say raw_exec or exec types) to the group, which network ns will that task join? The docker managed namespace of the pause container? (again please links to the code :D)
The answer to both questions can be found in network_manager_linux.go#L83-L95 which replaces the default network manager with a `drivers.DriverNetworkManager` if that `MustInitiateNetwork` is set.
> Probably be funny if two drivers want to do this :D
Right! We return an error!
> > `/usr/local/bin/nomad-runc` looks like this:
>
> So basically we're just editing the container's config.json file and then exec'ing into `runc`? Seems mostly reasonable. But having to modify the Docker configuration sounds like it'd be a place where we'd lose a lot of users.
I guess so, but as my code in the PR shows, the behavior can be made optional. Users could opt into this (and I guess users losing their network in containers might love to get it stable).
> Can the runtime take arguments?
Yes, see https://docs.docker.com/reference/cli/dockerd/#configure-runc-drop-in-replacements
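For completeness, those docs also show that `daemon.json` can pass fixed arguments to a drop-in runtime via `runtimeArgs`, so some configuration could be relayed without extra plumbing (the `--debug` flag here is just an illustrative example):

```json
{
    "runtimes": {
        "nomad": {
            "path": "/usr/local/bin/nomad-runc",
            "runtimeArgs": ["--debug"]
        }
    }
}
```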
> Normally the way we'd like to do this kind of thing is to add a hidden subcommand, similar to what we've done for `logmon`. That way we don't have to ship a separate binary, worry about the shell PATH, etc. We can call `os.Executable()` to get the path to the Nomad binary and hand it the hidden subcommand.
You'd still need to put the full path into daemon.json though if it is not on the path.
> More importantly, do we really need to wrap the runtime, instead of adding this config-modifying script to the `createRuntime` hooks? That seems like it'd be a much less involved approach if we could make it work.
I haven't found a way to modify the `createRuntime` hooks (which are OCI) without changing the runtime to get called at the proper stages. Whether you then change the config.json to directly inject the namespace or inject a `createRuntime` hook is an implementation detail (but you are changing config.json nevertheless, unless I missed something). Wrapping the runtime is not as bad as it sounds though.
> which replaces the default network manager with a `drivers.DriverNetworkManager` if that `MustInitiateNetwork` is set.
Yeah, figured that out after posting (my PR utilizes this to make it conditional)
> > Probably be funny if two drivers want to do this :D
>
> Right! We return an error!
:+1:
Relevant? moby/moby#47828
@zackherbert Kinda :) I mean in the end the best solution would be if docker "simply" allowed joining an existing namespace. I am not sure though if it actually helps. For it to work in nomad we would need to have each container to be able to join a different namespace (sure, we could take your alternative number 3 and create a new bridge for every container but that probably doesn't scale). I still wonder if a networking plugin might do the trick. But as you noted, docs are rather sparse.
> For it to work in nomad we would need to have each container to be able to join a different namespace
The goal is to put everything in the same network namespace, isn't it? Networks are configured at the group level.
@tgross Yes, but containers from different jobs would need to be in different namespaces, no? Or even within a job different groups would have different namespaces.
Right. Each allocation of a job gets its own network namespace (an allocation is defined by the `group` block, and much of the time allocations from the same job would end up on different hosts anyway). So if Docker could "simply" support creating containers in an existing network namespace, we'd implement that as follows:
So effectively we'd be removing a lot more code than we'd be adding :grin:
> So if Docker could "simply" support creating containers in an existing network namespace
Well yes, but that is still a "maybe" even in a distant future :)
FWIW I looked into the network plugin docs and it doesn't look like a networking plugin would help. The join method (https://github.com/moby/moby/blob/master/libnetwork/docs/remote.md#join) reads as if libnetwork wants to move the iface to the namespace itself:
> The entries in InterfaceName represent actual OS level interfaces that should be moved by LibNetwork into the sandbox
When a task group is using `"bridge"` networking mode, Nomad needs a network namespace shared by all tasks in the group. These tasks are not necessarily all using the same task driver (e.g. could be `exec` + `docker`, etc.)

Unfortunately once the Docker task driver gets involved this becomes tricky, because Docker explicitly does not support creating a container in a pre-defined network namespace (https://github.com/moby/moby/issues/7455). So unlike with the `exec` driver, where we use normal Linux tooling to manage and join tasks to a shared network namespace, we are at the mercy of what Docker tooling enables.

The `docker network create` tool AFAICT has no support for creating a network namespace. Instead each container in Docker always gets created with its own network namespace, and then you can "link" one container to another afterwards. This is where the `pause` container comes in. Because Docker ties the network namespace to a container (i.e. a running process), Nomad needs a container that will not be stopped/replaced throughout the lifespan of the task group.

The problem, of course, is that Bad Things can still happen to that pause container, and that is believed to be the source of most occurrences of https://github.com/hashicorp/nomad/issues/6385. While we can make improvements around cleaning up those orphaned resources, there's a lingering desire to just eliminate this whole class of problems outright.

I'm not sure what that means for users of the `docker` driver, unless we find a clever solution to start Docker containers in an existing network namespace, one that is created and managed by Nomad like we do for other task drivers. One idea would be to better invest in and promote the podman driver, which can also run Docker containers while being compatible with normal Linux tools and conventions.