Extending Spegel to Nomad Docker clusters

stenh0use commented 10 months ago

Hey, I really love the simple implementation of this service. I am looking for something to back GCR / AR registries without the operational overhead of running Redis and Postgres, and I think Spegel is exactly what I am looking for!

I'd like to extend this to non-Kubernetes Docker clusters. Would you be open to adding functionality so that Spegel can be bootstrapped without Kubernetes? I had a quick look over the source code and could only see the need for Kubernetes in the bootstrapping section. If I do the legwork, would you be interested in working with me to integrate Consul-based bootstrapping into Spegel?

phillebaba commented 10 months ago

We will be adding more bootstrappers as part of the work to integrate with k3s, which means that in theory it should be possible. Are you planning on using the KV store in Consul to share the public key?

Depending on how your container platform is designed it might be more interesting for you to import Spegel as a library in the same way that k3s will?

stenh0use commented 10 months ago

I only briefly looked at the kubernetes bootstrapping code so I may be being naive here. I was thinking a similar leader election process would work with Consul as the KV backend for locking and choosing the initial leader. I'm using Nomad and Consul, so was looking to run Spegel as a system job on Nomad to handle caching and sharing existing docker images across nodes.

I did see some mention of Spegel in k3s the other day, but didn't dive into the implementation details. Given that you say it will be embedded as a library it probably wouldn't be right for me unless Hashicorp were to accept it into their project.

> We will be adding more bootstrappers as part of the work to integrate with k3s

What bootstrapping methods are you planning for the k3s integration?

phillebaba commented 10 months ago

I am back from the holidays now so should be a bit faster to respond.

I think that adding support for Nomad would be great to expand the user base. It has been a while since I have used Nomad so had a look at the different container drivers out there.

Before we dive into looking at bootstrappers we need to verify that one of these drivers will work with Spegel. The main issue is that Spegel relies on CRI for the mirror configuration to work. Check how Containerd implements its CRI server.

https://github.com/containerd/containerd/blob/c98cb4af223348b78fc3b8c09762bc79983670b0/pkg/cri/server/images/image_pull.go#L132-L135

The Containerd driver does not implement any support for CRI mirror configuration.

https://github.com/Roblox/nomad-driver-containerd/blob/15d14253688c1d5c349c26c1ba407d7e7831bd5d/containerd/containerd.go#L96-L116

It looks like this is also the case with the podman driver.

https://github.com/hashicorp/nomad-driver-podman/blob/89d6a0bde7cd3dd64beb715ad9ebc031ff93b793/api/image_pull.go#L18-L70

@stenh0use have I missed some driver that you are using? I think we need to prove that Spegel will work on your Nomad setup before looking more at how to bootstrap Spegel.

stenh0use commented 10 months ago

Yeah sane thought process there, I'm using the builtin docker driver. The driver interfaces with dockerd and my assumption that it would work was based on dockerd using containerd as the runtime. This assumption seems to be wrong as I have since found that dockerd is only using the containerd runtime and is not using the image store.

So after looking through dockerd today I'm not sure how confident I am that Spegel will just work, although it looks like v24 implemented experimental support for enabling containerd as the image store.

https://github.com/moby/moby/issues/38043

There are still 20 outstanding issues attached to this issue for "fix remaining failing tests with the containerd image store" so hopefully it's not too far away from graduating from experimental to supported.

phillebaba commented 10 months ago

Oh, there is a third driver. How did I miss the built-in driver?

I had a look at how docker does registry mirroring, and it is limited. Configuring the Docker daemon is simple enough, and just requires a restart of the daemon. The problem is that this will first of all mirror all image pulls, meaning it will not be possible to exclude registries. Second of all Docker does not include any reference to the original registry in its requests, which makes resolving tags impossible.

https://docs.docker.com/docker-hub/mirror/#configure-the-docker-daemon

I am a bit stuck right now. We need to figure out how to enable tag resolving for Docker. Spegel would work on Nomad with the Docker driver if we figure that out.

rumpl commented 10 months ago

> There are still 20 outstanding issues attached to this issue for "fix remaining failing tests with the containerd image store" so hopefully it's not too far away from graduating from experimental to supported.

The only remaining issues are for the (somewhat deprecated) classic builder, and these issues are the cache not working, but the build works. I guess what I'm saying is, give this a try, tell us if something breaks :)

Here's how to enable the containerd image store feature https://docs.docker.com/storage/containerd/

phillebaba commented 10 months ago

@rumpl thanks for the input. I am unsure if using the Containerd image store would solve this problem. Spegel relies on Containerd's CRI implementation to support registry mirroring. Using another snapshotter would not solve this problem, as the image would still be pulled without the CRI API.

stenh0use commented 10 months ago

> Check how Containerd implements its CRI server. https://github.com/containerd/containerd/blob/c98cb4af223348b78fc3b8c09762bc79983670b0/pkg/cri/server/images/image_pull.go#L132-L135

I looked at the docker source code and it looks like it is using a different ImageService when containerd-snapshotter is set. The image pull resolver is defined the same way as in the containerd source code you linked to.

https://github.com/moby/moby/blob/9cebefa7175c849a0fb89be9a2c0c23755afb3e2/daemon/daemon.go#L1089-L1097
https://github.com/moby/moby/blob/9cebefa7175c849a0fb89be9a2c0c23755afb3e2/daemon/containerd/image_pull.go#L70
https://github.com/moby/moby/blob/9cebefa7175c849a0fb89be9a2c0c23755afb3e2/daemon/containerd/resolver.go#L28-L32

Although I'm not entirely sure if this solves the problem?

phillebaba commented 10 months ago

Good news, after a lot of tinkering and going through code I think I have figured it out. Using the Containerd snapshotter together with configuring the mirror in /etc/docker/daemon.json results in an HTTP request identical to the one received when pulling using Containerd. The only downside with using Docker is that it is not possible to limit mirroring to specific registries, but that is more on Docker than it is on Spegel.
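
For readers following along, the daemon settings described here could be generated roughly as below. This is a minimal sketch, not a configuration taken from the thread: `registry-mirrors` and the `containerd-snapshotter` feature flag are documented Docker daemon options, while the mirror address `http://127.0.0.1:5000` and the helper name `build_daemon_config` are placeholders chosen for illustration.

```python
import json


def build_daemon_config(mirror_url: str) -> dict:
    """Return daemon.json content enabling the containerd image store and one mirror."""
    return {
        # Documented Docker option: store and pull images through containerd.
        "features": {"containerd-snapshotter": True},
        # Documented Docker option: registry mirror(s) the daemon tries first.
        "registry-mirrors": [mirror_url],
    }


# Hypothetical address where a local Spegel registry would listen.
config = build_daemon_config("http://127.0.0.1:5000")
print(json.dumps(config, indent=2))
```

The printed JSON would go into /etc/docker/daemon.json, followed by a restart of the Docker daemon for the change to take effect.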

I think we should be able to move forward with this feature. The next step is to determine the best method of running Spegel in Nomad. The simplest should be to run it in a Docker container.

stenh0use commented 10 months ago

Great news and thanks for tinkering! I think I'm ok with the downside that it's not possible to limit the mirroring of specific registries so long as it can mirror gcp gcr/ar registries.

The best way to run it, I would think, is in a Docker container as a system job, which is similar to a DaemonSet.

phillebaba commented 10 months ago

I need to set up a test Nomad cluster to see how networking works, among other things. After that I should be able to figure out what bootstrapping should look like.

stenh0use commented 10 months ago

Can I help you somehow? I was thinking either host or bridge network would work with a static port as a system job, similar to how you've done it in kubernetes. The metrics port can be dynamic and registered as a service in consul for prometheus service discovery.

https://developer.hashicorp.com/nomad/docs/job-specification/network#mode
https://developer.hashicorp.com/nomad/docs/schedulers#system

I have a WIP for hashistack in docker: https://github.com/stenh0use/hind

I have locally updated the docker-ce version and was able to get containerd-snapshotter working for docker pull, but docker run was having problems mounting volumes in my dind setup (I need to figure that out). I can clean that up today and push it if that is helpful?

root@9ecd8301374e:/# ctr --namespace moby images ls
REF                                  TYPE                                    DIGEST                                                                  SIZE     PLATFORMS                                                                                                                                           LABELS 
docker.io/library/hello-world:latest application/vnd.oci.image.index.v1+json sha256:ac69084025c660510933cca701f615283cdbb3aa0963188770b54c31c8962493 12.7 KiB linux/386,linux/amd64,linux/arm/v5,linux/arm/v7,linux/arm64/v8,linux/mips64le,linux/ppc64le,linux/riscv64,linux/s390x,unknown/unknown,windows/amd64 -      
docker.io/library/redis:7            application/vnd.oci.image.index.v1+json sha256:a7cee7c8178ff9b5297cb109e6240f5072cdaaafd775ce6b586c3c704b06458e 49.0 MiB linux/386,linux/amd64,linux/arm/v5,linux/arm/v7,linux/arm64/v8,linux/mips64le,linux/ppc64le,linux/s390x,unknown/unknown 
root@9ecd8301374e:/# docker image ls
REPOSITORY    TAG       IMAGE ID       CREATED        SIZE
hello-world   latest    ac69084025c6   43 hours ago   24.4kB
redis         7         a7cee7c8178f   43 hours ago   204MB

In its current state, if you run make build and then make up on my project you should have yourself a test cluster in docker.

I'll update the topic of this issue as we are talking about specifically docker and nomad.

Edit: I got the snapshotter working in the dind setup linked above, and I just merged the change into main.

rumpl commented 10 months ago

If I can help don’t hesitate to ping me, I can either help or delegate internally :)

phillebaba commented 10 months ago

@stenh0use a lot has changed in Nomad since the last time I touched it, much of it for the better. I was wondering whether we even need Consul to make bootstrapping work. Could we not instead use the nomadService template command together with a static rendezvous hash? That would mean that the same IPs would be returned for all of the instances of Spegel. If I understand things correctly, the environment variable should update when the template value updates. Is this statement correct?

https://developer.hashicorp.com/nomad/docs/job-specification/template?_gl=1*121w2wu*_ga*MTIyODA5MzYxMy4xNzA0NDQ4Njgx*_ga_P7S46ZYEKW*MTcwNDc0NTkwNS4xLjEuMTcwNDc0ODU4Ni40My4wLjA.#simple-load-balancing-with-nomad-services

Then as you stated using a static port for the registry should be fine for the mirror to work.
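
The reason a static rendezvous hash gives every instance the same answer can be sketched as follows. This is an illustration of rendezvous (highest-random-weight) hashing in general, not Nomad's template implementation; the key `spegel-bootstrap`, the peer addresses, and the function name are all made up for the example.

```python
import hashlib
from typing import Sequence


def rendezvous_pick(key: str, nodes: Sequence[str]) -> str:
    """Return the node with the highest hash score for a fixed key."""
    if not nodes:
        raise ValueError("no nodes to choose from")

    def score(node: str) -> int:
        digest = hashlib.sha256(f"{key}|{node}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # The winner depends only on the set of nodes, not on their order;
    # the node name breaks the (extremely unlikely) tie deterministically.
    return max(nodes, key=lambda n: (score(n), n))


# Hypothetical service addresses as each allocation might see them.
peers = ["10.0.0.1:5001", "10.0.0.2:5001", "10.0.0.3:5001"]
bootstrap = rendezvous_pick("spegel-bootstrap", peers)
print(bootstrap)
```

As long as every allocation sees the same set of addresses and uses the same key, each one computes the same bootstrap peer without any coordination. Removing a peer that was not selected leaves the choice unchanged.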

stenh0use commented 10 months ago

I was thinking the same thing over the weekend. I do not think we should involve Consul. If we need Consul KV-type functionality, Nomad implemented this a few releases ago.

https://developer.hashicorp.com/nomad/api-docs/variables/variables
https://developer.hashicorp.com/nomad/api-docs/variables/locks

Regarding nomadService, Nomad can inject information about the deployment into the config templates as variables. I wasn't sure if that would work, as I thought the kubernetes bootstrap code was doing a leader election using distributed locks via leaderelection.LeaderElectionConfig.

If Spegel only needs an initial list of IPs to create the cluster and it handles all of the leader election itself, then we might not need to complicate a Nomad deployment with leader election.

https://developer.hashicorp.com/nomad/docs/job-specification/template#change_mode

Otherwise I was looking at something like this:

https://github.com/razorpay/metro/blob/5eb8881adbf5da6d387d1f4659916c83028dfb06/pkg/leaderelection/candidate.go#L56
https://engineering.razorpay.com/leader-election-using-consul-and-golang-73580fb14463

Edit: to answer the question about template value updates, you can choose what happens when the template changes via change_mode. You can set it to noop, restart, signal, or script. With the signal option in particular, you can configure which signal to send to the process.
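
To make the signal option concrete, here is a minimal sketch of what a process on the receiving end could do: re-read a file rendered by the template whenever it gets SIGHUP. This only illustrates the signal flow on a POSIX system. Nothing in this thread says Spegel reloads its bootstrap address this way, and the file layout and function names are hypothetical.

```python
import signal
from pathlib import Path
from typing import Optional


def read_bootstrap_addr(path: Path) -> Optional[str]:
    """Return the address stored in the rendered file, or None if missing or empty."""
    if not path.exists():
        return None
    return path.read_text().strip() or None


def install_reload_handler(path: Path, state: dict) -> None:
    """Refresh state["bootstrap_addr"] whenever the process receives SIGHUP."""

    def handler(signum, frame):
        state["bootstrap_addr"] = read_bootstrap_addr(path)

    # Must be called from the main thread; SIGHUP exists on POSIX systems only.
    signal.signal(signal.SIGHUP, handler)
```

With the signal change mode and SIGHUP as the configured signal, Nomad would deliver SIGHUP after re-rendering the file, and the handler would pick up the new value without a restart.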

phillebaba commented 10 months ago

Leader election is not actually needed. The reason it is used in Kubernetes is to make sure all nodes bootstrap with the same instance. We should be able to do the same without it using the identify protocol to distribute public keys.

I tried running Hind on Linux and I get some build issues. I will have to look at why it will not build for x86, or I will just find an alternative method of running a local multi-node Nomad cluster.

stenh0use commented 10 months ago

Ok, good to know about the leader election. We can definitely pass in the same node address on startup. I'm wondering how bootstrapping would work when a new node joins the cluster or a node fails? Can it then join the cluster based on any other node address? Given the statelessness, I guess if we get into a split cluster situation we can always stop and restart the job.

That is annoying about hind, what is the error you are getting? I will spin up a Linux box and look into fixing it; a friend said the same to me today. I have only tested hind on my laptop, which is an x86 MacBook using colima 0.6.x. It also requires the docker host to be using cgroupv2.

phillebaba commented 10 months ago

I have a working Nomad cluster running with Vagrant now, and managed to get Spegel running without a bootstrap. My plan is to create a draft PR with the instructions and then you can have a look at it and give feedback. Would that work for you?

stenh0use commented 10 months ago

Thank you so much @phillebaba! Plan sounds great with the draft PR, let me know once you have that and I'll take a look.

RoryDoherty commented 8 months ago

Is there any documentation or a rough guide of how you set this up to help with dind? My use case is that I have pods that spin up a container with a dind sidecar to allow docker commands to execute in the main container. Any time it has to pull an image the sidecar is new, so it pulls it directly from the web, even though the image may already be on the Kubernetes node itself or on another node. Is this something that Spegel can help with based on the above improvements?

stenh0use commented 7 months ago

@phillebaba thanks for the updates here. Apologies, life has got in the way and I'm yet to test the new changes. I made a rough nomad job file to get this working a while back based off the helm chart, but need more time to incorporate the changes.

stenh0use commented 7 months ago

@RoryDoherty you might be better off creating a new issue. Your architecture and where the image is meant to live would need to be understood in order to answer that question.

stenh0use commented 7 months ago

So I tested this out on Nomad, I wasn't able to get it working using bridge networking, but I was able to get it working using host networking.

This is due to the container IP address being advertised from the bootstrap server for the router to connect to. As everything is running on private addresses, the peer routers can't be reached. This wouldn't be so much of a problem with overlay networking like calico and cilium. Alternatively, an option to configure an "advertised" address as well as the listen address might work? For now I think host networking should get the job done.

When using bridge network

docker exec -it hind.nomad.client.01 curl http://192.168.32.4:30738/id
/ip4/172.17.0.2/tcp/5001/p2p/12D3KooWNnh9pmRkPdYHTpDCKEQgczy2EodxSMQ6ystwJuG1eDPb

When using host network

docker exec -it hind.nomad.client.01 curl http://192.168.32.4:22143/id
/ip4/192.168.32.4/tcp/5001/p2p/12D3KooWPGnsCBBAsMNkW9Nr1idF7irR7sDtktMyv22hyR27PZao
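
The difference between the two responses can be checked mechanically. The following sketch parses only the simple `/ip4/<addr>/tcp/<port>/p2p/<id>` shape shown above (it is not a full multiaddr parser) and compares the advertised IP with the host IP that was used to reach the /id endpoint. The function names are invented for illustration.

```python
import ipaddress


def advertised_ip(multiaddr: str) -> str:
    """Extract the IPv4 address from a multiaddr like /ip4/1.2.3.4/tcp/5001/p2p/<id>."""
    parts = multiaddr.strip().split("/")
    # parts[0] is empty because the string starts with "/".
    if len(parts) < 3 or parts[1] != "ip4":
        raise ValueError(f"unsupported multiaddr: {multiaddr!r}")
    return str(ipaddress.IPv4Address(parts[2]))


def advertises_host_ip(multiaddr: str, host_ip: str) -> bool:
    """True when the advertised IP is the host IP that peers can actually dial."""
    return advertised_ip(multiaddr) == host_ip


bridge = "/ip4/172.17.0.2/tcp/5001/p2p/12D3KooWNnh9pmRkPdYHTpDCKEQgczy2EodxSMQ6ystwJuG1eDPb"
host = "/ip4/192.168.32.4/tcp/5001/p2p/12D3KooWPGnsCBBAsMNkW9Nr1idF7irR7sDtktMyv22hyR27PZao"

print(advertises_host_ip(bridge, "192.168.32.4"))
print(advertises_host_ip(host, "192.168.32.4"))
```

In bridge mode the advertised address is the container-internal 172.17.0.2, which other nodes cannot route to, while in host mode it matches the node address 192.168.32.4.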

I need to clean up my WIP, but will post back here once I have a good reference. I mostly copied the helm chart, but I'm still a little unsure as to how the "service" address would work in Nomad, and also what significance the "local" address has and how it should be configured.

For a load-balanced "service" address, consul DNS would work well, but unfortunately you can't register a nomad job as both a consul service and a nomad service. So to do the nomadService rendezvous hashing for the bootstrap node selection, nomad service discovery has to be used.

stenh0use commented 6 months ago

Update here:

I created a repo nomad-spegel with my work. It includes 3 options for leader election (nomadService with rendezvous hashing, nomad kv locking, and consul kv locking) and options to use nomad or consul as a service discovery backend.

After doing a lot of testing I found nomadService rendezvous hashing more flaky than using nomad/consul kv locking. However, I think I managed to get it to a stable state after adding multiple services/ports to the same service name in nomad. Originally I had each interface as a separate service (spegel-<service>), but it caused allocations to drop in and out, which meant the bootstrap addr in the template kept flapping, signaling the registry to restart frequently.

The kv/locking with the consul/nomad binary works well, but it might be nice to integrate the consul/nomad kv functionality as an alternative bootstrapper at a later stage. For now what I have seems to work well.

I do have some follow up questions / issues that I wasn't able to figure out.

How does leadership work within the cluster after startup?

Once the cluster is established, does the cluster maintain leadership via gossip, or is the bootstrap/id something that can only be updated on startup? E.g. if the cluster already exists, can a peer bootstrap with any member of the cluster? I ask this as I am restarting all registries and forcing a new bootstrap process every time the leadership changes. The benefit of this at least is that the cluster will never have split rings in the event nodes bootstrapped with different sets of hosts.

Local address: neither "--local-addr=192.168.112.4:25565" nor "--local-addr=:25565" listens, or at least in my testing I couldn't get it to listen.

curl -v 192.168.112.4:25565
* Trying 192.168.112.4:25565...
* connect to 192.168.112.4 port 25565 failed: Connection refused
* Failed to connect to 192.168.112.4 port 25565: Connection refused
* Closing connection 0
curl: (7) Failed to connect to 192.168.112.4 port 25565: Connection refused

As commented above, host networking or an overlay network needs to be used due to the way the bootstrap/router IP is advertised.

Not really a question, just a statement about docker support: It looks like docker doesn't support mirroring registries other than docker.io. I need to dig further into that, but I think I recall reading about that a while ago. I was able to confirm the issue by using nerdctl vs docker cli. The nomad logs received the same error as the docker cli.

# with an image on node 1 - internet disconnected (from node 3)
# Fails with docker
root@hind:/# docker pull ghcr.io/curl/curl-container/curl-dev-debian:master
Error response from daemon: failed to resolve reference "ghcr.io/curl/curl-container/curl-dev-debian:master": failed to do request: Head "https://ghcr.io/v2/curl/curl-container/curl-dev-debian/manifests/master": dial tcp: lookup ghcr.io on 127.0.0.53:53: no such host

# Succeeds with nerdctl
root@hind:/# ./nerdctl --namespace moby pull ghcr.io/curl/curl-container/curl-dev-debian:master
ghcr.io/curl/curl-container/curl-dev-debian:master: resolved |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:017b38c8c1774a8936c36738831733c82ac7b92756392c08b55312aac1b78ffd: waiting |--------------------------------------|
config-sha256:e3c3e70745d5baf55edd5eabad6620ad7f3a77fa3410409b2f0e595a80e7c3fe: done |++++++++++++++++++++++++++++++++++++++|
layer-sha256:c2df7039d217246c2c69539feca226b0ab648b50b3534ac17db3627ca6ea3a2a: downloading |+++++++++++++++++++++++++++++---------| 244.0 Mi/318.6 MiB
layer-sha256:a6b88e165427e35f2ed91be78903ee4426fbdb43238695e405e4ce85bb93aaf7: done |++++++++++++++++++++++++++++++++++++++|
elapsed: 39.9s total: 292.8 (7.3 MiB/s)

# with a Docker Hub image on node 1 - internet disconnected (from node 3)
root@hind:/# docker pull redis:7
7: Pulling from library/redis
c16c264be546: Download complete
cb9709829e8b: Download complete
214d0afb35ca: Download complete
16a9d12e7a2c: Download complete
f7ebca356832: Download complete
00e912971fa2: Download complete
4f4fb700ef54: Already exists
Digest: sha256:f14f42fc7e824b93c0e2fe3cdf42f68197ee0311c3d2e0235be37480b2e208e6
Status: Downloaded newer image for redis:7
docker.io/library/redis:7
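
The behaviour in these logs lines up with how an image reference maps to a registry. Below is a minimal sketch of that mapping under the usual Docker naming convention: the first path component counts as a registry host only if it contains a "." or ":" or is "localhost". It is a simplification rather than Docker's actual reference parser, and the function names are made up for the example.

```python
def registry_host(image: str) -> str:
    """Return the registry an image reference points at (simplified convention)."""
    first, sep, _rest = image.partition("/")
    # No slash, or a first component that does not look like a host name,
    # means the reference is a Docker Hub short name.
    if not sep or not ("." in first or ":" in first or first == "localhost"):
        return "docker.io"
    return first


def uses_daemon_mirror(image: str) -> bool:
    """Per the observation above, dockerd only consults its mirror for docker.io."""
    return registry_host(image) == "docker.io"


for ref in ("redis:7", "ghcr.io/curl/curl-container/curl-dev-debian:master"):
    print(ref, registry_host(ref), uses_daemon_mirror(ref))
```

Under this rule `redis:7` resolves to docker.io and goes through the configured mirror, while the ghcr.io reference bypasses it, which is why the docker pull above tried to reach ghcr.io directly.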

stenh0use commented 6 months ago

@rumpl do you know if there are any plans to fix this issue https://github.com/moby/moby/issues/18818 as part of the containerd snapshotter work? I was able to pull non-Docker Hub images via spegel using nerdctl, but not with the docker daemon/cli.

stenh0use commented 6 months ago

After reading a bunch of PRs/issues on the moby page, it doesn't look like the mirror issue ever progressed/doesn't look like the feature is on the cards given the age of the above issue (unless @rumpl can provide any insights or updates there?).

The good news, though, is that it's fairly straightforward to update the dockerd code to support private registry mirrors. I compiled a custom dockerd binary tonight with a change to support ghcr.io and was able to pull non-Docker Hub images through spegel. I'm not super thrilled about having to patch every release. Hopefully either moby can deliver on the feature or the containerd plugin for nomad gets new maintainership.

https://github.com/Roblox/nomad-driver-containerd/issues/167

spegel-org / spegel

Extending Spegel to Nomad Docker clusters #303