Closed Jess3Jane closed 7 months ago
@tgross This feels an awful lot like https://github.com/hashicorp/nomad/commit/45b2c3453249cd9a0d4270dbdd2b99784f31d395 ?
Thank you for the detailed report @Jess3Jane and thank you @apollo13 for the git log spelunking. I was able to verify that reverting that commit does fix the problem.
I suspect we need to guard the DNS config override to explicit `cni/` networks, so the `bridge` network is not affected, but I'm not sure if this would also revert the intended fix.
https://github.com/hashicorp/nomad/blob/23e4b7c9d23350f9d3bd2707b0d79f413767c438/client/allocrunner/taskrunner/task_runner.go#L1137-L1145
I've placed the issue for further roadmapping.
Yeah I do not think we can solely do this for CNI/ networks. This is needed for the default bridge with transparent proxy as well because then the consul-k8s plugin will provide a DNS server.
Hi folks! This is certainly a regression caused by the bug fix I did in https://github.com/hashicorp/nomad/pull/20007. I had a comment in that PR:
there are three potential sources for DNS configuration here: the output of CNI plugins, the user-provided DNS config, and whatever is on the client
It turns out there are four! At first glance, prioritizing the four different sources would be challenging because of the work I'm doing on transparent proxy (as @apollo13 noted). But a quick check with `spew.Dump` shows that the DNS value we get back from `mode="bridge"` isn't nil, but it is empty. ~So that's likely a simple bug at `task_runner.go#L1131` where that value is never nil, and we should be checking whether it actually has any values.~
I'll do some testing of this and likely get a PR up later today if that proves to be the correct hypothesis. Thanks again @Jess3Jane, @apollo13, and @lgfa29!
Edit: bah that's two different structs, as we convert from a non-pointer CNI DNS result to a Nomad-internal struct pointer, so that's not the problem exactly but I'm sure it's something along those lines. Still investigating.
Ok, I've got it. The `cni.Result` returns a slice of DNS entries, and if there's no DNS it returns a single DNS struct that is empty (because it's not a pointer). So there's a bug in the original code in `networking_cni.go` around how we handle the fallback in the no-DNS case, but that bug was harmless until #20007 because we always threw the DNS value away anyway :facepalm:
Fix should be easy, just working up some unit tests to make sure the behavior is properly exercised as well.
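For readers following along, here is a minimal Go sketch of the pitfall described above. It is not the actual code in `networking_cni.go` or the eventual fix. The types `cniDNS` and `allocDNS` and the functions `naiveConvert` and `guardedConvert` are hypothetical stand-ins invented for illustration. The only behavior assumed from the thread is that a result with no DNS still carries one empty, non-pointer DNS entry.

```go
package main

import "fmt"

// cniDNS is a hypothetical stand-in for the value-type DNS entry carried by
// a CNI result. It is not the real library type.
type cniDNS struct {
	Nameservers []string
	Searches    []string
	Options     []string
}

// allocDNS is a hypothetical stand-in for the Nomad-internal DNS struct,
// which is handled as a pointer.
type allocDNS struct {
	Servers  []string
	Searches []string
	Options  []string
}

// isEmpty reports whether a CNI DNS entry carries no configuration at all.
func (d cniDNS) isEmpty() bool {
	return len(d.Nameservers) == 0 && len(d.Searches) == 0 && len(d.Options) == 0
}

// naiveConvert copies the first entry unconditionally. Because the entry is
// a value and not a pointer, the result is non-nil even when the entry is
// empty, so a later nil check cannot detect the no-DNS case.
func naiveConvert(entries []cniDNS) *allocDNS {
	if len(entries) == 0 {
		return nil
	}
	e := entries[0]
	return &allocDNS{Servers: e.Nameservers, Searches: e.Searches, Options: e.Options}
}

// guardedConvert returns nil unless some entry actually carries DNS
// settings, which lets the caller fall back to another DNS source.
func guardedConvert(entries []cniDNS) *allocDNS {
	for _, e := range entries {
		if !e.isEmpty() {
			return &allocDNS{Servers: e.Nameservers, Searches: e.Searches, Options: e.Options}
		}
	}
	return nil
}

func main() {
	// A single empty entry, as described for the no-DNS case.
	noDNS := []cniDNS{{}}
	fmt.Println(naiveConvert(noDNS) != nil)   // true: looks like a real DNS config
	fmt.Println(guardedConvert(noDNS) != nil) // false: caller can fall back

	withDNS := []cniDNS{{Nameservers: []string{"10.0.0.2"}}}
	fmt.Println(guardedConvert(withDNS).Servers) // [10.0.0.2]
}
```

The point of the guard is that "empty" and "absent" have to be treated the same way before the value is turned into a pointer; once it is a pointer, a nil check alone can no longer tell them apart.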
This fix will get shipped in the next release of Nomad, as well as backported to 1.6.x and 1.5.x
Nomad version
Version in which this functionality is broken:
Version in which this functionality is working (I tested down to 1.7.2 and they all behave the same):
Operating system and Environment details
This is a completely fresh DigitalOcean node (I will note all of the changes I made below). Of note, we have also hit this issue on systems with various other configurations (all Ubuntu 22.04 x64).
Issue
Docker allows you to configure a set of DNS servers to give to each container. In Nomad 1.7.5 if you did not set the DNS settings for a job using the docker driver the containers in that job would have this DNS configuration. In Nomad 1.7.6 the containers instead have the host's default DNS configuration which is often undesirable.
Reproduction steps
https://github.com/containernetworking/plugins/releases/download/v1.0.0/cni-plugins-linux-amd64-v1.0.0.tgz into `/opt/cni/bin`
`/etc/systemd/system/docker.service.d/overrides.conf`:
systemctl daemon-reload
systemctl start docker
systemctl start nomad
nomad job run <job-file.hcl>
Expected Result
If we exec into our task on Nomad 1.7.5 we can observe the correct value of `resolv.conf`
Compare this to the result of running a container with docker directly to see that they match:
Actual Result
If you do exactly the same with Nomad 1.7.6 you will instead find that the job has a different `resolv.conf`:
The exact values will likely differ on your system but we can confirm that this is the contents of `/run/systemd/resolv/resolv.conf`:
Job file (if appropriate)
This happens with all jobs that don't configure any DNS settings, but the specific job I've used for testing is: