Closed ygersie closed 1 year ago
Hi @ygersie
Thanks for reporting this issue, and for providing a potential solution! I'll take a look at what you've got here and then discuss where it might fit in the roadmap. Also, feel free to submit a PR to add the configuration. You might get to it more quickly than we do, and community PRs are always welcome!
Hi @DerekStrickland
Thanks for the update. Yeah, I'd like to mainly confirm that this wouldn't cause any adverse side effects. Afaict there should not be any implications. If you guys agree I'm happy to push the change to make it a default setting.
I don't know that we could set it as the default. Some quick research seems to indicate that not all CNI implementations support it. I think the PR would need to default to false but allow user configuration to enable hairpin mode.
@DerekStrickland it may not be supported by all CNI implementations, but this is specifically the one used to set up:
network {
  mode = "bridge"
}
which is used to set up port forwarding using the CNI plugins and is also required when using Consul Connect. Afaik it won't affect any other (user-supplied) configurations. But if there might be other issues, then it definitely needs to be configurable. Please let me know what you think.
Hi @ygersie, hope you're doing well!
I think another thing to keep in mind here is backwards compatibility and behaviour consistency when updating the built-in CNI configuration. I wonder if we could add a new client configuration parameter, similar to client.bridge_network_name and client.bridge_network_subnet, named client.bridge_hairpin_mode, which defaults to false but allows easy setting if desired?
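As a rough sketch of the proposal above, the client configuration could look something like the following. The hairpin parameter name is the one proposed in this comment and is hypothetical at this point in the thread; the bridge name and subnet values are only illustrative.

```hcl
client {
  enabled = true

  # Existing bridge settings referenced above (values are illustrative).
  bridge_network_name   = "nomad"
  bridge_network_subnet = "172.26.64.0/20"

  # Hypothetical parameter as proposed in this thread, not a confirmed option.
  # It would default to false so existing clusters keep their current behaviour.
  bridge_hairpin_mode = true
}
```

Keeping the default at false would mean only operators who opt in see a change in the generated bridge configuration.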
This seems to be required to run most clustering applications. It seems to be a common pattern that these applications connect to all nodes in their cluster including themselves.
This includes apps like Grafana Loki and Cassandra. As @jrasell suggests, a config parameter would be quite useful.
The two workarounds are:
1) Make the changes as @ygersie suggests and compile it yourself,
2) Create a second CNI bridge. The downside to this is that you lose Consul Connect, as it only works with mode = "bridge"
Tried my hand at implementing this in https://github.com/hashicorp/nomad/pull/13834
Thanks @A-Helberg, this completely dropped off my radar again.
Also ran into this issue and saw that hairpinning packets send a SYN and never receive an ACK. Packet tracing logs through the iptables rules didn't seem to reveal anything out of the ordinary. Glad there is a fix/option coming for this, thanks!
I've left a comment here (https://github.com/hashicorp/nomad/pull/13834#pullrequestreview-1049877012) about whether we should implement this via exposing the CNI config directly (as in https://github.com/hashicorp/nomad/issues/13824), rather than adding another config knob.
Closing this one as completed by https://github.com/hashicorp/nomad/pull/15961.
While there have been discussions about a more flexible configuration approach, after further discussion we feel like adding more customization to the default bridge may result in unexpected outcomes that are hard for us to debug. The bridge network mode should be predictable and easily reproducible by the team so we can rely on common standard configuration.
Users that require more advanced customization are able to create their own bridge network using CNI. The main downside of this is that Consul Service Mesh requires network_mode = "bridge", but this is a separate feature request that is being tracked in #8953.
Feel free to 👍 and add more comments there.
Thank you everyone for the feedback!
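For anyone taking the custom CNI bridge route described above, the job side might look roughly like this. It is a sketch under assumptions: "hairpin-bridge" is a hypothetical network name, it presumes an operator-supplied CNI configuration with that name is present in the client's CNI config directory and enables hairpinMode on its bridge plugin, and the task image and port are placeholders.

```hcl
job "example" {
  datacenters = ["dc1"]

  group "app" {
    network {
      # "cni/<name>" selects a user-supplied CNI network instead of the
      # built-in bridge; "hairpin-bridge" is a hypothetical name.
      mode = "cni/hairpin-bridge"

      port "http" {
        to = 8080
      }
    }

    task "server" {
      driver = "docker"

      config {
        # Placeholder image, for illustration only.
        image = "example/app:latest"
        ports = ["http"]
      }
    }
  }
}
```

As noted in the comment above, a group on a custom CNI network like this cannot use Consul Service Mesh, which is the trade-off tracked in #8953.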
When spinning up a job that uses CNI to set up the forwarding, the container can't reach itself on the host port. This probably isn't a very common use case, but when deploying a container that needs to discover itself and its peers through a Consul endpoint, we get back the host IP + port for every instance, including itself. The connection to the endpoint that references itself then fails with a timeout. You can reproduce with the following job:
And then from the container a netcat times out:
I compiled a version of Nomad with hairpinMode enabled in the nomadCNIConfigTemplate which resolves the issue.
Can this be made either configurable or enabled by default, or is there any particular reason why I wouldn't want this?