the-maldridge commented 2 years ago

Proposal

Right now the configuration for the nomad0 bridge device is hard coded. Among other things, this makes it impossible to use Consul Connect with nomad and IPv6.

Use-cases

This would enable IPv6 with the bridge, it would also allow the use of more advanced or configurable CNI topologies.

Attempted Solutions

To the best of my knowledge, there is no current solution to make consul connect and nomad both play nice with IPv6, or other similarly advanced dual-stack network configurations.

tgross commented 2 years ago

Hi @the-maldridge! This seems like a reasonable idea, and I've marked it for roadmapping.

The CNI configuration we use can be found at networking_bridge_linux.go#L141-L180. Note that also configures the firewall and portmap plugins.

One approach we could take here is to allow the administrator to override that template with a config file somewhere on the host. The configuration seems fairly straightforward, but then it's a matter of hunting down anywhere in the client code that has specific assumptions about that template and figuring out how to detect what's the right behavior from there.

the-maldridge commented 2 years ago

@tgross I like the template idea, it would provide the most flexibility while removing a dependency on a hard-coded string literal, something I always like to do. What do you think about using go:embed to include the default template rather than using the string literal as a means of simplifying the code that loads the various options. I can't remember off the top of my head though what version of Go that was introduced in to know if nomad already targets that version.

tgross commented 2 years ago

Yup, main targets go1.18, so we're fine on using embed and we're gradually moving some of our other embedded blobs over to that as well (the big lift still being the UI bundle).

pruiz commented 1 year ago

@tgross I think some sort of 'escape hatch' similar to those being used for envoy maybe an option here. If we could pass on some additional 'json' to some partas of the nomad's bridge conflist file, like adding additional plugins to the list, etc. that would make it easier to extend nomad's bridge CNI setup.

In my case that would allow for using cilium along with nomad's own bridge. And being able to mix and match Consul Connect enabled services with other policied by cilium. Or even have direct l3 reachability from tasks on different nomad nodes, tunneled by cilium under nomad's bridge.

pruiz commented 1 year ago

Another option that came to my mind could be using something like https://github.com/qntfy/kazaam in order to allow the user to specify some 'json transformations' to apply to normad's bridge CNI config in runtime.

This would work like:

A new optional setting (bridge_transform_rules?) holding the json string with the transformation rules to apply.
Modify buildNomadBridgeNetConfig() so if bridge_transform_rules is provided, apply such transformations using kazaam to before returning the built configuration.
This new (transformed) configuration would be passed onto CNI in order to build the final CNI bridge.
We could even pass some nomad variables to transformations, just in case this is useful in some contexts.

While this might not be the most straightforward mean to 'edit' the CNI template, this is probably the most flexible option, and can open a lot of possibilities for sysadmins to integrate nomad's bridge with many different networking systems.

Dunno what do you think @tgross.. if this seems 'acceptable' from hashicorp's point of view I could try to hack something.

Regards Pablo

tgross commented 1 year ago

@pruiz my major worry with that specific approach is that it introduces a new DSL into the Nomad job spec. Combine that with HCL2 interpolation and Levant/nomad-pack interpolation and that could get really messy. If we were going to allow job operator configuration of the bridge at all, I'm pretty sure we'd want it to be a HCL that generates the resulting JSON CNI config (which isn't all that complex of an object, in any CNI config I've seen at least).

That also introduces a separation of duties concern. Right now the cluster administrator owns the bridge configuration to the limited degree we allow that; expanding that configuration is what's been proposed as the top-level issue here. Extending some of that ownership to the job operator blurs that line.

Can you describe in a bit more detail (ideally with examples) what kind of configurations you couldn't do with the original proposal here (along with the cni mode for the network block)? That might help us get to a workable solution here.

pruiz commented 1 year ago

Hi @tgross,

I probably miss explained my self a bit. I was not proposing to add the new 'bridge_transform_rules' parameter to nomad's job spec. Just adding it to nomad client/host config..

IMHO, being able to fine-tune bridge's CNI config from job spec would be good, but it opens a lot more issues hard to solve, as bridge instance (and veth's attached to it) should be consistent among jobs for things like Consul Connect to work.

However, being able to customize bridge's CNI settings at host-level (ie. from /etc/nomad.d/nomad.hcl) opens (I think) a lot of potential. And keeping it (right now) restricted to cluster admins, makes sense (at least to me), as cluster admin is the one with actual knowledge of the networking & environment where the node lives on.

As per the new-DSL issue, I understand your point about adding another sub-DSL to config, but I just dont see how we can apply 'unlimited' modifications to a json document using HCL.

Adding some 'variables' to interpolate to the JSON emitted by networking_bridge_linux.go and replace them with new values at /etc/nomad.d/nomad.hcl, seems something workable, but as it happens with other similar approaches, the user N+1 is going to find he needs a new interpolable variable somewhere within the JSON which is not yet provided.. That's why I was looking into something more unrestricted.

pruiz commented 1 year ago

In my use case, for example, my idea would be to mix Consul Connect & Cilium on top of nomad's bridge.

In order to do so, my nomad's host config (/etc/nomad.d/nomad.hcl) would include something like:

bridge_network_subnet => $Per-Node-Network-Prefix
bridge_transform_rules => Transform emitted JSON with intended goals: * Add cilium's CNI driver to plugins chain, so we can use Cilium on nomad's native bridge (ie. using network=bridge, instead of network=cni/ on job [1]) Maybe update 'ipam' on bridge's plugin to use cilium as ipam (not sure, but may open additional integration) Maybe disable 'ipMasq' on bridge's plugin, so we control outgoing masquerade with cilium? (not sure) ** Maybe tune 'firewall' plugin currently present on JSON to skip outgoing masquerade for cilium prefixes (so jobs can reach other node using cilium when allowed to)

With this configuration applied on cluster nodes, I could be able to launch jobs using the native bridge (instead of cni/*) which will be able to make mixed use of Consul Connect and Cilium, enabling:

Connecting from a task on a job to another service exposed using Consul Connect (w/ mtls, etc.) as we do right now.
Connecting from a task on a job to another endpoint (external to nomad, or even internet) using plain networking, but allowed/disallowed based on Ciliums' bpf installed policy.
Connecting from a task on a job to another task using direct IP networking by using Ciliums' tuneling between nomad cluster nodes.
etc.

All at the same time and from within the same Task Group.

Regards Pablo

[1] Currently jobs using Cilium (by means of a network=cni/*) cannot use Consul Connect (and vice-versa)..

the-maldridge commented 1 year ago

That's a really complete and much better phrased explanation and feature matrix than I was typing up @pruiz, it sounds like we have almost identical use cases here. I also think this is something that realistically only a cluster root operator should change, since this is going to involve potentially installing additional packages at the host level to make it work.

As to the HCL/JSON issue, what about writing the transforms in HCL and then converting that to the relevant JSON as is already done for jobspecs? It adds implementation complexity for sure, but it also keeps the operator experience uniform, which it sounds like is a primary goal here.

tgross commented 1 year ago

Ok, I'm glad we're all on the same page then that this belongs to the cluster administrator.

So if I tried to boil down the "transformations" proposal a bit, the primary advantage here over simply pointing to a CNI config file is wanting to avoid handling unique-per-host CNI configuration files so that you can do things like IP prefixes per host (as opposed to having host configuration management do it). That seems reasonable given we already have Nomad creating the bridge. You'd still need a source for the per-host configuration though. Suppose we had a 90/10 solution here by supporting a cni_bridge_config_template (happy to workshop that name) that also supports interpolation, where would we put the values we're interpolating without having per-host configuration anyways? Take it from the environment somehow?

pruiz commented 1 year ago

Hi @tgross I think the cni_bridge_config_template seems like a good middle point, yes, cause:

PROs: Simplicity (obvious ;))
CONs: When a new version of nomad has a new 'default cni template' in code, cluster operator should be aware to upgrade it's local template to match changes.

And I think this is something everybody can cope with.

As for the actual template file to pass to cni_bridge_config_template, I think that could be a plain text file onto which nomad can perform such variable interpolations. Or a consul-template file which nomad can render (passing the variables to consul-template's engine), as nomad already uses consul-template for other similiar stuff. Dunno what do you guys think on this?

Last, with regard to interpolation variables, I think nomad could pass at a minimun the same values it is already using when generating bridge's json:

bridgeName => This is being passed nowadays already.
subnet => This is also being passed from 'bridge_network_subnet' config setting.
iptablesAdminChainName

And we could consider exposing as interpolation also (but not sure):

env.* => Environment variables (automatic if using consul-template)
node.* => Nomad's node current variables.
meta.* => Nomad's client meta values from config.

Regards

lgfa29 commented 1 year ago

Hi everyone 👋

After further discussion we feel like adding more customization to the default bridge may result in unexpected outcomes that are hard for us to debug. The bridge network mode should be predictable and easily reproducible by the team so we can rely on common standard configuration.

Users that require more advanced customization are able to create their own bridge network using CNI. The main downside of this is that Consul Service Mesh currently requires network_mode = "bridge", but this is a separate problem that that is being tracked in #8953.

Feel free to 👍 and add more comments there.

Thank you everyone for the ideas and feedback!

the-maldridge commented 1 year ago

Hmm, that's a frustrating resolution as it means that to use consul connect in conjunction with CNI I'd now need to edit every network block in every service template in every cluster, whether or not those tasks used a CNI network previously. At that point it seems like the better option to me is to abandon consul connect entirely and use a 3rd party CNI to achieve a similar result.

I'm following the other ticket, but it really doesn't look like any consideration is given there to the default path that nomad comes with out of the box. Any thoughts on how to continue to have working defaults and still enjoy both CNI and Consul Connect?

pruiz commented 1 year ago

@lgfa29 While consul connect is a good solution for common use cases, it is clearly lacking when trying to use it to deploy applications requiring more complex network setups (for example applications requiring direct [non-nat, non-proxied] connections from clients, or clusters requiring flexible connection between nodes on dynamically allocated ports, solutions requiring maxing out the network I/O performance of the host, etc.).

For such situations the only option available is to use CNI, but even this is somewhat limited on nomad (ie. CNI has to be setup per host, networking has to be defined on a job-basis and CNI stuff has to be already present and pre-deployed/running on nomad-server before deploying the job, one can not mix connect with custom-CNIs, etc.). And, at the same time, there is no solution for having more than "one networking" (ie. CNI plus bridge) for a single Task, nor there is a clear solution for mixing jobs using Consul Connect and jobs using CNI.

This is clearly an issue for nomad users, as this limits Consul Connect to simple use cases, forcing us to deploy anything not (let's say) Consul-Connect-Commpatible outside of nomad, on top of a different solution (for deployment, traffic policying, etc.) and relay on outbound gateway for providing access from nomad's jobs to such 'outside' elements.

I understand hashicorp needs a product that can be supported with some clear use case and limits. But at the same time we as community need some extensibility for use cases not needing the be covered by commercial hashicorp support options. That's why the idea of this being a setting for extending the standard nomad feature made sense to me. HashiCorp could simply label this as 'community supported-only' or something like that and focus on enhancing consul connect, but at the same time let the community work around until something better arrives.

As stated I was willing to provide a PR for this new feature, but right now, I feel a bit stranded, as I don't really understand why not supporting a use case which on nomad code-base only implies being able to extend the CNI config, and which can be declared 'community supported' if that's a problem for hashicorp's business. I just hope you guys can reconsider this issue.

Regards Pablo

brotherdust commented 1 year ago

@lgfa29 While consul connect is a good solution for common use cases, it is clearly lacking when trying to use it to deploy applications requiring more complex network setups (for example applications requiring direct [non-nat, non-proxied] connections from clients, or clusters requiring flexible connection between nodes on dynamically allocated ports, solutions requiring maxing out the network I/O performance of the host, etc.).

For such situations the only option available is to use CNI, but even this is somewhat limited on nomad (ie. CNI has to be setup per host, networking has to be defined on a job-basis and CNI stuff has to be already present and pre-deployed/running on nomad-server before deploying the job, one can not mix connect with custom-CNIs, etc.). And, at the same time, there is no solution for having more than "one networking" (ie. CNI plus bridge) for a single Task, nor there is a clear solution for mixing jobs using Consul Connect and jobs using CNI.

This is clearly an issue for nomad users, as this limits Consul Connect to simple use cases, forcing us to deploy anything not (let's say) Consul-Connect-Commpatible outside of nomad, on top of a different solution (for deployment, traffic policying, etc.) and relay on outbound gateway for providing access from nomad's jobs to such 'outside' elements.

I understand hashicorp needs a product that can be supported with some clear use case and limits. But at the same time we as community need some extensibility for use cases not needing the be covered by commercial hashicorp support options. That's why the idea of this being a setting for extending the standard nomad feature made sense to me. HashiCorp could simply label this as 'community supported-only' or something like that and focus on enhancing consul connect, but at the same time let the community work around until something better arrives.

As stated I was willing to provide a PR for this new feature, but right now, I feel a bit stranded, as I don't really understand why not supporting a use case which on nomad code-base only implies being able to extend the CNI config, and which can be declared 'community supported' if that's a problem for hashicorp's business. I just hope you guys can reconsider this issue.

Regards

Pablo

I, too, support @pruiz use-case. I had to abandon Hashistack altogether because of Nomad's opinions on CNI. Consul Connect is a good generic solution, but it leaves much to be desired in the flexibility department. I tried to plumb in Cilium using their (deprecated) Consul integration and after a few months I had to bag it. It doesn't seem impossible, but it's beyond my current capabilities. So, yes. What Pablo is proposing doesn't seem unreasonable and I ask HCI to reconsider.

lgfa29 commented 1 year ago

Hi everyone 👋

Thanks for the feedback. I think I either didn't do a good job explaining myself or completely misunderstood the proposal. I will go over the details and check with the rest of the team again to make sure I have things right.

Apologies for the confusion.

lgfa29 commented 1 year ago

Hi everyone :wave:

After a more thorough look into this I want to share what I have observed so far and expand on the direction we're planning to take for Nomad's networking story.

The main question I'm trying to answer is:

Does this proposal provide new functionality or is it a way to workaround shortcomings of the Nomad CNI implementation?

From my investigation so far I have not been able to find examples where a custom CNI configuration would not be able to accomplish the same results as a the proposed cni_bridge_config_template. That being said, I have probably missed several scenarios so I am very curious to hear more examples and use cases that I may have missed.

My first test attempted to validate the following:

Can I can create a custom bridge network based on Nomad's default bridge?

For this I copied Nomad's bridge configuration from the docs and changed the IP range.

mybridge.conflist

```json { "cniVersion": "0.4.0", "name": "mybridge", "plugins": [ { "type": "loopback" }, { "type": "bridge", "bridge": "mybridge", "ipMasq": true, "isGateway": true, "forceAddress": true, "ipam": { "type": "host-local", "ranges": [ [ { "subnet": "192.168.15.0/24" } ] ], "routes": [ { "dst": "0.0.0.0/0" } ] } }, { "type": "firewall", "backend": "iptables", "iptablesAdminChainName": "NOMAD-ADMIN" }, { "type": "portmap", "capabilities": {"portMappings": true}, "snat": true } ] } ```

I then used the following job to test each network.

example.nomad

```hcl job "example" { datacenters = ["dc1"] group "cache-cni" { network { mode = "cni/mybridge" port "db" { to = 6379 } } service { name = "redis" port = "db" provider = "nomad" address_mode = "alloc" } task "redis" { driver = "docker" config { image = "redis:7" ports = ["db"] } } task "ping" { driver = "docker" lifecycle { hook = "poststart" sidecar = true } config { image = "redis:7" command = "/bin/bash" args = ["/local/script.sh"] } template { data = <

I was able to access the allocations from the host via the port mapping, as expected from the default bridge network.

shell

```console $ nomad service info redis Job ID Address Tags Node ID Alloc ID example 192.168.15.46:6379 [] 7c8fc26d 4068e3b7 example 172.26.64.135:6379 [] 7c8fc26d f94a4782 $ nc -v 192.168.15.46 6379 Connection to 192.168.15.46 6379 port [tcp/redis] succeeded! ping +PONG ^C $ nc -v 172.26.64.135 6379 Connection to 172.26.64.135 6379 port [tcp/redis] succeeded! ping +PONG ^C $ nomad alloc status 40 ID = 4068e3b7-b4f9-b935-db17-784a693aa134 Eval ID = d60c7ff0 Name = example.cache-cni[0] Node ID = 7c8fc26d Node Name = lima-default Job ID = example Job Version = 0 Client Status = running Client Description = Tasks are running Desired Status = run Desired Description = Created = 1m28s ago Modified = 1m14s ago Deployment ID = 09df8981 Deployment Health = healthy Allocation Addresses (mode = "cni/mybridge"): Label Dynamic Address *db yes 127.0.0.1:20603 -> 6379 Task "ping" (poststart sidecar) is "running" Task Resources: CPU Memory Disk Addresses 48/100 MHz 840 KiB/300 MiB 300 MiB Task Events: Started At = 2023-02-07T22:59:47Z Finished At = N/A Total Restarts = 0 Last Restart = N/A Recent Events: Time Type Description 2023-02-07T22:59:47Z Started Task started by client 2023-02-07T22:59:46Z Task Setup Building Task Directory 2023-02-07T22:59:42Z Received Task received by client Task "redis" is "running" Task Resources: CPU Memory Disk Addresses 17/100 MHz 3.0 MiB/300 MiB 300 MiB Task Events: Started At = 2023-02-07T22:59:46Z Finished At = N/A Total Restarts = 0 Last Restart = N/A Recent Events: Time Type Description 2023-02-07T22:59:46Z Started Task started by client 2023-02-07T22:59:45Z Task Setup Building Task Directory 2023-02-07T22:59:42Z Received Task received by client $ nc -v 127.0.0.1 20603 Connection to 127.0.0.1 20603 port [tcp/*] succeeded! ping +PONG ^C $ nomad alloc status f9 ID = f94a4782-d4ad-d0e9-ced7-de90c1cfadf3 Eval ID = d60c7ff0 Name = example.cache-bridge[0] Node ID = 7c8fc26d Node Name = lima-default Job ID = example Job Version = 0 Client Status = running Client Description = Tasks are running Desired Status = run Desired Description = Created = 1m50s ago Modified = 1m35s ago Deployment ID = 09df8981 Deployment Health = healthy Allocation Addresses (mode = "bridge"): Label Dynamic Address *db yes 127.0.0.1:20702 -> 6379 Task "ping" (poststart sidecar) is "running" Task Resources: CPU Memory Disk Addresses 51/100 MHz 696 KiB/300 MiB 300 MiB Task Events: Started At = 2023-02-07T22:59:47Z Finished At = N/A Total Restarts = 0 Last Restart = N/A Recent Events: Time Type Description 2023-02-07T22:59:47Z Started Task started by client 2023-02-07T22:59:47Z Task Setup Building Task Directory 2023-02-07T22:59:42Z Received Task received by client Task "redis" is "running" Task Resources: CPU Memory Disk Addresses 14/100 MHz 2.5 MiB/300 MiB 300 MiB Task Events: Started At = 2023-02-07T22:59:47Z Finished At = N/A Total Restarts = 0 Last Restart = N/A Recent Events: Time Type Description 2023-02-07T22:59:47Z Started Task started by client 2023-02-07T22:59:46Z Task Setup Building Task Directory 2023-02-07T22:59:42Z Received Task received by client $ nc -v 127.0.0.1 20702 Connection to 127.0.0.1 20702 port [tcp/*] succeeded! ping +PONG ^C ```

So it seems to be possible to have a custom bridge network based off Nomad's default that behaves the same way, with the exception of some items that I will address below.

Next I wanted to test something different:

Can I create networks with other CNI plugins based on the Nomad bridge?

For the first test I used the macvlan plugin since it's a simple one.

macvlan

```json { "cniVersion": "0.4.0", "name": "mymacvlan", "plugins": [ { "type": "loopback" }, { "name": "mynet", "type": "macvlan", "master": "eth0", "ipam": { "type": "host-local", "ranges": [ [ { "subnet": "192.168.10.0/24" } ] ], "routes": [ { "dst": "0.0.0.0/0" } ] } }, { "type": "portmap", "capabilities": { "portMappings": true }, "snat": true } ] } ``` ```hcl job "example" { datacenters = ["dc1"] group "cache-bridge" { network { mode = "bridge" port "db" { to = 6379 } } service { name = "redis" port = "db" provider = "nomad" address_mode = "alloc" } task "redis" { driver = "docker" config { image = "redis:7" ports = ["db"] } } task "ping" { driver = "docker" lifecycle { hook = "poststart" sidecar = true } config { image = "redis:7" command = "/bin/bash" args = ["/local/script.sh"] } template { data = <

I wasn't able to get cross-network and host port mapping communication working, but allocations in the same network were able to communicate. I think this is where my lack of more advanced networking configuration is a problem and I wonder if I'm just missing a route configuration somewhere.

macvlan - same network

```hcl job "example" { datacenters = ["dc1"] group "cache-cni-1" { network { mode = "cni/mymacvlan" port "db" { to = 6379 } } service { name = "redis" port = "db" provider = "nomad" address_mode = "alloc" } task "redis" { driver = "docker" config { image = "redis:7" ports = ["db"] } } task "ping" { driver = "docker" lifecycle { hook = "poststart" sidecar = true } config { image = "redis:7" command = "/bin/bash" args = ["/local/script.sh"] } template { data = <

Next I tried a Cilium network setup since @pruiz and @brotherdust mentioned it. It is indeed quite challenging to get it working, but I think I was able to get enough running for what I needed. First I tried to run as an external configuration using the generic Veth Chaining approach because I think this is what is being suggested here, the ability to chain additional plugins to Nomad's bridge.

Cilium - custom CNI

Once again I started from the bridge configuration in our [docs](https://developer.hashicorp.com/nomad/docs/networking/cni#nomad-s-bridge-configuration) and chained `"type": "cilium-cni"` as mentioned in the [Cilium docs](https://docs.cilium.io/en/v1.12/gettingstarted/cni-chaining-generic-veth/#create-a-cni-configuration-to-define-your-chaining-configuration). ```json { "cniVersion": "0.4.0", "name": "cilium", "plugins": [ { "type": "loopback" }, { "type": "bridge", "bridge": "mybridge", "ipMasq": true, "isGateway": true, "forceAddress": true, "ipam": { "type": "host-local", "ranges": [ [ { "subnet": "192.168.15.0/24" } ] ], "routes": [ { "dst": "0.0.0.0/0" } ] } }, { "type": "firewall", "backend": "iptables", "iptablesAdminChainName": "NOMAD-ADMIN" }, { "type": "portmap", "capabilities": {"portMappings": true}, "snat": true }, { "type": "cilium-cni" } ] } ``` I also used the Consul KV store backend because that's what I most familiarized with, I don't think this choice influences the test. ```console $ consul agent -dev ``` I then copied the Cilium CNI plugin to my host's `/opt/cni/bin/`. I actually don't know where to download it from, so I just extract it from the Docker image. ```console $ docker run --rm -it -v /opt/cni/bin/:/host cilium/cilium:v1.12.6 /bin/bash root@df6cdba526a8:/home/cilium# cp /opt/cni/bin/cilium-cni /host root@df6cdba526a8:/home/cilium# exit ``` Enable some Docker driver configuration to be able to mount host volumes and run the Cilium agent in privileged mode. ```hcl client { cni_config_dir = "..." } plugin "docker" { config { allow_privileged = true volumes { enabled = true } } } ``` Start Nomad and run the Cilium agent job. ```hcl job "cilium" { datacenters = ["dc1"] group "agent" { task "agent" { driver = "docker" config { image = "cilium/cilium:v1.12.6" command = "cilium-agent" args = [ "--kvstore=consul", "--kvstore-opt", "consul.address=127.0.0.1:8500", "--enable-ipv6=false", ] privileged = true network_mode = "host" volumes = [ "/var/run/docker.sock:/var/run/docker.sock", "/var/run/cilium:/var/run/cilium", "/sys/fs/bpf:/sys/fs/bpf", "/var/run/docker/netns:/var/run/docker/netns:rshared", "/var/run/netns:/var/run/netns:rshared", ] } } } } ``` Make sure things are good. ```console $ sudo cilium status KVStore: Ok Consul: 127.0.0.1:8300 Kubernetes: Disabled Host firewall: Disabled CNI Chaining: none Cilium: Ok 1.12.6 (v1.12.6-9cc8d71) NodeMonitor: Disabled Cilium health daemon: Ok IPAM: IPv4: 2/65534 allocated from 10.15.0.0/16, BandwidthManager: Disabled Host Routing: Legacy Masquerading: IPTables [IPv4: Enabled, IPv6: Disabled] Controller Status: 20/20 healthy Proxy Status: OK, ip 10.15.100.217, 0 redirects active on ports 10000-20000 Global Identity Range: min 256, max 65535 Hubble: Disabled Encryption: Disabled Cluster health: 1/1 reachable (2023-02-07T23:51:41Z) ``` Run job that uses `bridge` and `cni/cilium`. ```hcl job "example" { datacenters = ["dc1"] group "cache-cni" { network { mode = "cni/cilium" port "db" { to = 6379 } } service { name = "redis" port = "db" provider = "nomad" address_mode = "alloc" } task "redis" { driver = "docker" config { image = "redis:7" ports = ["db"] } } task "ping" { driver = "docker" lifecycle { hook = "poststart" sidecar = true } config { image = "redis:7" command = "/bin/bash" args = ["/local/script.sh"] } template { data = <

Although far from a production deployment, I think this does show that it's possible to setup custom CNI networks without modifying Nomad's default bridge.

Except for the points I mentioned earlier, so I will try to list them all here and open follow-up issues for us to address them.

The logic to cleanup iptables rules looks for rules with nomad as comment. This is not true for custom CNI networks so they may leak.
Nomad automatically creates an iptables rule to forward traffic to its bridge. Currently this may need to be done manually if the rule is not present or if the custom CNI network uses a different firewall chain.
CNI networks are not reloaded on SIGHUP, so they require the agent to restart. CNI plugins are sometimes deployed as fully bundled artifacts, like Helm charts, that are able to apply CNI configs to a live cluster.
CNI configuration and plugins must be placed in Nomad clients, usually requiring and additional configuration management layer.
Connect integration assumes bridge network mode at job validation.
Communication across networks is not guaranteed and may require additional user configuration.
Allocations are limited to a single network preventing them from accessing multiple networks at the same time (like a CNI network and bridge).
Lack of documentation about CNI networking and how to deploy popular solutions from vendors.

These are all limitations of our current CNI implementation that we need to address, and are planning to do so. The last item is more complicated since it requires more partnership and engagement with third-party providers, but we will also be looking into how to improve that.

What's left to analyze is main the question:

Does this proposal provide new functionality or is it a way to workaround shortcomings of the Nomad CNI implementation?

For this I applied the same Cilium configuration directly to the code that generates the Nomad bridge. If I understood the proposal correctly chaining CNI plugins to the Nomad bridge would be the main use case for this feature, but please correct me if I'm wrong.

But things were not much better, and most of the items above were still an issue.

Cilium - embedded in Nomad

The first thing you notice is what I mentioned in my previous comment. `network_mode = "bridge"` now behaves completely differently from usual. Trying to run the default `exmaple.nomad` job in `bridge` mode results in failures because Nomad's `bridge` is now actually Cilium. ```hcl job "example" { datacenters = ["dc1"] group "cache" { network { mode = "bridge" port "db" { to = 6379 } } task "redis" { driver = "docker" config { image = "redis:7" ports = ["db"] auth_soft_fail = true } resources { cpu = 500 memory = 256 } } } } ``` ```console $ nomad job status example ID = example Name = example Submit Date = 2023-02-08T00:41:51Z Type = service Priority = 50 Datacenters = dc1 Namespace = default Status = running Periodic = false Parameterized = false Summary Task Group Queued Starting Running Failed Complete Lost Unknown cache 0 1 0 1 0 0 0 Latest Deployment ID = e0e9f013 Status = running Description = Deployment is running Deployed Task Group Desired Placed Healthy Unhealthy Progress Deadline cache 1 2 0 1 2023-02-08T00:51:51Z Allocations ID Node ID Task Group Version Desired Status Created Modified 30ffb0ce 643dc5fa cache 0 run pending 23s ago 23s ago d7c34e23 643dc5fa cache 0 stop failed 1m30s ago 22s ago $ nomad alloc status d7 ID = d7c34e23-0c11-e57f-1b28-ff2274264854 Eval ID = eccbefd9 Name = example.cache[0] Node ID = 643dc5fa Node Name = lima-default Job ID = example Job Version = 0 Client Status = failed Client Description = Failed tasks Desired Status = stop Desired Description = alloc was rescheduled because it failed Created = 1m45s ago Modified = 37s ago Deployment ID = e0e9f013 Deployment Health = unhealthy Replacement Alloc ID = 30ffb0ce Allocation Addresses (mode = "bridge"): Label Dynamic Address *db yes 127.0.0.1:30418 -> 6379 Task "redis" is "dead" Task Resources: CPU Memory Disk Addresses 500 MHz 256 MiB 300 MiB Task Events: Started At = N/A Finished At = 2023-02-08T00:42:29Z Total Restarts = 0 Last Restart = N/A Recent Events: Time Type Description 2023-02-08T00:42:30Z Killing Sent interrupt. Waiting 5s before force killing 2023-02-08T00:42:29Z Alloc Unhealthy Unhealthy because of failed task 2023-02-08T00:42:29Z Setup Failure failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="cilium-cni" failed (add): unable to connect to Cilium daemon: failed to create cilium agent client after 30.000000 seconds timeout: Get "http:///var/run/cilium/cilium.sock/v1/config": dial unix /var/run/cilium/cilium.sock: connect: no such file or directory Is the agent running? 2023-02-08T00:41:51Z Received Task received by client ``` Running the Cilium agent and deleting the endpoint labels as before fixes the problem and the allocation is now at least healthy. But, also like before, we can't access the task from the host or outside the Cilium network. Again, this is probably my fault and could be fixed with proper network configuration. Since we're in `bridge` we are able to run Connect jobs, so I tested the `countdash` example generated from `nomad job init -short -connect`, but it did not work. As I described in https://github.com/hashicorp/nomad/issues/8953, I think there's more work missing than just removing the validations.

And so, looking at the list of issues above, the proposal here would only incidentally fix the first two items because of the way things are named and currently implemented, and both items are things we need to fix for CNI anyway.

Now, to address some of the comments since the issue was close.

From @the-maldridge.

it means that to use consul connect in conjunction with CNI I'd now need to edit every network block in every service template in every cluster, whether or not those tasks used a CNI network previously.

Having to update jobspecs is indeed an unfortunate consequence, but this is often true for new features in general and, hopefully, it's a one-time process. Modifying Nomad's bridge would also like require all allocations to be recreated so a migration of workload is also expected in both scenarios. The upgrade path also seems risky? How would you go from the default bridge to a customized bridge?

At that point it seems like the better option to me is to abandon consul connect entirely and use a 3rd party CNI to achieve a similar result.

Nomad networking features and improvements have been lagging and we're planning to address them. CNI, Consul Connect, IPv6 (which was the original use case you mentioned) are all things we are looking into improving, but unfortunately I don't have any dates to provide at this point to help you make a decision on which tool to use.

I'm following the other ticket, but it really doesn't look like any consideration is given there to the default path that nomad comes with out of the box.

You are right, the issue I linked was about enabling Consul Connect on CNI networks. https://github.com/hashicorp/nomad/issues/14101 and https://github.com/hashicorp/nomad/issues/7905 are about IPv6 support in Consul Connect and Nomad's bridge.

Any thoughts on how to continue to have working defaults and still enjoy both CNI and Consul Connect?

Right now the only way I can think of to solve your issue is to run a patched version of Nomad to customize the hardcoded bridge config. But even that I'm not sure if it will be enough to fully enable Connect with IPv6.

From @pruiz.

While consul connect is a good solution for common use cases, it is clearly lacking when trying to use it to deploy applications requiring more complex network setups

Agreed. We (the Nomad team) need to find a way to address this and better integrate with other networking solutions. We don't have any specifics at this point, but community support is always a good start and much appreciated!

For such situations the only option available is to use CNI, but even this is somewhat limited on nomad

:100: we need to improve our CNI integration.

CNI has to be setup per host, networking has to be defined on a job-basis and CNI stuff has to be already present and pre-deployed/running on nomad-server before deploying the job

That's correct, but so would be the proposal here if I understood it correctly?

one can not mix connect with custom-CNIs

Right, and the plan is to address this in https://github.com/hashicorp/nomad/issues/8953. It may be that removing the validation is enough. Having more people test the custom binary I provided there would be very helpful.

And, at the same time, there is no solution for having more than "one networking" (ie. CNI plus bridge) for a single Task

That's also true, but also not covered by this proposal? As far as I know, Kubernetes also suffers from the same issue and there are meta-plugins to multiplex different networks, like Multus. I have this in my list above to be created as a follow-up issue.

nor there is a clear solution for mixing jobs using Consul Connect and jobs using CNI.

Yup, that's covered in https://github.com/hashicorp/nomad/issues/8953. One thing to clarify is what do you mean by "mixing jobs". Do you envision an alloc that uses Consul Connect to be able to reach an alloc on Cilium for example? If that's the case I'm not sure if it would work without a gateway :thinking:

This is clearly an issue for nomad users, as this limits Consul Connect to simple use cases, forcing us to deploy anything not (let's say) Consul-Connect-Commpatible outside of nomad, on top of a different solution (for deployment, traffic policying, etc.) and relay on outbound gateway for providing access from nomad's jobs to such 'outside' elements.

I'm sorry, I didn't quite follow this part. Are you talking about, for example, having to deploy the Cilium infrastructure to use something beyond Connect?

I understand hashicorp needs a product that can be supported with some clear use case and limits. But at the same time we as community need some extensibility for use cases not needing the be covered by commercial hashicorp support options. That's why the idea of this being a setting for extending the standard nomad feature made sense to me. HashiCorp could simply label this as 'community supported-only' or something like that and focus on enhancing consul connect, but at the same time let the community work around until something better arrives.

This is the void we expect CNI to fill by allowing users to create their own custom networks that fits their specific needs. This specific item is not about commercial support but feature support in general. We try to be careful about backwards compatibility and this would introduce a feature we expect to deprecate. I understand the frustration but, historically, we treat code shipped as code being used. For experimentations a temporary fork may be the best approach.

As stated I was willing to provide a PR for this new feature, but right now, I feel a bit stranded, as I don't really understand why not supporting a use case which on nomad code-base only implies being able to extend the CNI config, and which can be declared 'community supported' if that's a problem for hashicorp's business.

This is not a business decision, and I apologize if I made it sound like one. This was a technical decision as we found that arbitrary modifications to the default bridge network could be dangerous as it can break things in very subtle ways and the Nomad bridge has a predictable behaviour that we often rely on to debug issues.

We are always happy to receive contributions, and I hope this doesn't discourage you from future contributions (we have lots to do!). But sometimes we need to close feature requests to make sure we are moving towards a direction we feel confident in maintaining.

I just hope you guys can reconsider this issue.

Always! As I mentioned, the main point that I may be missing is understanding what you would be able to do with this feature that would not be possible with a well functioning CNI integration. Could you provide an example of what you would like to add to Nomad's bridge config? That can help us understand the use case better and yes, we are always willing to reconsider.

From @brotherdust.

I had to abandon Hashistack altogether because of Nomad's opinions on CNI. Consul Connect is a good generic solution, but it leaves much to be desired in the flexibility department. I tried to plumb in Cilium using their (deprecated) Consul integration and after a few months I had to bag it.

That's unfortunate but definitely understandable given where we are right now. Anything specific you could share to help us improve?

To finish this (already very) long comment I want to make sure that it is clear that closing this issue it's just an indication that we find a stronger and better CNI integration to be a better approach for customized networking. What "stronger and better" means depends a lot from your input, so I appreciate all the discussion and feedback so far, please keep them coming :slightly_smiling_face:

brotherdust commented 1 year ago

@lgfa29 , thank you for your thoughtful and detailed response. I'm sure it took some time out of your regular activities and I can appreciate it!

I agree with you 100% that Nomad needs better CNI integration and much better IPv6 support.

That's unfortunate but definitely understandable given where we are right now. Anything specific you could share to help us improve?

I need some time to gather my thoughts into something more cogent. I'll get back to you soon.

the-maldridge commented 1 year ago

Wow, kudos for such an in-depth survey of the available options. I'm truly impressed that you got Cilium working and were able to use it even in a demo environment.

I think perhaps the deeper issue that I encounter with this while looking at it is that there is a constant upgrade treadmill to operate an effective cluster. A treadmill that often times involves tracking down users in remote teams, who do not have dedicated operations resources but still expect the things they want to do in the hosted cluster environment to work. The kubernetes world solved this long ago with mutating ingress controllers to be able to monkey-patch jobspecs on the way in, and while I recognize the good arguments the Nomad team has made in the past against user-hosted ingress controllers, I can't deny that that converts operations teams into the very same mutating controller resources.

As to having to update jobspecs to make use of the new features, I remember the 0.12 upgrade cycle far too well when I spent about a week trying to figure out why none of my network config worked as I understood it to at the time. I'm really starting to wonder if the answer here is to just not use any of the builtin networking at all, to always stand up a CNI network that I own, and then put everything there. That seems to be the supported mechanism for managing a stable experience for downstream Nomad consumers, would you agree?

brotherdust commented 1 year ago

Edit: added mention of Fermyon-authored Cilium integration with Nomad.

@lgfa29 , thank you for your thoughtful and detailed response. I'm sure it took some time out of your regular activities and I can appreciate it!

I agree with you 100% that Nomad needs better CNI integration and much better IPv6 support.

That's unfortunate but definitely understandable given where we are right now. Anything specific you could share to help us improve?

I need some time to gather my thoughts into something more cogent. I'll get back to you soon.

Ok. Thoughts gathered! First, I want to qualify what I'm describing with the fact that I am, first and foremost, a network engineer. This isn't to say that I have expert opinions in this context, but to indicate that I might have a different set of tools in my bag than a software engineer or developer; therefore, there's a danger that I'm approaching this problem from the wrong perspective and I'm more than willing to hear advice on how to think about this differently.

The goals enumerated below are enumerated for a reason: we'll be using them for reference later on.

1. Hardware Setup

3-node bare-metal cluster, AMD EPYC 7313P 16C CPU, 128GB RAM, SAS-backed SSD storage
Each node has a bonded 4x25Gbps connection to the ToR switch

2. Design Goals

2.1 General

Zero-trust principles shall be applied wherever possible and feasible. If it cannot be applied, justification shall be documented in ADR.

2.2 Workload Characteristics

2.2.1 Types

Mostly containers (preferably rootless), with a sprinkling of VMs
If it's a VM, use firecracker or something like it

2.2.2 Primary Use-Cases

IPFIX flow data processing, storage, and search
Private PKI for device trust (SCEP) and services
Systems/network monitoring, telemetry collection and analysis

2.3 Security

I have complete control of the network, so inter-node transport encryption is not required. In fact, it may be a detriment to performance and should be avoided if possible. HOWEVER:
Keeping the authentication component of mTLS is desirable to prevent unauthorized or unwanted traffic
Cluster ingress/egress shall be secured with TLS where possible. Where it's not possible, IP whitelisting will be used.
User authentication will be provided by Azure AD, 2FA required
Service authentication may also be provided by Azure AD; tokens shall be issued from Vault
Role-based access controls and principle of least privilege will be strictly enforced
Vault will be automatically unsealed from Azure KMS

2.4 PKI

Offline CA root is HSM backed and physically secured
Online intermediate CAs shall be used for issuing certificates or as a backing for an RA
Intended use of certificates (at present) shall be as-follows:
1. SCEP registration authority for network devices (requires a dedicated non-EC intermediate CA!)
2. TLS for cluster ingress
3. mTLS for inter-node communications (encryption not required, just the authentication component if possible)

2.5 Networking

L3-only, IPv6-only using public addressing shall be preferred
Nomad groups shall be allocated a fixed, cluster-wide IPv6 address during their lifecycle, even if it migrates to another node
Nomad group addresses shall be advertised to the network using a BGP session with the ToR switch
Load balancing, if needed, shall be handled primarily by ECMP function on the ToR switch. If more control is required, a software LB shall be spun up as nomad.job.type = service

2.6 Storage

Hyper-converged storage with options to control how data is replicated, for performance use-cases where the data need not be replicated and only stored on the node where Nomad task lives

3. How It Went Down

I set off finding the pieces that would fit. It eventually came down to k8s and Hashistack. I selected Hashistack because it's basically the opposite of k8s. I'll skip my usual extended diatribe about k8s and just say the k8s is very... opinionated... and is the ideal solution for boiling the ocean, should one so desire.

Pain Points

In a general sense, the most difficult parts of the evaluation comes down to one thing: where Hashistack doesn't cover the use-case, a third-party component must be integrated. Or, if it does cover the use-case, the docs are confusing or incomplete.

CNI

To the detriment of all, all the cool kids build service-mesh CNIs for k8s. They use k8s APIs, CRDs and such; things that Nomad (and Consul, indirectly) do not understand; and, frankly, shouldn't. Nomad has CNI support, but it's very basic in the sense that it cannot be programmatically or natively configured via Nomad jobspec. It seems there is some template functionality I wasn't aware of, as indicated by some of the content of this thread, so I'll have to revisit that.

I very much agree with @lgfa29 that probably the best outcome is just to integrate Cilium as part of Nomad. That creates its own burden on the Hashicorp, so I'm not sure if they're going to be willing to do that. In this instance, I am happy to volunteer some time to maintain the integration once it is completed.

Which brings me to a related note: I saw a HashiConf talk by Taylor Thomas from Fermyon. In it he describes a full-featured Cilium integration with Nomad they are planning on open sourcing. It hasn't happened yet due to time constraints, so I reached out to them to see what the timeline is and if they would like some help. Hopefully I or someone more qualified (which is pretty much anyone) can get the ball rolling on that. If anyone wants me to keep them up to date on this item, let me know.

PKI

I realize this seems somewhat off-subject, but it is somewhat related.

This article covers some of the issues I experience, which I'll quote from here:

What does it REALLY takes to operate a whole hashistack in order to support the tiny strawberry atop the cake, namely nomad?

First of all, vault, which manages the secrets. To run vault in a highly available fashion, you would either need to provide it with a distributed database (which is another layer of complexity), or use the so called: integrated storage, which, needless to say is based on raft1. Then, you have to prepare an self signed CA1 in order to establish the root of trust, not to mention the complexity of unsealing the cluster on every restart manually (without the help of cloud KMS).

The next is consul, that provides service discovery. Consul models the connectivity between nodes into two categories, lan and wan, and each lan is a consul datacenter. Consul datacenters federate over the wan to form a logical cluster. However, data is not replicated across datacenters, it's only stores in respective datacenters (with raft2) and requests destined for other datacenters are simply forwarded (requiring full connecitity across all consul servers). For the clustering part, a gossip protocol is used, formaing a lan gossip ring1 per datacenter, and a wan gossip ring2 per cluster. In order to encrypt connections between consul servers, we need a PSK1 for the gossip protocol, and another CA2 for rpc and http api. Although the PSK and the CA can be managed by vault, there is no integration provided, you have to template files out of the secrets, and manage all rotations by yourself. And, if you wanna use the consul connect feature (a.k.a. service mesh), another CA3 is required.

Finally, we've got to nomad. Luckily, nomad claims to HAVE consul integration, and can automatically bootstrap itself given a consul cluster is beneath it. You would expect (as I do) that nomad can rely on consul for interconnection and cluster membership, but the reality is a bloody NO. The so called integration provides nothing more than saving you typing a seed node for cluster bootstrap, and serves no purpose beyond that. Which means, you still have to run a gossip ring3 per nomad region (which is like a consul datacenter) and another gossip ring4 for cross region federation. And, nomad also stores its state in per region raft3 clusters. To secure nomad clusters, another PSK2 and CA4 is needed.

Let's recap what we have now, given that we run a single vault cluster and 2 nomad regions, each containing 2 consul datacenters: 2 PSKs, 4 CAs, 7 raft clusters, 8 gossip rings. And all the cluster states are scattered across dozens of services, making the backup and recovery process a pain in the ass.

So, besides experiencing exactly what the author mentioned, I can add: if you want to integrate any of these components with an existing enterprise CA, beware that, for example:

The bare minimum set of SAN's required for TLS and mTLS to function, what is recommended, and what should be avoided are poorly or not documented at all.
The specific algorithms supported by a given Hashistack component are not documented. Had to learn the hard way that ed25519 isn't supported (or wasn't when I was doing eval)
Debugging TLS issues in Hashistack is nearly impossible

I think what happened is that the developers assumed that we'd want to use the self-signed CA that came with each component and nothing else. So, they weren't expecting a particular kind of error, or didn't see the need to comprehensively document what a certificate should look like. For lab purposes, this is acceptable. When one is trying to set up a production cluster, it's pretty rough.

On a final note, I seriously appreciate that this is open source software and that I am more than welcome to provide a PR. I even thought about justifying an enterprise license. But, in this particular case, a PR wouldn't be enough to address the architectural decisions that lead to where we are now; and, based on my experience with enterprise support contracts, would probably never be addressed unless there were some serious money on the table. I get it, I do. My expectations are low; but I thought it was at least worth the time to write all this out so that you would benefit from my experience.

Thanks again! Seriously great software!

pruiz commented 1 year ago

Hi @lgfa29,

First, thnks for the thoughtful response, I'll try to answer some points I think relevant below.. ;)

Hi everyone 👋

After a more thorough look into this I want to share what I have observed so far and expand on the direction we're planning to take for Nomad's networking story.

The main question I'm trying to answer is:

Does this proposal provide new functionality or is it a way to workaround shortcomings of the Nomad CNI implementation?

From my investigation so far I have not been able to find examples where a custom CNI configuration would not be able to accomplish the same results as a the proposed cni_bridge_config_template. That being said, I have probably missed several scenarios so I am very curious to hear more examples and use cases that I may have missed.

I think the main deviation from your tested scenarios and the one I have in mind is that I want a single task (within a given allocation) should be able to use both Consul Connect and Cilium's networking. So the job would declare a single network stanza inherited by any tasks on it (which can be just a single one), and then that would work like:

When the (dockerized) process tries to connect to localhost (ie. to a port where envoy is listening) that would work (cilium will just allow this traffic), thus the process will be able to use Connect.
When the process tries to connect anywhere else (another host on local lan or internet), the traffic will flow thru the default route towards the bridge, and there Cilium will be the one taking the spot. That maybe just filtering/allowing the traffic, tunneling it (towards other cluster node), or applying NAT (or letting nomad's own NAT setting do their own)
In the opposite direction, traffic coming in from within a cilium tunnel will be filtered/controled by Cilium. And specifically traffic coming to envoy [from other cluster node] will be allowed by default, so Consul Connect works on both directions.
Then when it comes to addressing (so we can actually have tuneling), there are two options:
- Just let the operator to set non-overlapping network prefixes for each host at nomad's configuration file (and let nomad's bridge handle addressing)
- Use CNI chaining so the nomad bridge's uses Cilium's own IPAM. (Which maybe more flexible/needed for some deployments)

This is the kind of integration between Consul Connect & Cilium I want to achieve.

[...]

From @pruiz.

[...]

one can not mix connect with custom-CNIs

Right, and the plan is to address this in #8953. It may be that removing the validation is enough. Having more people test the custom binary I provided there would be very helpful.

That would be an option for me, but given that we can use Connect on a custom CNI network, hopefully delegating to nomad's deployment/management of envoy proxy stuff.

And, at the same time, there is no solution for having more than "one networking" (ie. CNI plus bridge) for a single Task

That's also true, but also not covered by this proposal? As far as I know, Kubernetes also suffers from the same issue and there are meta-plugins to multiplex different networks, like Multus. I have this in my list above to be created as a follow-up issue.

Yeah, I know, kubernetes is similar here, but my point was that support for more than one networking, could be another way around for this.. just provide my tasks with one network 'connecting' to nomad's bridge, and another one connecting to cilium. :)

nor there is a clear solution for mixing jobs using Consul Connect and jobs using CNI.

Yup, that's covered in #8953. One thing to clarify is what do you mean by "mixing jobs". Do you envision an alloc that uses Consul Connect to be able to reach an alloc on Cilium for example? If that's the case I'm not sure if it would work without a gateway 🤔

This is what I explained at the top: I think we could make Connect and Cilium work ontop of the same bridge.. and have both working together side by side.

I understand hashicorp needs a product that can be supported with some clear use case and limits. But at the same time we as community need some extensibility for use cases not needing the be covered by commercial hashicorp support options. That's why the idea of this being a setting for extending the standard nomad feature made sense to me. HashiCorp could simply label this as 'community supported-only' or something like that and focus on enhancing consul connect, but at the same time let the community work around until something better arrives.

This is the void we expect CNI to fill by allowing users to create their own custom networks that fits their specific needs. This specific item is not about commercial support but feature support in general. We try to be careful about backwards compatibility and this would introduce a feature we expect to deprecate. I understand the frustration but, historically, we treat code shipped as code being used. For experimentations a temporary fork may be the best approach.

As stated I was willing to provide a PR for this new feature, but right now, I feel a bit stranded, as I don't really understand why not supporting a use case which on nomad code-base only implies being able to extend the CNI config, and which can be declared 'community supported' if that's a problem for hashicorp's business.

This is not a business decision, and I apologize if I made it sound like one. This was a technical decision as we found that arbitrary modifications to the default bridge network could be dangerous as it can break things in very subtle ways and the Nomad bridge has a predictable behaviour that we often rely on to debug issues.

No bad feelings ;), I understood your point. Just wished we could find a iterim solution for the current limitations of Connect.

Regards Pablo

lgfa29 commented 1 year ago

@the-maldridge

The kubernetes world solved this long ago with mutating ingress controllers to be able to monkey-patch jobspecs on the way in, and while I recognize the good arguments the Nomad team has made in the past against user-hosted ingress controllers, I can't deny that that converts operations teams into the very same mutating controller resources.

I've heard some people mentioning an approach like this before (for example, here is Seatgeek speaking at HashiConf 2022), but I'm not sure if there's been any final decision on this by the team.

I'm really starting to wonder if the answer here is to just not use any of the builtin networking at all, to always stand up a CNI network that I own, and then put everything there. That seems to be the supported mechanism for managing a stable experience for downstream Nomad consumers, would you agree?

That's the direction we're going. The built-in networks should be enough for most users and a custom CNI should be used by those that need more customization. The problem right now (in addition to the CNI issues mentioned previously) is that there's a big gap between the two. We need to figure out a way to make CNI adoption more seamless.

@brotherdust thanks for the detail report of your experience!

To the detriment of all, all the cool kids build service-mesh CNIs for k8s. They use k8s APIs, CRDs and such; things that Nomad (and Consul, indirectly) do not understand; and, frankly, shouldn't.

Yup, that's the part about partnerships I mentioned in my previous comment. But those can take some time to be established. The work that @pruiz has done in Cilium is huge for this!

Nomad has CNI support, but it's very basic in the sense that it cannot be programmatically or natively configured via Nomad jobspec. It seems there is some template functionality I wasn't aware of, as indicated by some of the content of this thread, so I'll have to revisit that.

Could you expand a little on this? What kind of dynamically values would you like to set and where?

I very much agree with @lgfa29 that probably the best outcome is just to integrate Cilium as part of Nomad.

Maybe I misspoke, but I don't expect any vendor specific code in Nomad at this point. The problem I mentioned is that, in theory, the CNI spec is orchestrator agnostic but in practice a lot of plugins have components that rely on Kubernetes APIs and, unfortunately, there is not much we can do about it.

I am happy to volunteer some time to maintain the integration once it is completed.

And that's another important avenue as well. These types of integration are usually better maintained by people that actually use them, which is not our case. Everything I know about Cilium at this point was what I learned from community in #12120 🙂

I think what happened is that the developers assumed that we'd want to use the self-signed CA that came with each component and nothing else. So, they weren't expecting a particular kind of error, or didn't see the need to comprehensively document what a certificate should look like. For lab purposes, this is acceptable. When one is trying to set up a production cluster, it's pretty rough.

I would suggest opening a separate issue for this (if one doesn't exist yet).

But, in this particular case, a PR wouldn't be enough to address the architectural decisions that lead to where we are now

You're right, this will be a big effort that will require multiple PRs, but my is to break it down into smaller issues (some of them listed in my previous comment already) so maybe there will be something smaller that you can contribute 🙂

Things like documentation, blog posts, demos etc. are also extremely valuable to contribute.

Thanks again! Seriously great software! ❤️

@pruiz

I think the main deviation from your tested scenarios and the one I have in mind is that I want a single task (within a given allocation) should be able to use both Consul Connect and Cilium's networking.

Yup, I got that. But I want to make sure we're on the same as to why I closed this issue. So imagine the feature requested here were implemented, which cni_bridge_config_template would you write to accomplish what you're looking for? And what is preventing you from using a separate CNI network for this?

From what I gathered so far the only things preventing you from doing what you want are shortcomings in our CNI implementation. If that's not the case I would like to hear what cni_bridge_config_template can do that a custom CNI would not be able to.

That would be an option for me, but given that we can use Connect on a custom CNI network, hopefully delegating to nomad's deployment/management of envoy proxy stuff.

Yes, the sidecar deployment is conditional on service.connect not the network type.

I would appreciate if you could test the binary I have linked in https://github.com/hashicorp/nomad/issues/8953#issuecomment-1411344922 to see if it works for you.

Yeah, I know, kubernetes is similar here, but my point was that support for more than one networking, could be another way around for this.. just provide my tasks with one network 'connecting' to nomad's bridge, and another one connecting to cilium. :)

Yup, I have this on my list and I will open a new issue about multiple network interfaces per alloc 👍

lgfa29 commented 1 year ago

Hi all 👋

I just wanted to note that, as mentioned previously, I've created follow-up issues on specific areas that must be improved. You can find them linked above. Feel free to 👍, add more comments there, or create new issues if I missed anything.

Thanks!

brotherdust commented 1 year ago

@lgfa29 , thanks much!

hashicorp / nomad

Mechanism for editing the nomad0 CNI template #13824

Proposal

Use-cases

Attempted Solutions

1. Hardware Setup

2. Design Goals

2.1 General

2.2 Workload Characteristics

2.2.1 Types

2.2.2 Primary Use-Cases

2.3 Security

2.4 PKI

2.5 Networking

2.6 Storage

3. How It Went Down

Pain Points

CNI

PKI