networkop / meshnet-cni

a (K8s) CNI plugin to create arbitrary virtual network topologies
BSD 3-Clause "New" or "Revised" License

Emulate netlink Link Attributes #5

Open vparames86 opened 5 years ago

vparames86 commented 5 years ago

Currently we can't set link attributes such as queuing disciplines, which could be used to emulate link parameters like speed. Enabling these features would aid in creating a more realistic network lab emulation.

Cerebus commented 2 years ago

I've been looking into this, and it seems to me the best way is to add shaping support to koko, add shaping data to the Topology CRD and to the responses from meshnetd, and then pass it to each invocation of koko.MakeVeth() and koko.MakeVxLan(). Lastly, the CNI spec says that ADD commands are also used to modify networks, so we would need to update the qdiscs even when both local and peer already exist (currently this is a no-op branch in meshnet.go).
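To make that concrete, the shaping data hung off each Topology link could look something like the struct below. This is only a sketch: none of these field names or types exist in the CRD today, and they'd still have to be regenerated into the schema and plumbed through meshnetd and koko.

```go
// Hypothetical shaping parameters attached to a Topology link.
// Nothing here exists in the current CRD; field names are illustrative only.
package topologyv1

type LinkShaping struct {
	RateKbps   uint64  `json:"rate_kbps,omitempty"`   // egress rate limit
	BurstBytes uint32  `json:"burst_bytes,omitempty"` // burst for the rate limiter
	DelayUs    uint32  `json:"delay_us,omitempty"`    // netem latency
	LossPct    float32 `json:"loss_pct,omitempty"`    // netem random loss
	CorruptPct float32 `json:"corrupt_pct,omitempty"` // netem corruption
	DupePct    float32 `json:"dupe_pct,omitempty"`    // netem duplication
	ReorderPct float32 `json:"reorder_pct,omitempty"` // netem reordering
}
```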

The alternative is a separate controller daemonset that watches Topology resources, filters on status.src_ip, and adds a netem qdisc to each interface on Topology CREATE/PUT events. That would be quicker to implement but seems wasteful.

networkop commented 2 years ago

I think the first option makes sense. In fact, there's a standard CNI plugin now that implements bandwidth shaping, so it would be interesting to see whether its code can be reused as-is: https://github.com/containernetworking/plugins/blob/76307bf0f6929b39986f1c6f5e4b6abdb82b8815/plugins/meta/bandwidth/ifb_creator.go#L61

Cerebus commented 2 years ago

I hadn't thought about a discrete chained plugin for this, but I like it.

The extant bandwidth plugin only applies rate limits to the interface named by the CNI_IFNAME environment parameter, so it won't work as-is, but it should be simple enough to write a new one. So we add per-interface shaping data (rate, burst, delay, loss, corruption, duplication, and reordering) to Topology, record it in meshnetd, and then fetch and apply it by looping over the prevResult output on ADD and CHECK commands.
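Roughly what I'm picturing for the apply side, as a sketch only: it assumes the containernetworking ns helper and the CNI prevResult types, and leaves the meshnetd lookup and the actual qdisc setup as stubs.

```go
// Sketch of a chained plugin applying shaping on ADD/CHECK by looping over
// the interfaces reported in prevResult. lookupShaping and applyShaping are
// stubs standing in for the meshnetd query and the real qdisc setup.
package shaper

import (
	"github.com/containernetworking/cni/pkg/types/current"
	"github.com/containernetworking/plugins/pkg/ns"
)

// shapingSpec mirrors the hypothetical per-interface data in the Topology CRD.
type shapingSpec struct {
	RateKbps, DelayUs uint32
	LossPct           float32
}

// lookupShaping would query meshnetd for this interface's parameters (stub).
func lookupShaping(ifName string) (shapingSpec, bool) { return shapingSpec{}, false }

// applyShaping would install the tbf/netem qdiscs on the named link (stub).
func applyShaping(ifName string, s shapingSpec) error { return nil }

func applyToPrevResult(prev *current.Result) error {
	for _, iface := range prev.Interfaces {
		if iface.Sandbox == "" {
			continue // host-side veth end; shaping goes on the container side
		}
		s, ok := lookupShaping(iface.Name)
		if !ok {
			continue
		}
		// Enter the pod's network namespace and set up the qdisc there.
		if err := ns.WithNetNSPath(iface.Sandbox, func(_ ns.NetNS) error {
			return applyShaping(iface.Name, s)
		}); err != nil {
			return err
		}
	}
	return nil
}
```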

(ETA) It's not yet clear to me how or when the runtime decides to invoke CHECK, however. If it can't be reliably triggered, then there's a problem. Experimenting with the bandwidth plugin and the runtimeConfig annotations, it looks like nothing is modified after Pod creation.

(ETA again) It seems to happen eventually, but I'm still poring over the CNI spec to see whether it's defined or left up to the runtime. I deployed an iperf server and client and annotated an egress limit on the client after container start, and it didn't take effect right away. When I came back to it an hour later the limit was in effect.

networkop commented 2 years ago

yep, it looks like some of their functions, e.g. func CreateEgressQdisc, are generic and not tied to a specific interface, so we should be able to pass any interface name to it (I think).

Cerebus commented 2 years ago

Unfortunately that code only handles a TBF, so loss/corruption/delay/dupe/reorder will need new code.
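For the netem side, something like this with the vishvananda/netlink package should cover delay/loss/corrupt/dupe/reorder. It's only a sketch with placeholder values; rate limiting would still need a TBF (or netem's own rate option) on top.

```go
// Sketch: attach a netem qdisc covering delay/loss/corrupt/dupe/reorder
// to the device root, using github.com/vishvananda/netlink.
package shaper

import "github.com/vishvananda/netlink"

func addNetem(ifName string) error {
	link, err := netlink.LinkByName(ifName)
	if err != nil {
		return err
	}
	qdisc := netlink.NewNetem(
		netlink.QdiscAttrs{
			LinkIndex: link.Attrs().Index,
			Handle:    netlink.MakeHandle(1, 0),
			Parent:    netlink.HANDLE_ROOT,
		},
		netlink.NetemQdiscAttrs{
			Latency:     20000, // 20ms, in microseconds
			Jitter:      5000,  // 5ms
			Loss:        0.1,   // percent
			CorruptProb: 0.01,  // percent
			Duplicate:   0.01,  // percent
			ReorderProb: 0.5,   // percent
		},
	)
	// QdiscReplace keeps this idempotent across repeated ADD/CHECK invocations.
	return netlink.QdiscReplace(qdisc)
}
```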

networkop commented 2 years ago

yep, but it's a good place to start (or a good template to copy?)

Cerebus commented 2 years ago

OK, so dockershim's implementation of the bandwidth capabilities convention is just plain broken; it sets the rate properly but sets burst to MaxInt32. So depending on all kinds of variables, any given iperf run can spike to max throughput no matter what limit is set.

So the upshot: AFAICT kubelet only fires CNI plugins once. The CHECK command seems to be invoked ... nowhere I can find. Even then, the spec says CHECK must either do nothing or return an error, and I haven't figured out what happens if kubelet gets an error.

So that's not looking good for a chained plugin, b/c the behavior I'm after needs to be completely dynamic; I want to change qdiscs at runtime, not just at boot time.

networkop commented 2 years ago

in this case, what you mentioned above as option #2 (a daemonset list/watching Topology resources and changing qdiscs) seems like the only option. I'm not 100% sure how you'd be able to get a handle for a veth link from the root NS once it's been moved to a container.

what's your use case for doing this at runtime? the topology is still built at create time.

Cerebus commented 2 years ago

I'm building a general network emulation platform on k8s to support engineering studies at higher fidelity than discrete-event sims (like OPNET) and with lower resource requirements than VM-based emulators (like GNS3), with the ability to do real software-in-the-loop (and eventually hardware-in-the-loop). That means I need interactivity (which comes for free on k8s) and scripted events. I'm also looking at hooking up with Chaos Toolkit as another driver.

(ETA) I'm already doing tc in a sidecar using the downward API, but that depends on inotify, which has resource limits, and I'm looking for a more efficient solution.

networkop commented 2 years ago

right, gotcha. yeah, sidecars could be a viable option; networkservicemesh.io works this way. In fact, this would even make it possible to change the topology at runtime. but then you wouldn't need meshnet-cni at all.

Cerebus commented 2 years ago

Back to looking at this feature req. now that I have a multinode cluster working again.

I'm going to take a stab at adding this behavior to meshnetd; it should be straightforward to register a watch on Topologies, filter on status.src_ip, and then apply a simple canned tc-netem qdisc to the named local_intf in status.net_ns.

I've done similar filters in controllers I've written in Python, so it should work; because this runs as a daemonset filtered to its own node, there's no chance that instances will fight with one another. That will make the traffic shaping dynamic without interfering with how the plugin sets things up.

A first version will just set a rate/delay/loss/corrupt/dupe/reorder discipline on the device root. It would be interesting to be able to do arbitrary disciplines, but maybe that's better left to the workload. The caveat is that any workload that mucks with the qdisc will have to attach to this qdisc as a parent, unless I can figure out some way to be non-interfering.
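As a rough sketch of the watch side (client-go dynamic client; the Topology GVR here is from memory of the meshnet manifests, and the actual netns entry and tc call are left as a TODO):

```go
// Sketch of the meshnetd-side watch: list/watch Topology objects, keep only
// the ones whose status.src_ip matches this node, and (not shown) enter
// status.net_ns to apply the canned netem qdisc to the named local_intf.
package meshnetd

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func watchTopologies(ctx context.Context, nodeIP string) error {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		return err
	}
	// GVR is from memory; double-check against the installed CRD.
	gvr := schema.GroupVersionResource{Group: "networkop.co.uk", Version: "v1beta1", Resource: "topologies"}
	w, err := dyn.Resource(gvr).Namespace(metav1.NamespaceAll).Watch(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for ev := range w.ResultChan() {
		topo, ok := ev.Object.(*unstructured.Unstructured)
		if !ok {
			continue
		}
		srcIP, _, _ := unstructured.NestedString(topo.Object, "status", "src_ip")
		if srcIP != nodeIP {
			continue // another node's pod; ignore
		}
		netNS, _, _ := unstructured.NestedString(topo.Object, "status", "net_ns")
		log.Printf("would apply netem in %s for topology %s", netNS, topo.GetName())
		// TODO: enter netNS and apply the canned netem qdisc to local_intf.
	}
	return nil
}
```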

chrisy commented 11 months ago

@Cerebus I know this is an old thread, but it's still open. :) Were you able to make any progress on this feature?