projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0
6.01k stars 1.34k forks source link

Calico unconditionally drops VXLAN packets that are not related to Calico #6752

Open yoheiueda opened 2 years ago

yoheiueda commented 2 years ago

I set up a Kubernetes cluster using Calico with the following configuration.

I also configured a VXLAN tunnel with the same port number but a different VXLAN ID.

Then, Calico drops all packets for the second VXLAN tunnel, even though I am using a different VXLAN ID for it.

Expected Behavior

Calico does not interfere packets for the second VXLAN tunnel.

Current Behavior

Calico drops all packets for the second VXLAN tunnel, since Calico adds the following iptables rule.

-A cali-INPUT -p udp -m comment --comment "cali:c3JpWekSTKWcdGH0" -m comment --comment "Drop IPv4 VXLAN packets from non-whitelisted hosts" -m multiport --dports 4789 -m addrtype --dst-type LOCAL -j DROP

This rule is generated here. https://github.com/projectcalico/calico/blob/c041eec3ea3714c3a92b9f992f55af1e42a4842e/felix/rules/static.go#L243-L249

Possible Solution

Revert https://github.com/projectcalico/calico/commit/cce544613169bd84a8a06f5536abd65cdc3c7ab8 if possible. The commit was introduced to fix errors that occur when the u32 extension is used with iptables-nft.

According to this comment https://github.com/moby/moby/issues/43382#issuecomment-1143240624, iptables-nft also supports the u32 extension, when the xt_u32 kernel module is loaded. This kernel module is available by Default in Ubuntu, and available in the kernel-modules-extra package of RHEL.

If revering the commit is not possible, providing it as an optional feature is another possible solution.

Steps to Reproduce (for bugs)

  1. Set up Kubernetes with Calico as described above
  2. Establish a VXLAN tunnel using the ip command at a worker node
  3. Send PING packets from the other end of the VXLAN tunnel

Context

Your Environment

mgleung commented 2 years ago

Hey @yoheiueda , I'm trying to understand your use case a bit better here, but why do you need a second VXLAN tunnel? I don't think it's a use case we normally see so trying to get a better grasp on your issue.

yoheiueda commented 2 years ago

@mgleung

I am working on the development of a new mechanism called Peer Pod VMs. With this mechanism, a new VM instance is created per a pod, and a VXLAN tunnel is established between the worker node VM and the new VM instance for the pod.

Please see this architecture diagram. https://github.com/confidential-containers/cloud-api-adaptor#architecture

lwr20 commented 1 year ago

This behaviour was a security fix - one way you might try to inject packets into a (vxlan) pod network would be to send vxlan packets direct to a node, so I don't think we can back it out completely.

However, I guess it might be possible to make the rule more specific - only dropping vxlan packets that match the VNI that Calico's using?

A possible option without a code change - as the rule says, it drops packets from host IPs it doesn't know about. It may be possible to add your IPs to the list? Using the externalNodesCIDRList in felixconfig

lwr20 commented 1 year ago

It wasn't clear to me from your diagram if the VXLAN packets were coming from a pod or not. If so, you might use allowVXLANPacketsFromWorkloads (in FelixConfig) to permit that

yoheiueda commented 1 year ago

@lwr20 Thank you very much for the info. I'll try externalNodesCIDRList as a work around of this issue.

This behaviour was a security fix - one way you might try to inject packets into a (vxlan) pod network would be to send vxlan packets direct to a node, so I don't think we can back it out completely.

Yes, I understand the purpose of the feature. I am saying that the commit https://github.com/projectcalico/calico/commit/cce544613169bd84a8a06f5536abd65cdc3c7ab8 I mentioned previously was not a security fix or a security enhancement.

Each VXLAN tunnel has a VXLAN ID (VNI), so we can distinguish multiple VXLAN tunnels in a single system. Calico uses 4096 for VXLAN ID, so VXLAN tunnels other than 4096 are not relevant to Calico.

The original Calico implementation respected VXLAN ID, and did not interfere VXLAN packets that are not relevant to Calico. The commit https://github.com/projectcalico/calico/commit/cce544613169bd84a8a06f5536abd65cdc3c7ab8 has removed the condition that checks VXLAN IDs, and unconditionally drops VXLAN packets that are not relevant to Calico.

I think this behavior is overkill, and should be fixed not to interfere VXLAN packets that are not relevant to Calico.

It wasn't clear to me from your diagram if the VXLAN packets were coming from a pod or not. If so, you might use allowVXLANPacketsFromWorkloads (in FelixConfig) to permit that

From the viewpoint from Calico, VXLAN packets are not coming from a pod. VXLAN tunnels used in my project is not relevant to Calico network, so Calico should not interfere them.

lwr20 commented 1 year ago

Calico uses 4096 for VXLAN ID

Doesn't change any of your points, but just to note that this is configurable.

caseydavenport commented 1 year ago

Yes, I think we should do better here. The linked commit wasn't a security fix, although the dropping of VXLAN packets from unknown hosts is intended for security reasons. The linked commit was to fix an issue where we were using an iptables match criteria that was causing crashes on a number of systems.

We should reinstate it behind feature auto-detection and/or a configuration option in order to enable use-cases like this one with multiple VXLAN networks.

I believe Felix already has a featuregate / auto-detection mechanism we might be able to leverage for this.

corhere commented 1 year ago

I've been working on a fix for https://github.com/moby/moby/issues/43382 and noticed the mention link back to here. RedHat decided to deprecate the xt_u32 kernel module in RHEL 8 and remove it in RHEL 9 as a way to nudge users towards migrating to native nftables. This is an arbitrary limitation imposed by RedHat; iptables-nft is compatible with all legacy (read: xtables) packet matches which are available for the running kernel. (Though to be honest, I can't entirely fault them. The u32 expression language is not the most pleasant to work with and is lacking a facility to locate data relative to the start of the L4 header on IPv6 packets containing extension headers, whereas nftables supports it natively using @th raw payload expressions.) Notably, RedHat has not deprecated the xt_bpf match as far as I can tell. So I wrote a drop-in replacement for Moby's VXLAN VNI match expression using a BPF filter program. Since the u32 match expression Calico used to use is effectively identical to Moby's, you may be able to leverage the work I've done to resolve this issue.

lukasstockner commented 1 year ago

I'd like to suggest that this is reclassified as a bug, not an enhancement - I just spent several hours tracking down why our VXLAN setup (separate from Calico) is broken in one deployment, and it turns out that that deployment has a newer Calico version which now drops all the packets.

To me, this seems like a clear regression, even though it might not affect most typical setups.

caseydavenport commented 1 year ago

Yep. I think that is fair. Reclassified.

caseydavenport commented 1 year ago

If anyone wants to take a stab at reverting that commit. and adding a feature toggle for whether or not the offending match is used, I'd be very happy to review.

cyclinder commented 1 year ago

I'm interested in this. In a summary, I think these are two ways:

Which one is better? I prefer to option 2, HDYT?

yoheiueda commented 1 year ago

Just an FYI, @corhere's fix has been merged https://github.com/moby/moby/commit/105b9834fbd24e687a5b5da14af65bb7cb6f016a, and it looks like xt_u32 is tried first, and then xt_bpf is used as a fallback.

corhere commented 1 year ago

@yoheiueda that's true, but only to avoid switching up required dependencies on unaffected systems in a security patch release. I am currently cooking up a moby/moby PR which drops the xt_u32 version and improves upon the xt_bpf match program. I'm hoping to also add support for IPv6.