Open yoheiueda opened 2 years ago
Hey @yoheiueda , I'm trying to understand your use case a bit better here, but why do you need a second VXLAN tunnel? I don't think it's a use case we normally see so trying to get a better grasp on your issue.
@mgleung
I am working on the development of a new mechanism called Peer Pod VMs. With this mechanism, a new VM instance is created per a pod, and a VXLAN tunnel is established between the worker node VM and the new VM instance for the pod.
Please see this architecture diagram. https://github.com/confidential-containers/cloud-api-adaptor#architecture
This behaviour was a security fix - one way you might try to inject packets into a (vxlan) pod network would be to send vxlan packets direct to a node, so I don't think we can back it out completely.
However, I guess it might be possible to make the rule more specific - only dropping vxlan packets that match the VNI that Calico's using?
A possible option without a code change - as the rule says, it drops packets from host IPs it doesn't know about. It may be possible to add your IPs to the list? Using the externalNodesCIDRList
in felixconfig
It wasn't clear to me from your diagram if the VXLAN packets were coming from a pod or not. If so, you might use allowVXLANPacketsFromWorkloads
(in FelixConfig) to permit that
@lwr20 Thank you very much for the info. I'll try externalNodesCIDRList
as a work around of this issue.
This behaviour was a security fix - one way you might try to inject packets into a (vxlan) pod network would be to send vxlan packets direct to a node, so I don't think we can back it out completely.
Yes, I understand the purpose of the feature. I am saying that the commit https://github.com/projectcalico/calico/commit/cce544613169bd84a8a06f5536abd65cdc3c7ab8 I mentioned previously was not a security fix or a security enhancement.
Each VXLAN tunnel has a VXLAN ID (VNI), so we can distinguish multiple VXLAN tunnels in a single system. Calico uses 4096 for VXLAN ID, so VXLAN tunnels other than 4096 are not relevant to Calico.
The original Calico implementation respected VXLAN ID, and did not interfere VXLAN packets that are not relevant to Calico. The commit https://github.com/projectcalico/calico/commit/cce544613169bd84a8a06f5536abd65cdc3c7ab8 has removed the condition that checks VXLAN IDs, and unconditionally drops VXLAN packets that are not relevant to Calico.
I think this behavior is overkill, and should be fixed not to interfere VXLAN packets that are not relevant to Calico.
It wasn't clear to me from your diagram if the VXLAN packets were coming from a pod or not. If so, you might use allowVXLANPacketsFromWorkloads (in FelixConfig) to permit that
From the viewpoint from Calico, VXLAN packets are not coming from a pod. VXLAN tunnels used in my project is not relevant to Calico network, so Calico should not interfere them.
Calico uses 4096 for VXLAN ID
Doesn't change any of your points, but just to note that this is configurable.
Yes, I think we should do better here. The linked commit wasn't a security fix, although the dropping of VXLAN packets from unknown hosts is intended for security reasons. The linked commit was to fix an issue where we were using an iptables match criteria that was causing crashes on a number of systems.
We should reinstate it behind feature auto-detection and/or a configuration option in order to enable use-cases like this one with multiple VXLAN networks.
I believe Felix already has a featuregate / auto-detection mechanism we might be able to leverage for this.
I've been working on a fix for https://github.com/moby/moby/issues/43382 and noticed the mention link back to here. RedHat decided to deprecate the xt_u32
kernel module in RHEL 8 and remove it in RHEL 9 as a way to nudge users towards migrating to native nftables
. This is an arbitrary limitation imposed by RedHat; iptables-nft
is compatible with all legacy (read: xtables) packet matches which are available for the running kernel. (Though to be honest, I can't entirely fault them. The u32 expression language is not the most pleasant to work with and is lacking a facility to locate data relative to the start of the L4 header on IPv6 packets containing extension headers, whereas nftables supports it natively using @th
raw payload expressions.) Notably, RedHat has not deprecated the xt_bpf
match as far as I can tell. So I wrote a drop-in replacement for Moby's VXLAN VNI match expression using a BPF filter program. Since the u32 match expression Calico used to use is effectively identical to Moby's, you may be able to leverage the work I've done to resolve this issue.
I'd like to suggest that this is reclassified as a bug, not an enhancement - I just spent several hours tracking down why our VXLAN setup (separate from Calico) is broken in one deployment, and it turns out that that deployment has a newer Calico version which now drops all the packets.
To me, this seems like a clear regression, even though it might not affect most typical setups.
Yep. I think that is fair. Reclassified.
If anyone wants to take a stab at reverting that commit. and adding a feature toggle for whether or not the offending match is used, I'd be very happy to review.
I'm interested in this. In a summary, I think these are two ways:
xt_u32
when xt_u32
module is exists in the hostxt_bpf
Which one is better? I prefer to option 2, HDYT?
Just an FYI, @corhere's fix has been merged https://github.com/moby/moby/commit/105b9834fbd24e687a5b5da14af65bb7cb6f016a, and it looks like xt_u32
is tried first, and then xt_bpf
is used as a fallback.
@yoheiueda that's true, but only to avoid switching up required dependencies on unaffected systems in a security patch release. I am currently cooking up a moby/moby PR which drops the xt_u32
version and improves upon the xt_bpf
match program. I'm hoping to also add support for IPv6.
I set up a Kubernetes cluster using Calico with the following configuration.
I also configured a VXLAN tunnel with the same port number but a different VXLAN ID.
Then, Calico drops all packets for the second VXLAN tunnel, even though I am using a different VXLAN ID for it.
Expected Behavior
Calico does not interfere packets for the second VXLAN tunnel.
Current Behavior
Calico drops all packets for the second VXLAN tunnel, since Calico adds the following iptables rule.
This rule is generated here. https://github.com/projectcalico/calico/blob/c041eec3ea3714c3a92b9f992f55af1e42a4842e/felix/rules/static.go#L243-L249
Possible Solution
Revert https://github.com/projectcalico/calico/commit/cce544613169bd84a8a06f5536abd65cdc3c7ab8 if possible. The commit was introduced to fix errors that occur when the
u32
extension is used withiptables-nft
.According to this comment https://github.com/moby/moby/issues/43382#issuecomment-1143240624,
iptables-nft
also supports theu32
extension, when thext_u32
kernel module is loaded. This kernel module is available by Default in Ubuntu, and available in thekernel-modules-extra
package of RHEL.If revering the commit is not possible, providing it as an optional feature is another possible solution.
Steps to Reproduce (for bugs)
ip
command at a worker nodeContext
Your Environment