projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/

Policy log action impacting network latency #9185

Closed. lispyclouds closed this issue 1 month ago.

lispyclouds commented 1 month ago

We are big users of Calico at my workplace, which deploys it at a pretty large scale. We deploy compliance and security rules into all of our clusters, and here is a summary of what we are seeing. This is impacting our latency-sensitive production environments; we are working on it ourselves, but I'm asking here in case anyone sees something obvious that we missed (we are a bit new to Calico internals). We use both 3.25 (in prod) and 3.28 (in dev clusters).

Expected Behaviour

The performance of our network calls should be consistent at all times, given all the policies we have, especially DNS performance.

Current Behaviour

We see the following:

Possible Solution

Two things seem to help as of now:

Steps to Reproduce (for bugs)

Context

As described above. We are happy to provide more details and are hoping for some direction.

Your Environment

lispyclouds commented 1 month ago

Update:

We managed to narrow this down further:

Based on this, we can reword the ask in the following way:

viksesh commented 1 month ago

I have noticed this too in our enterprise implementation of the Calico CNI and policy design. A few more questions for community experts:

  1. How can we further debug to narrow down potential inefficient calico policies?
  2. What sort of latency should be expected with calico policies?
fasaxc commented 1 month ago

How many policies do you have? How many rules in each policy? Are you auto-generating policy from some external model? That's easy to get wrong and can produce a lot of inefficient policies.

How many iptables rules do you have?

iptables-save | wc -l

The rule-of-thumb latency of a single iptables rule is 500ns, so either you have around 200k rules applied to each pod (which is about 100x too many), or there's a kernel issue slowing things down: either a kernel bug or some secondary (BPF?) monitoring tool that's getting overwhelmed.
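If it helps to break that count down, something along these lines (standard iptables tooling; Calico-managed chains use the cali- prefix) should show where the rules are concentrated:

```bash
# Total rule count across all tables
sudo iptables-save | wc -l

# Rules in Calico-managed chains only (Calico chains are prefixed with "cali-")
sudo iptables-save | grep -c -e '-A cali-'

# Top chains by rule count, to spot a single bloated policy chain
sudo iptables-save | awk '/^-A/ {print $2}' | sort | uniq -c | sort -rn | head -20
```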

The general rule for efficient policy is to apply each policy to as few workloads as possible. Use spec.selector to limit where the policy applies, instead of using a source and dest selector in the same policy rule.
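As a rough sketch of that advice (all names, labels, and ports below are hypothetical), a tightly scoped policy would look something like:

```yaml
# Hypothetical example: scope the policy with spec.selector so it is only
# rendered for the workloads that actually need it, rather than matching
# everything and filtering purely inside the rules.
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api   # hypothetical name
  namespace: shop               # hypothetical namespace
spec:
  # Rules are only programmed for pods matching this selector...
  selector: app == 'api'
  types:
    - Ingress
  ingress:
    # ...so the rule itself only needs to match the source.
    - action: Allow
      protocol: TCP
      source:
        selector: app == 'frontend'
      destination:
        ports:
          - 8080
```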

fasaxc commented 1 month ago

@viksesh are you also on RH8? Please can you share your kernel versions?

lispyclouds commented 1 month ago

@fasaxc Here are some more updates from a further round of investigation and testing; I think we have zeroed in on the issue:

Are there any suggestions on how to get the logging to happen with the expected performance? It feels like this could be a solved problem already.

tomastigera commented 1 month ago

You are right that swapping the Log and Allow actions would not work.

After a Log action, processing continues with the next rule; Allow and Deny are immediate and final, and no further rules are processed.
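To make the ordering concrete, a minimal rules fragment (the selectors here are made up for illustration) would be:

```yaml
# Hypothetical fragment: Log does not terminate rule processing, so the
# packet is logged and then matched by the Allow rule below.
ingress:
  - action: Log
    source:
      selector: app == 'client'
  - action: Allow
    source:
      selector: app == 'client'
# If the Allow rule came first, it would accept the packet immediately
# and the Log rule would never be reached.
```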

It is expected that excessive logging of traffic with a high frequency of new connections impacts performance. The question is what you are trying to achieve with this logging; I doubt it is just for debugging. Is there some other approach that would get you the same information more cheaply? For instance, you can scan conntrack to see allowed connections. Could you reverse the rules to log only traffic that would be denied? Could you tweak syslog?
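For the conntrack idea, a rough sketch using the standard conntrack-tools CLI (exact flags and output vary by distro; the pod IP is hypothetical):

```bash
# One-off dump of tracked (i.e. accepted) connections for a given pod IP
sudo conntrack -L --src 10.0.0.12

# Or stream new connection events as they are created
sudo conntrack -E -e NEW
```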

If logrus were the issue, it likely would not impact the traffic directly; it could impact CPU usage. Latency is more likely impacted by the LOG actions in the iptables rules and by emitting these logs to syslog or some other sink.

fasaxc commented 1 month ago

The Log action is intended as a basic diagnostic tool for debugging your policy; you're not supposed to log every flow using that mechanism. In iptables mode, the Log action maps to an iptables LOG action, which goes directly to syslog/the kernel log without any involvement of the Calico daemon (so it doesn't go through logrus).
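A quick way to confirm that on a node (standard Linux tooling, nothing Calico-specific):

```bash
# Show the iptables LOG rules generated by the policy's Log action
sudo iptables-save | grep -e '-j LOG'

# Watch them land in the kernel log / syslog
sudo dmesg --follow
# or: sudo journalctl -k -f
```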

It sounds like what you're really looking for is something like Calico Enterprise/Calico Cloud, which has rich observability and flow logging capabilities (with high performance). A key part of that is being able to aggregate flows to show the big picture rather than giving a firehose of logs.

lispyclouds commented 1 month ago

Thanks a lot for all the valuable info, @fasaxc and @tomastigera; we will be following your recommendations.