projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/

Policy log action impacting network latency #9185

Closed. lispyclouds closed this issue 1 month ago.

lispyclouds commented 1 month ago

We are big users of Calico at my workplace, which deploys it at a pretty large scale. We deploy compliance and security rules into all of our clusters, and here is a summary of what we are seeing. This is impacting our latency-sensitive production environments; we are working on it ourselves, but I'm asking here in case anyone sees something obvious that we missed (we are a bit new to Calico internals). We use both 3.25 (in prod) and 3.28 (in dev clusters).

Expected Behaviour

The performance of our network calls should be consistent at all times, given all the policies we have, especially DNS performance.

Current Behaviour

We see the following:

Possible Solution

Two things seem to help as of now:

Steps to Reproduce (for bugs)

Context

As described above. We are happy to provide more details and are hoping for some direction.

Your Environment

lispyclouds commented 1 month ago

Update:

We managed to narrow this down further:

Based on this, we can reword the ask in the following way:

viksesh commented 1 month ago

I have noticed this too in our enterprise implementation of the Calico CNI and policy design. A few more questions for community experts:

  1. How can we further debug to narrow down potential inefficient calico policies?
  2. What sort of latency should be expected with calico policies?
fasaxc commented 1 month ago

How many policies do you have? How many rules in each policy? Are you auto-generating policy from some external model? That's easy to get wrong and can produce a lot of inefficient policies.

How many iptables rules do you have?

iptables-save | wc -l

The rule-of-thumb latency of a single iptables rule is 500ns, so either you have around 200k rules applied to each pod (which is about 100x too many), or there's a kernel issue slowing things down: either a kernel bug or some secondary (BPF?) monitoring tool that's getting overwhelmed.
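If it helps to break that count down, something along these lines (standard iptables tooling; Calico-managed chains use the cali- prefix) should show where the rules are concentrated:

```bash
# Total rule count across all tables
sudo iptables-save | wc -l

# Rules in Calico-managed chains only (Calico chains are prefixed with "cali-")
sudo iptables-save | grep -c -e '-A cali-'

# Top chains by rule count, to spot a single bloated policy chain
sudo iptables-save | awk '/^-A/ {print $2}' | sort | uniq -c | sort -rn | head -20
```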

The general rule for efficient policy is to apply each policy to as few workloads as possible. Use spec.selector to limit where the policy applies, instead of using a source and dest selector in the same policy rule.
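As a rough sketch of that advice (all names, labels, and ports below are hypothetical), a tightly scoped policy would look something like:

```yaml
# Hypothetical example: scope the policy with spec.selector so it is only
# rendered for the workloads that actually need it, rather than matching
# everything and filtering purely inside the rules.
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api   # hypothetical name
  namespace: shop               # hypothetical namespace
spec:
  # Rules are only programmed for pods matching this selector...
  selector: app == 'api'
  types:
    - Ingress
  ingress:
    # ...so the rule itself only needs to match the source.
    - action: Allow
      protocol: TCP
      source:
        selector: app == 'frontend'
      destination:
        ports:
          - 8080
```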

fasaxc commented 1 month ago

@viksesh are you also on RH8? Please can you share your kernel versions?

lispyclouds commented 1 month ago

@fasaxc Here are some more updates from a further round of investigation and testing; I think we have zeroed in on the issue:

Are there any suggestions on how to get the logging to happen with the expected performance? It feels like this could be a solved problem already.

tomastigera commented 1 month ago

You are right that swapping the Log and Allow actions would not work.

After a Log action, processing continues with the next rule; Allow and Deny are immediate and final, and no further rules are processed.
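To make the ordering concrete, a minimal rules fragment (the selectors here are made up for illustration) would be:

```yaml
# Hypothetical fragment: Log does not terminate rule processing, so the
# packet is logged and then matched by the Allow rule below.
ingress:
  - action: Log
    source:
      selector: app == 'client'
  - action: Allow
    source:
      selector: app == 'client'
# If the Allow rule came first, it would accept the packet immediately
# and the Log rule would never be reached.
```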

It is expected that excessive logging of traffic with a high frequency of new connections impacts performance. The question is what you are trying to achieve with this logging; I doubt it is just for debugging. Is there some other approach that would get you the same information more cheaply? For instance, you can scan conntrack to see allowed connections. Could you reverse the rules to log only traffic that would be denied? Could you tweak syslog?
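For the conntrack idea, a rough sketch using the standard conntrack-tools CLI (exact flags and output vary by distro; the pod IP is hypothetical):

```bash
# One-off dump of tracked (i.e. accepted) connections for a given pod IP
sudo conntrack -L --src 10.0.0.12

# Or stream new connection events as they are created
sudo conntrack -E -e NEW
```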

If logrus were the issue, it likely would not impact the traffic directly; it could impact CPU usage. Latency is more likely impacted by the LOG actions in the iptables rules and by emitting these logs to syslog or some other sink.

fasaxc commented 1 month ago

The Log action is intended as a basic diagnostic tool for debugging your policy; you're not supposed to log every flow using that mechanism. In iptables mode, the Log action maps to an iptables LOG action, which goes directly to syslog/the kernel log without any involvement of the Calico daemon (so it doesn't go through logrus).
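A quick way to confirm that on a node (standard Linux tooling, nothing Calico-specific):

```bash
# Show the iptables LOG rules generated by the policy's Log action
sudo iptables-save | grep -e '-j LOG'

# Watch them land in the kernel log / syslog
sudo dmesg --follow
# or: sudo journalctl -k -f
```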

It sounds like what you're really looking for is something like Calico Enterprise/Calico Cloud, which has rich observability and flow logging capabilities (with high performance). A key part of that is being able to aggregate flows to show the big picture rather than giving a firehose of logs.

lispyclouds commented 1 month ago

Thanks a lot for all the valuable info, @fasaxc and @tomastigera; we will be following your recommendations.