sirianni opened this issue 1 year ago
Disabling metadata extraction for k8s.deployment.name and k8s.deployment.uid significantly improves CPU utilization:
```yaml
extract:
  metadata:
    - k8s.namespace.name
    - k8s.pod.name
    - k8s.pod.uid
    - k8s.node.name
    - container.image.name
    - container.image.tag
```
@fatsheep9146, can you please work on this issue and see what allocations we can avoid?
@sirianni thanks for reporting!
> Disabling metadata extraction for k8s.deployment.name and k8s.deployment.uid significantly improves CPU utilization
Does it go back to the original (pre-upgrade) level?
Is there a way to get the (perhaps imperfect?) k8s.deployment.name attribute back? We don't really care about the deployment UID but would like to avoid all of this overhead of replicaset bookkeeping.
> Does it go back to the original (pre-upgrade) level?
Yes, it appears so.
> @fatsheep9146, can you please work on this issue and see what allocations we can avoid?
No problem, I will check about this. @dmitryax
@fatsheep9146 I think I know what's going on. Eric runs it on the agent with a node selector, so the pods informer has a limited set of pods to watch. The replicaset informer, on the other hand, cannot have a node selector, so we keep state for all the replicasets in the whole cluster, which can be pretty big.
One option is to get rid of the replicaset informer and populate the pod -> rs -> deployment data right in the Pod struct every time a pod is added or updated.
The replicaset informer was added due to the issue of , which wanted to add k8s.deployment.uid support. In that issue, we discussed how to produce the k8s.deployment.name.
So I think maybe we can do it this way.
@dmitryax WDYT?
I don't believe we have to use the replicaset informer. We don't need any dynamic data from it that can be changed without rolling the pod. So we need to watch for pods only, not replicasets. We can fetch replicaset and deployment data only when a pod gets changed or added. To avoid high-frequency calls to the k8s API for a pod in a crash loop, we can utilize an LRU cache for replicasets instead of the informer.
> I don't believe we have to use the replicaset informer. We don't need any dynamic data from it that can be changed without rolling the pod. So we need to watch for pods only, not replicasets. We can fetch replicaset and deployment data only when a pod gets changed or added. To avoid high-frequency calls to the k8s API for a pod in a crash loop, we can utilize an LRU cache for replicasets instead of the informer.
Ok, I got your point: you mean we don't need to cache all replicaset data through the informer, but should use an LRU cache for the replicasets whose pods we are already watching, and only update the cache when a new pod is added. I think when an existing pod is updated, it should not indicate that the replicaset is also updated, because if the replicaset is changed, it should always create new pods and remove the old ones.
@dmitryax
> Ok, I got your point: you mean we don't need to cache all replicaset data through the informer, but should use an LRU cache for the replicasets whose pods we are already watching, and only update the cache when a new pod is added.
Yes
> I think when an existing pod is updated, it should not indicate that the replicaset is also updated, because if the replicaset is changed, it should always create new pods and remove the old ones.
Not sure I understand this. We have a handler for a new pod, as well as a handler for a pod update. The first one would always be a cache miss for the replicaset cache, so we need to make another call to k8s api to fill it. The second one is likely always a cache hit.
Ok, I basically understand your idea, I will try to fix it this way.
Speaking of the LRU cache, do you know of any component in collector-contrib with a similar implementation that I can import?
@dmitryax
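For illustration, a minimal Go sketch of the LRU idea discussed above, assuming the commonly used github.com/hashicorp/golang-lru/v2 package (any LRU implementation would do); the rsCache/ownerForReplicaSet names are made up for this example:

```go
package kube

import (
	"context"

	lru "github.com/hashicorp/golang-lru/v2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// rsOwner is the small slice of ReplicaSet data the processor actually needs.
type rsOwner struct {
	deploymentName string
	deploymentUID  string
}

// rsCache resolves ReplicaSet -> Deployment ownership via an LRU cache,
// falling back to a direct API call on a miss.
type rsCache struct {
	client kubernetes.Interface
	cache  *lru.Cache[string, rsOwner] // key: "<namespace>/<replicaset-name>"
}

func newRSCache(client kubernetes.Interface, size int) (*rsCache, error) {
	c, err := lru.New[string, rsOwner](size)
	if err != nil {
		return nil, err
	}
	return &rsCache{client: client, cache: c}, nil
}

// ownerForReplicaSet is meant to be called from the pod add/update handlers.
// A newly added pod is typically a cache miss (one GET against the API
// server); later updates for pods of the same ReplicaSet are cache hits.
func (r *rsCache) ownerForReplicaSet(ctx context.Context, namespace, name string) (rsOwner, error) {
	key := namespace + "/" + name
	if owner, ok := r.cache.Get(key); ok {
		return owner, nil
	}
	rs, err := r.client.AppsV1().ReplicaSets(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return rsOwner{}, err
	}
	var owner rsOwner
	for _, ref := range rs.OwnerReferences {
		if ref.Kind == "Deployment" {
			owner = rsOwner{deploymentName: ref.Name, deploymentUID: string(ref.UID)}
			break
		}
	}
	r.cache.Add(key, owner)
	return owner, nil
}
```

Keyed by namespace/name, the cache stays bounded by the number of distinct ReplicaSets that own pods the agent actually watches, rather than by every ReplicaSet in the cluster.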
In our fork of k8sattributes, we've had success reducing memory usage of the informer stores themselves by using SetTransform on the larger informers. https://github.com/SumoLogic/sumologic-otel-collector/pull/593 made a very big difference for us in large clusters.
We've only done it for Pods in a gateway context, but in principle it could be done for every resource type we start informers for. The change is also very simple conceptually, as it's just getting rid of fields we don't use. Think it's worth pursuing @dmitryax if we're concerned about memory usage? I can open a new issue for it.
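For context, a rough sketch of what SetTransform on the pod informer can look like with client-go; the exact set of retained fields below is an assumption for illustration and may differ from what the linked SumoLogic PR keeps:

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// newTrimmedPodInformer installs a transform that strips fields the processor
// never reads, so only the trimmed object is kept in the informer store.
func newTrimmedPodInformer(client kubernetes.Interface) (cache.SharedIndexInformer, error) {
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	informer := factory.Core().V1().Pods().Informer()
	err := informer.SetTransform(func(obj interface{}) (interface{}, error) {
		pod, ok := obj.(*corev1.Pod)
		if !ok {
			return obj, fmt.Errorf("unexpected object type %T", obj)
		}
		// Keep only metadata and the handful of spec/status fields needed for
		// attribute extraction; managed fields, container specs, volumes, etc.
		// are dropped before the object lands in the cache.
		return &corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{
				Name:              pod.Name,
				Namespace:         pod.Namespace,
				UID:               pod.UID,
				Labels:            pod.Labels,
				Annotations:       pod.Annotations,
				OwnerReferences:   pod.OwnerReferences,
				CreationTimestamp: pod.CreationTimestamp,
			},
			Spec:   corev1.PodSpec{NodeName: pod.Spec.NodeName, HostNetwork: pod.Spec.HostNetwork},
			Status: corev1.PodStatus{PodIP: pod.Status.PodIP, Phase: pod.Status.Phase},
		}, nil
	})
	return informer, err
}
```

Because the transform runs before objects are stored, the savings apply both to the initial list and to every subsequent watch event.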
@swiatekm-sumo sounds like a good idea to reduce the size of the items fetched by the informer. It'd be great if you can upstream it. But we still need to solve this issue to avoid fetching all the replicasets in the agent mode.
@swiatekm-sumo is there any other reason you have this component forked?
> @swiatekm-sumo sounds like a good idea to reduce the size of the items fetched by the informer. It'd be great if you can upstream it. But we still need to solve this issue to avoid fetching all the replicasets in the agent mode.
My hope was to keep the ReplicaSet footprint low enough that we could just store all of them. Your proposal with the LRU cache and fetching ReplicaSets from the API Server directly sounds significantly more complicated than what we have now.
> @swiatekm-sumo is there any other reason you have this component forked?
It was originally forked back when the contrib processor only included Pod metadata, I think, though I wasn't here for it.
At this point, the only reasons are the memory usage and some owner metadata - I think only CronJob and Service names. I would very much like to upstream those and get rid of the fork.
> My hope was to keep the ReplicaSet footprint low enough that we could just store all of them. Your proposal with the LRU cache and fetching ReplicaSets from the API Server directly sounds significantly more complicated than what we have now.
We can try this first and see if it solves the high memory issue. cc @fatsheep9146
> At this point, the only reasons are the memory usage and some owner metadata - I think only CronJob and Service names. I would very much like to upstream those and get rid of the fork.
We have k8s.cronjob.name already. Not sure about service names, given that one pod can serve several services. I believe someone raised that issue; we can reconsider it.
> keep the ReplicaSet footprint low enough that we could just store all of them.

I agree with @dmitryax that if the Node (or NodeFromEnvVar) filter is specified then the processor should make a very strong attempt to only store metadata for pods related to that node. We run some very large Kubernetes clusters (600+ nodes) and it's very inefficient and unnecessary for a DaemonSet collector to store metadata for resources not on its local node.
No matter how small the footprint is, it's very desirable to have memory usage be O(# pods on node) instead of O(# pods on cluster), since the former is much more fixed.
Also, when running in agent / DaemonSet mode, it's critical for the collector's footprint to be as low as possible. Storing any unnecessary metadata should be avoided.
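For reference, the node-scoped pod watch the agent relies on amounts to a field selector on the informer, roughly as sketched below (the KUBE_NODE_NAME env var name is only an example of a downward-API injected node name). As noted above, ReplicaSets offer no equivalent node-scoped selector, which is why the replicaset informer ends up holding cluster-wide state.

```go
package main

import (
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// nodeScopedPodInformer watches only the pods scheduled on the local node, so
// the informer store is O(# pods on node) rather than O(# pods on cluster).
// The node name is assumed to be injected via the downward API (fieldRef: spec.nodeName).
func nodeScopedPodInformer(client kubernetes.Interface) cache.SharedIndexInformer {
	nodeName := os.Getenv("KUBE_NODE_NAME")
	factory := informers.NewSharedInformerFactoryWithOptions(
		client,
		10*time.Minute,
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.FieldSelector = "spec.nodeName=" + nodeName
		}),
	)
	return factory.Core().V1().Pods().Informer()
}
```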
This is a good point, and I agree that controlling resource consumption for DaemonSet Pods is very important. For the SetTransform optimization, this isn't an either/or thing - we can do it anyway for Pods, and reduce the memory footprint further even without having enabled any metadata requiring ReplicaSets.
I do want to point out though, that in this case (of the solution discussed in this issue) the agent and gateway requirements are somewhat opposed. The agent running as a DaemonSet wants to only store the metadata relevant to Pods on its Node, and is (presumably, as we haven't tested this) fine doing some synchronous requests to the API Server if that means avoiding having to store global cluster state. On the other hand, the gateway needs the whole cluster state, and is definitely not fine making a single synchronous request for each ReplicaSet in a busy cluster.
In an ideal world, we would use different approaches for these use cases, but that significantly increases the maintenance burden and complexity of the processor. So we might need to compromise one way or another.
Out of curiosity, if we implemented the SetTransform optimizations and achieved lower memory consumption than in 0.77.0, even while storing all ReplicaSets, would you be unhappy with that outcome?
@swiatekm-sumo, thanks for pointing this out. I agree differentiating the logic based on the collector's deployment topology is not an ideal solution. Let's apply your optimization both to pods and replica sets informers first and see what the outcome will be. @swiatekm-sumo, do you have a chance to submit a PR given that you did in your fork?
I also agree with @swiatekm-sumo; the SetTransform optimization is beneficial to both gateway mode and daemonset mode. I think the daemonset-mode optimization should build on top of the SetTransform optimization.
@dmitryax @sirianni
Yep, I can work on it. You want to assign this issue to me or should we track the SetTransform change separately?
@swiatekm-sumo thank you! I created https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/23226 to track it separately.
@sirianni the new version 0.80.0 is out. Can you please try it out and see what the RAM usage is in your cluster with k8s.deployment.name enabled?
> @sirianni the new version 0.80.0 is out. Can you please try it out and see what the RAM usage is in your cluster with k8s.deployment.name enabled?
Yes, we will give it a try this week.
ping @sirianni, have you tried this?
We are still seeing significantly elevated CPU usage after the upgrade due to memory allocations / GC churn.
I don't think it's quite as bad as before, but it's still elevated enough that we'll have to resort to a workaround of disabling the k8s.deployment.name attribute. Again, this is a daemonset pod that runs co-located with all of our actual applications, so CPU and memory are at a premium.
I realize the code complexity you are dealing with trying to balance "agent mode" vs. "gateway mode", but I would very much appreciate restoring the previous behavior that didn't have this issue. I'd rather have a slightly imperfect k8s.deployment.name attribute (https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/12936#issuecomment-1208944998) than this increase in resource utilization.
```
$ go tool pprof ./bin/collector http://localhost:1777/debug/pprof/heap
File: collector
Build ID: 79c5900955e9865a447786690c0e370565a2dd56
Type: inuse_space
Time: Jul 11, 2023 at 1:13pm (EDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top15 -cum
Showing nodes accounting for 11.51MB, 23.48% of 49.04MB total
Showing top 15 nodes out of 285
      flat  flat%   sum%        cum   cum%
         0     0%     0%    11.51MB 23.48%  k8s.io/client-go/tools/cache.(*DeltaFIFO).Resync
       1MB  2.04%  2.04%    11.51MB 23.48%  k8s.io/client-go/tools/cache.(*DeltaFIFO).queueActionLocked
         0     0%  2.04%    11.51MB 23.48%  k8s.io/client-go/tools/cache.(*DeltaFIFO).syncKeyLocked
         0     0%  2.04%    11.51MB 23.48%  k8s.io/client-go/tools/cache.(*Reflector).startResync
         0     0%  2.04%       11MB 22.44%  runtime.main
         0     0%  2.04%    10.51MB 21.44%  github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor/internal/kube.New.func2
   10.51MB 21.44% 23.48%    10.51MB 21.44%  github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor/internal/kube.removeUnnecessaryReplicaSetData (inline)
         0     0% 23.48%        6MB 12.24%  main.main
         0     0% 23.48%     5.53MB 11.29%  runtime.doInit
         0     0% 23.48%     5.52MB 11.25%  k8s.io/apimachinery/pkg/runtime.WithoutVersionDecoder.Decode
         0     0% 23.48%     5.52MB 11.25%  k8s.io/apimachinery/pkg/runtime/serializer/json.(*Serializer).Decode
         0     0% 23.48%     5.52MB 11.25%  k8s.io/apimachinery/pkg/runtime/serializer/json.(*Serializer).unmarshal
         0     0% 23.48%     5.52MB 11.25%  k8s.io/client-go/tools/cache.(*ListWatch).List
         0     0% 23.48%     5.52MB 11.25%  k8s.io/client-go/tools/cache.(*Reflector).list.func1
         0     0% 23.48%     5.52MB 11.25%  k8s.io/client-go/tools/cache.(*Reflector).list.func1.2
```
I'd attribute the high churn to the creation of new instances in the removeUnnecessaryReplicaSetData function. @swiatekm-sumo, maybe we can reuse existing instances and cut off the redundant fields instead?
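A rough sketch of what reusing the decoded instance might look like (hypothetical function name, shown only to illustrate the suggestion; whether it meaningfully reduces churn is exactly what a benchmark would need to show):

```go
package kube

import (
	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// trimReplicaSetInPlace strips everything the processor doesn't read from the
// ReplicaSet the watch just decoded, instead of allocating a replacement
// struct. Only name, namespace, UID and owner references survive.
func trimReplicaSetInPlace(rs *appsv1.ReplicaSet) *appsv1.ReplicaSet {
	rs.ObjectMeta = metav1.ObjectMeta{
		Name:            rs.Name,
		Namespace:       rs.Namespace,
		UID:             rs.UID,
		OwnerReferences: rs.OwnerReferences,
	}
	rs.Spec = appsv1.ReplicaSetSpec{}
	rs.Status = appsv1.ReplicaSetStatus{}
	return rs
}
```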
> I'd attribute the high churn to the creation of new instances in the removeUnnecessaryReplicaSetData function. @swiatekm-sumo, maybe we can reuse existing instances and cut off the redundant fields instead?
We could. I originally didn't because it's more complicated to reason about and maintain. If we do want to change it, I'd like to see a benchmark showing an actual improvement before we merge the change.
@sirianni if you don't want to wait until we figure this out and want to upgrade while keeping the old behaviour, I'm pretty sure you can emulate it using the transform processor. You can:

1. Add k8s.replicaset.name to the metadata attributes in the k8sattributes processor config. This does not require the ReplicaSet informer.
2. Derive k8s.deployment.name by doing a regex replace using the same pattern we used to use: ^(.*)-[0-9a-zA-Z]+$
3. Optionally, delete k8s.replicaset.name afterwards.
Thanks @swiatekm-sumo !
Even with extraction of k8s.deployment.name disabled, I'm still seeing some flaky behavior on some of our k8s clusters. This timeout seems to indicate that the ReplicaSet informer is still in use...?
```
Failed to watch *v1.ReplicaSet: failed to list *v1.ReplicaSet: Get "https://10.253.0.1:443/apis/apps/v1/replicasets?limit=500&resourceVersion=0": dial tcp 10.253.0.1:443: i/o timeout
```
Strange, it should only create that one if you have either k8s.deployment.name or k8s.deployment.uid in your metadata attributes list.
> emulate it using the transform processor
Here is the YAML for anyone else reading
```yaml
transform/k8s_attributes:
  metric_statements:
    - context: resource
      statements:
        - set(attributes["k8s.deployment.name"], attributes["k8s.replicaset.name"])
        - replace_pattern(attributes["k8s.deployment.name"], "^(.*)-[0-9a-zA-Z]+$", "$$1")
        - delete_key(attributes, "k8s.replicaset.name")
  trace_statements:
    - context: resource
      statements:
        - set(attributes["k8s.deployment.name"], attributes["k8s.replicaset.name"])
        - replace_pattern(attributes["k8s.deployment.name"], "^(.*)-[0-9a-zA-Z]+$", "$$1")
        - delete_key(attributes, "k8s.replicaset.name")
```
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
This is still needed, should not be stale.
Issue coming up to 1 year birthday 🍰
Thinking about this more, I'd be surprised if optimizing the transform function would help much here as suggested by @dmitryax in https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/23067#issuecomment-1631206263. We're really only allocating an additional struct, all the data we put in it is already allocated and shouldn't trigger any copies. My expectation would be that the churn is simply from deserializing ReplicaSet structs in the Watch.
I don't think we can do much about that cost, but maybe we can update less frequently? For Deployment/StatefulSet/DaemonSet names, we only care about the ReplicaSet owner references. While those references are technically mutable, they will only really change due to relatively rare events. So maybe we could set the ReplicaSet watch to only trigger on Create, and let periodic resync take care of any owner reference discrepancies?
Having looked into this a bit more, I don't think the K8s API lets us have a watch on only ADDED notifications. We can of course ignore the notification based on type, but at that point we've already deserialized it, and probably paid most of the final cost of processing it.
I think what we could do instead is ask for only metadata updates for ReplicaSets (and probably other resource types as well). client-go has a metadata-only informer, and for a lot of resources, metadata is all we need. That should help with memory usage and GC pressure.
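For reference, a minimal sketch of watching ReplicaSets through client-go's metadata-only informer (function name made up); handlers receive *metav1.PartialObjectMetadata, whose ObjectMeta still carries the owner references needed for k8s.deployment.name:

```go
package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/metadata"
	"k8s.io/client-go/metadata/metadatainformer"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// metadataOnlyReplicaSetInformer watches ReplicaSets through the metadata
// client, so the API server only sends ObjectMeta (name, namespace, UID,
// labels, owner references) and the client never decodes spec/status.
func metadataOnlyReplicaSetInformer(cfg *rest.Config) (cache.SharedIndexInformer, error) {
	client, err := metadata.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}
	factory := metadatainformer.NewSharedInformerFactory(client, 10*time.Minute)
	gvr := schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "replicasets"}
	informer := factory.ForResource(gvr).Informer()

	// Handlers receive *metav1.PartialObjectMetadata rather than *appsv1.ReplicaSet.
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if rs, ok := obj.(*metav1.PartialObjectMetadata); ok {
				_ = rs.OwnerReferences // the Deployment owner ref is all we need
			}
		},
	})
	return informer, nil
}
```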
Component(s)
processor/k8sattributes
What happened?
Description
We recently upgraded the collector to 0.78.0 and are noticing a massive increase in CPU utilization. It seems that the main cause is an increase in GC activity. Looking at the heap profile, I see the k8sattributes processor as a top culprit.
Collector version
0.78.2
Environment information
Environment
Kubernetes
OpenTelemetry Collector configuration
Log output
No response
Additional context
This appears to be caused by https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/14003 where the processor started tracking replica sets.
CPU usage before/after upgrade