sirianni opened this issue 1 year ago
Disabling metadata extraction for k8s.deployment.name and k8s.deployment.uid significantly improves CPU utilization:
```yaml
extract:
  metadata:
    - k8s.namespace.name
    - k8s.pod.name
    - k8s.pod.uid
    - k8s.node.name
    - container.image.name
    - container.image.tag
```
@fatsheep9146, can you please work on this issue and see what allocations we can avoid?
@sirianni thanks for reporting!
> Disabling metadata extraction for k8s.deployment.name and k8s.deployment.uid significantly improves CPU utilization
Does it go back to the original (pre-upgrade) level?
Is there a way to get the (perhaps imperfect?) k8s.deployment.name attribute back? We don't really care about the deployment UID but would like to avoid all of this overhead of replicaset bookkeeping.
> Does it go back to the original (pre-upgrade) level?
Yes, it appears so.
> @fatsheep9146, can you please work on this issue and see what allocations we can avoid?
No problem, I will check about this. @dmitryax
@fatsheep9146 I think I know what's going on. Eric runs it on the agent with a node selector, so the pods informer has a limited set of pods to watch. The replicaset informer, on the other hand, cannot have a node selector, so we keep state for all the replicasets in the whole cluster, which can be pretty big.
One option is to get rid of the replicaset informer and populate the pod -> rs -> deployment data right in the Pod struct every time a pod is added or updated.
The replicaset informer was added due to the issue of , which wanted to add k8s.deployment.uid support. In that issue, we discussed how to produce the k8s.deployment.name.
So I think maybe we can do it this way.
@dmitryax WDYT?
I don't believe we have to use the replicaset informer. We don't need any dynamic data from it that can be changed without rolling the pod. So we need to watch for pods only, not replicasets. We can fetch replicaset and deployment data only when a pod gets changed or added. To avoid high-frequency calls to the k8s API for a pod in a crash loop, we can utilize an LRU cache for replicasets instead of the informer.
> I don't believe we have to use the replicaset informer. We don't need any dynamic data from it that can be changed without rolling the pod. So we need to watch for pods only, not replicasets. We can fetch replicaset and deployment data only when a pod gets changed or added. To avoid high-frequency calls to the k8s API for a pod in a crash loop, we can utilize an LRU cache for replicasets instead of the informer.
Ok, I got your point: you mean we don't need to cache all replicaset data through the informer, but should use an LRU cache for the replicasets whose pods we are already watching, and only update the cache when a new pod is added. I think when an existing pod is updated, it should not indicate that the replicaset is also updated, because if the replicaset is changed, it should always create new pods and remove the old ones.
@dmitryax
> Ok, I got your point: you mean we don't need to cache all replicaset data through the informer, but should use an LRU cache for the replicasets whose pods we are already watching, and only update the cache when a new pod is added.
Yes
> I think when an existing pod is updated, it should not indicate that the replicaset is also updated, because if the replicaset is changed, it should always create new pods and remove the old ones.
Not sure I understand this. We have a handler for a new pod, as well as a handler for a pod update. The first one would always be a cache miss for the replicaset cache, so we need to make another call to k8s api to fill it. The second one is likely always a cache hit.
Ok, I basically understand your idea, I will try to fix it this way.
Speaking of the LRU cache, do you know of any component in collector-contrib with a similar implementation that I can import?
@dmitryax
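For illustration, a minimal Go sketch of the LRU idea discussed above, assuming the commonly used github.com/hashicorp/golang-lru/v2 package (any LRU implementation would do); the rsCache/ownerForReplicaSet names are made up for this example:

```go
package kube

import (
	"context"

	lru "github.com/hashicorp/golang-lru/v2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// rsOwner is the small slice of ReplicaSet data the processor actually needs.
type rsOwner struct {
	deploymentName string
	deploymentUID  string
}

// rsCache resolves ReplicaSet -> Deployment ownership via an LRU cache,
// falling back to a direct API call on a miss.
type rsCache struct {
	client kubernetes.Interface
	cache  *lru.Cache[string, rsOwner] // key: "<namespace>/<replicaset-name>"
}

func newRSCache(client kubernetes.Interface, size int) (*rsCache, error) {
	c, err := lru.New[string, rsOwner](size)
	if err != nil {
		return nil, err
	}
	return &rsCache{client: client, cache: c}, nil
}

// ownerForReplicaSet is meant to be called from the pod add/update handlers.
// A newly added pod is typically a cache miss (one GET against the API
// server); later updates for pods of the same ReplicaSet are cache hits.
func (r *rsCache) ownerForReplicaSet(ctx context.Context, namespace, name string) (rsOwner, error) {
	key := namespace + "/" + name
	if owner, ok := r.cache.Get(key); ok {
		return owner, nil
	}
	rs, err := r.client.AppsV1().ReplicaSets(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return rsOwner{}, err
	}
	var owner rsOwner
	for _, ref := range rs.OwnerReferences {
		if ref.Kind == "Deployment" {
			owner = rsOwner{deploymentName: ref.Name, deploymentUID: string(ref.UID)}
			break
		}
	}
	r.cache.Add(key, owner)
	return owner, nil
}
```

Keyed by namespace/name, the cache stays bounded by the number of distinct ReplicaSets that own pods the agent actually watches, rather than by every ReplicaSet in the cluster.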
In our fork of k8sattributes, we've had success reducing memory usage of the informer stores themselves by using SetTransform on the larger informers. https://github.com/SumoLogic/sumologic-otel-collector/pull/593 made a very big difference for us in large clusters.
We've only done it for Pods in a gateway context, but in principle it could be done for every resource type we start informers for. The change is also very simple conceptually, as it's just getting rid of fields we don't use. Think it's worth pursuing @dmitryax if we're concerned about memory usage? I can open a new issue for it.
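For context, a rough sketch of what SetTransform on the pod informer can look like with client-go; the exact set of retained fields below is an assumption for illustration and may differ from what the linked SumoLogic PR keeps:

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// newTrimmedPodInformer installs a transform that strips fields the processor
// never reads, so only the trimmed object is kept in the informer store.
func newTrimmedPodInformer(client kubernetes.Interface) (cache.SharedIndexInformer, error) {
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	informer := factory.Core().V1().Pods().Informer()
	err := informer.SetTransform(func(obj interface{}) (interface{}, error) {
		pod, ok := obj.(*corev1.Pod)
		if !ok {
			return obj, fmt.Errorf("unexpected object type %T", obj)
		}
		// Keep only metadata and the handful of spec/status fields needed for
		// attribute extraction; managed fields, container specs, volumes, etc.
		// are dropped before the object lands in the cache.
		return &corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{
				Name:              pod.Name,
				Namespace:         pod.Namespace,
				UID:               pod.UID,
				Labels:            pod.Labels,
				Annotations:       pod.Annotations,
				OwnerReferences:   pod.OwnerReferences,
				CreationTimestamp: pod.CreationTimestamp,
			},
			Spec:   corev1.PodSpec{NodeName: pod.Spec.NodeName, HostNetwork: pod.Spec.HostNetwork},
			Status: corev1.PodStatus{PodIP: pod.Status.PodIP, Phase: pod.Status.Phase},
		}, nil
	})
	return informer, err
}
```

Because the transform runs before objects are stored, the savings apply both to the initial list and to every subsequent watch event.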
@swiatekm-sumo sounds like a good idea to reduce the size of the items fetched by the informer. It'd be great if you can upstream it. But we still need to solve this issue to avoid fetching all the replicasets in the agent mode.
@swiatekm-sumo is there any other reason you have this component forked?
> @swiatekm-sumo sounds like a good idea to reduce the size of the items fetched by the informer. It'd be great if you can upstream it. But we still need to solve this issue to avoid fetching all the replicasets in the agent mode.
My hope was to keep the ReplicaSet footprint low enough that we could just store all of them. Your proposal with the LRU cache and fetching ReplicaSets from the API Server directly sounds significantly more complicated than what we have now.
> @swiatekm-sumo is there any other reason you have this component forked?
It was originally forked back when the contrib processor only included Pod metadata, I think, though I wasn't here for it.
At this point, the only reasons are the memory usage and some owner metadata - I think only CronJob and Service names. I would very much like to upstream those and get rid of the fork.
> My hope was to keep the ReplicaSet footprint low enough that we could just store all of them. Your proposal with the LRU cache and fetching ReplicaSets from the API Server directly sounds significantly more complicated than what we have now.
We can try this first and see if it solves the high memory issue. cc @fatsheep9146
> At this point, the only reasons are the memory usage and some owner metadata - I think only CronJob and Service names. I would very much like to upstream those and get rid of the fork.
We have k8s.cronjob.name already. Not sure about service names, given that one pod can serve several services. I believe someone raised that issue; we can reconsider it.
> keep the ReplicaSet footprint low enough that we could just store all of them.

I agree with @dmitryax that if the Node (or NodeFromEnvVar) filter is specified then the processor should make a very strong attempt to only store metadata for pods related to that node. We run some very large Kubernetes clusters (600+ nodes) and it's very inefficient and unnecessary for a DaemonSet collector to store metadata for resources not on its local node.
No matter how small the footprint is, it's very desirable to have memory usage be O(# pods on node) instead of O(# pods on cluster), since the former is much more fixed.
Also, when running in agent / DaemonSet mode, it's critical for the collector's footprint to be as low as possible. Storing any unnecessary metadata should be avoided.
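For reference, the node-scoped pod watch the agent relies on amounts to a field selector on the informer, roughly as sketched below (the KUBE_NODE_NAME env var name is only an example of a downward-API injected node name). As noted above, ReplicaSets offer no equivalent node-scoped selector, which is why the replicaset informer ends up holding cluster-wide state.

```go
package main

import (
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// nodeScopedPodInformer watches only the pods scheduled on the local node, so
// the informer store is O(# pods on node) rather than O(# pods on cluster).
// The node name is assumed to be injected via the downward API (fieldRef: spec.nodeName).
func nodeScopedPodInformer(client kubernetes.Interface) cache.SharedIndexInformer {
	nodeName := os.Getenv("KUBE_NODE_NAME")
	factory := informers.NewSharedInformerFactoryWithOptions(
		client,
		10*time.Minute,
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.FieldSelector = "spec.nodeName=" + nodeName
		}),
	)
	return factory.Core().V1().Pods().Informer()
}
```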
This is a good point, and I agree that controlling resource consumption for DaemonSet Pods is very important. For the SetTransform optimization, this isn't an either/or thing - we can do it anyway for Pods, and reduce the memory footprint further even without having enabled any metadata requiring ReplicaSets.
I do want to point out though, that in this case (of the solution discussed in this issue) the agent and gateway requirements are somewhat opposed. The agent running as a DaemonSet wants to only store the metadata relevant to Pods on its Node, and is (presumably, as we haven't tested this) fine doing some synchronous requests to the API Server if that means avoiding having to store global cluster state. On the other hand, the gateway needs the whole cluster state, and is definitely not fine making a single synchronous request for each ReplicaSet in a busy cluster.
In an ideal world, we would use different approaches for these use cases, but that significantly increases the maintenance burden and complexity of the processor. So we might need to compromise one way or another.
Out of curiosity, if we implemented the SetTransform optimizations and achieved lower memory consumption than in 0.77.0, even while storing all ReplicaSets, would you be unhappy with that outcome?
@swiatekm-sumo, thanks for pointing this out. I agree differentiating the logic based on the collector's deployment topology is not an ideal solution. Let's apply your optimization both to pods and replica sets informers first and see what the outcome will be. @swiatekm-sumo, do you have a chance to submit a PR given that you did in your fork?
I also agree with @swiatekm-sumo; the SetTransform optimization is beneficial to both gateway mode and daemonset mode. I think the daemonset-mode optimization should build on top of the SetTransform optimization.
@dmitryax @sirianni
Yep, I can work on it. You want to assign this issue to me or should we track the SetTransform change separately?
@swiatekm-sumo thank you! I created https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/23226 to track it separately.
@sirianni the new version 0.80.0 is out. Can you please try it out and see what the RAM usage is in your cluster with k8s.deployment.name enabled?
> @sirianni the new version 0.80.0 is out. Can you please try it out and see what the RAM usage is in your cluster with k8s.deployment.name enabled?
Yes, we will give it a try this week.
ping @sirianni, have you tried this?
We are still seeing significantly elevated CPU usage after the upgrade due to memory allocations / GC churn.
I don't think it's quite as bad as before, but it's still elevated enough that we'll have to resort to a workaround of disabling the k8s.deployment.name attribute. Again, this is a daemonset pod that runs co-located with all of our actual applications, so CPU and memory are at a premium.
I realize the code complexity you are dealing with trying to balance "agent mode" vs. "gateway mode", but I would very much appreciate restoring the previous behavior that didn't have this issue. I'd rather have a slightly imperfect k8s.deployment.name attribute (https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/12936#issuecomment-1208944998) than this increase in resource utilization.
```
$ go tool pprof ./bin/collector http://localhost:1777/debug/pprof/heap
File: collector
Build ID: 79c5900955e9865a447786690c0e370565a2dd56
Type: inuse_space
Time: Jul 11, 2023 at 1:13pm (EDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top15 -cum
Showing nodes accounting for 11.51MB, 23.48% of 49.04MB total
Showing top 15 nodes out of 285
      flat  flat%   sum%        cum   cum%
         0     0%     0%    11.51MB 23.48%  k8s.io/client-go/tools/cache.(*DeltaFIFO).Resync
       1MB  2.04%  2.04%    11.51MB 23.48%  k8s.io/client-go/tools/cache.(*DeltaFIFO).queueActionLocked
         0     0%  2.04%    11.51MB 23.48%  k8s.io/client-go/tools/cache.(*DeltaFIFO).syncKeyLocked
         0     0%  2.04%    11.51MB 23.48%  k8s.io/client-go/tools/cache.(*Reflector).startResync
         0     0%  2.04%       11MB 22.44%  runtime.main
         0     0%  2.04%    10.51MB 21.44%  github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor/internal/kube.New.func2
   10.51MB 21.44% 23.48%    10.51MB 21.44%  github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor/internal/kube.removeUnnecessaryReplicaSetData (inline)
         0     0% 23.48%        6MB 12.24%  main.main
         0     0% 23.48%     5.53MB 11.29%  runtime.doInit
         0     0% 23.48%     5.52MB 11.25%  k8s.io/apimachinery/pkg/runtime.WithoutVersionDecoder.Decode
         0     0% 23.48%     5.52MB 11.25%  k8s.io/apimachinery/pkg/runtime/serializer/json.(*Serializer).Decode
         0     0% 23.48%     5.52MB 11.25%  k8s.io/apimachinery/pkg/runtime/serializer/json.(*Serializer).unmarshal
         0     0% 23.48%     5.52MB 11.25%  k8s.io/client-go/tools/cache.(*ListWatch).List
         0     0% 23.48%     5.52MB 11.25%  k8s.io/client-go/tools/cache.(*Reflector).list.func1
         0     0% 23.48%     5.52MB 11.25%  k8s.io/client-go/tools/cache.(*Reflector).list.func1.2
```
I'd attribute the high churn to the creation of new instances in the removeUnnecessaryReplicaSetData function. @swiatekm-sumo, maybe we can reuse existing instances and cut off the redundant fields instead?
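A rough sketch of what reusing the decoded instance might look like (hypothetical function name, shown only to illustrate the suggestion; whether it meaningfully reduces churn is exactly what a benchmark would need to show):

```go
package kube

import (
	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// trimReplicaSetInPlace strips everything the processor doesn't read from the
// ReplicaSet the watch just decoded, instead of allocating a replacement
// struct. Only name, namespace, UID and owner references survive.
func trimReplicaSetInPlace(rs *appsv1.ReplicaSet) *appsv1.ReplicaSet {
	rs.ObjectMeta = metav1.ObjectMeta{
		Name:            rs.Name,
		Namespace:       rs.Namespace,
		UID:             rs.UID,
		OwnerReferences: rs.OwnerReferences,
	}
	rs.Spec = appsv1.ReplicaSetSpec{}
	rs.Status = appsv1.ReplicaSetStatus{}
	return rs
}
```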
> I'd attribute the high churn to the creation of new instances in the removeUnnecessaryReplicaSetData function. @swiatekm-sumo, maybe we can reuse existing instances and cut off the redundant fields instead?
We could. I originally didn't because it's more complicated to reason about and maintain. If we do want to change it, I'd like to see a benchmark showing an actual improvement before we merge the change.
@sirianni if you don't want to wait until we figure this out and want to upgrade while keeping the old behaviour, I'm pretty sure you can emulate it using the transform processor. You can:

1. Add k8s.replicaset.name to the metadata attributes in the k8sattributes processor config. This does not require the ReplicaSet informer.
2. Derive k8s.deployment.name by doing a regex replace using the same pattern we used to use: ^(.*)-[0-9a-zA-Z]+$
3. Optionally, delete k8s.replicaset.name afterwards.
Thanks @swiatekm-sumo !
Even with extraction of k8s.deployment.name disabled, I'm still seeing some flaky behavior on some of our k8s clusters. This timeout seems to indicate that the ReplicaSet informer is still in use...?
```
Failed to watch *v1.ReplicaSet: failed to list *v1.ReplicaSet: Get "https://10.253.0.1:443/apis/apps/v1/replicasets?limit=500&resourceVersion=0": dial tcp 10.253.0.1:443: i/o timeout
```
Strange, it should only create that one if you have either k8s.deployment.name or k8s.deployment.uid in your metadata attributes list.
> emulate it using the transform processor
Here is the YAML for anyone else reading
```yaml
transform/k8s_attributes:
  metric_statements:
    - context: resource
      statements:
        - set(attributes["k8s.deployment.name"], attributes["k8s.replicaset.name"])
        - replace_pattern(attributes["k8s.deployment.name"], "^(.*)-[0-9a-zA-Z]+$", "$$1")
        - delete_key(attributes, "k8s.replicaset.name")
  trace_statements:
    - context: resource
      statements:
        - set(attributes["k8s.deployment.name"], attributes["k8s.replicaset.name"])
        - replace_pattern(attributes["k8s.deployment.name"], "^(.*)-[0-9a-zA-Z]+$", "$$1")
        - delete_key(attributes, "k8s.replicaset.name")
```
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
This is still needed, should not be stale.
Issue coming up to 1 year birthday 🍰
Thinking about this more, I'd be surprised if optimizing the transform function would help much here as suggested by @dmitryax in https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/23067#issuecomment-1631206263. We're really only allocating an additional struct, all the data we put in it is already allocated and shouldn't trigger any copies. My expectation would be that the churn is simply from deserializing ReplicaSet structs in the Watch.
I don't think we can do much about that cost, but maybe we can update less frequently? For Deployment/StatefulSet/DaemonSet names, we only care about the ReplicaSet owner references. While those references are technically mutable, they will only really change due to relatively rare events. So maybe we could set the ReplicaSet watch to only trigger on Create, and let periodic resync take care of any owner reference discrepancies?
Having looked into this a bit more, I don't think the K8s API lets us have a watch on only ADDED notifications. We can of course ignore the notification based on type, but at that point we've already deserialized it, and probably paid most of the final cost of processing it.
I think what we could do instead is ask for only metadata updates for ReplicaSets (and probably other resource types as well). client-go has a metadata-only informer, and for a lot of resources, metadata is all we need. That should help with memory usage and GC pressure.
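For reference, a minimal sketch of watching ReplicaSets through client-go's metadata-only informer (function name made up); handlers receive *metav1.PartialObjectMetadata, whose ObjectMeta still carries the owner references needed for k8s.deployment.name:

```go
package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/metadata"
	"k8s.io/client-go/metadata/metadatainformer"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// metadataOnlyReplicaSetInformer watches ReplicaSets through the metadata
// client, so the API server only sends ObjectMeta (name, namespace, UID,
// labels, owner references) and the client never decodes spec/status.
func metadataOnlyReplicaSetInformer(cfg *rest.Config) (cache.SharedIndexInformer, error) {
	client, err := metadata.NewForConfig(cfg)
	if err != nil {
		return nil, err
	}
	factory := metadatainformer.NewSharedInformerFactory(client, 10*time.Minute)
	gvr := schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "replicasets"}
	informer := factory.ForResource(gvr).Informer()

	// Handlers receive *metav1.PartialObjectMetadata rather than *appsv1.ReplicaSet.
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if rs, ok := obj.(*metav1.PartialObjectMetadata); ok {
				_ = rs.OwnerReferences // the Deployment owner ref is all we need
			}
		},
	})
	return informer, nil
}
```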
Component(s)
processor/k8sattributes
What happened?
Description
We recently upgraded the collector to 0.78.0 and are noticing a massive increase in CPU utilization. It seems that the main cause is an increase in GC activity. Looking at the heap profile, I see the k8sattributes processor as a top culprit.
Collector version
0.78.2
Environment information
Environment
Kubernetes
OpenTelemetry Collector configuration
Log output
No response
Additional context
This appears to be caused by https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/14003 where the processor started tracking replica sets.
CPU usage before/after upgrade