signalfx / splunk-otel-collector-chart

Splunk OpenTelemetry Collector for Kubernetes
Apache License 2.0

Provide annotations to override HEC token and endpoint #936

Closed boriscosic closed 12 months ago

boriscosic commented 1 year ago

Is your feature request related to a problem? Please describe.

Hi Splunk,

We currently operate a shared EKS cluster where a number of teams can host their applications. On the cluster we have installed splunk-otel-collector with a default cluster index, cluster HEC token, and Splunk endpoint.

Each team uses their own Splunk index by:

1. Attaching their index to the cluster HEC token.
2. Annotating pods with splunk.com/index so logs go to their index (see the sketch below).
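
For reference, the per-pod routing in step 2 looks roughly like this. It is a minimal sketch; the pod name, image, and index name are placeholders, but `splunk.com/index` is the annotation the chart already supports:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: team-a-app            # placeholder
  annotations:
    # Send this pod's logs to the team's own index instead of the cluster default.
    splunk.com/index: "team-a-index"
spec:
  containers:
    - name: app
      image: team-a/app:latest   # placeholder
```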

The first issue is that splunk-otel-collector batches everything under one HEC token, so when there is a problem with a single index, the entire batch from the node is dropped (https://github.com/signalfx/splunk-otel-collector-chart/issues/935). Our Splunk team also feels that having one HEC token for multiple indexes creates tighter coupling, and they would like to see a HEC token per team or per index. We currently can't support this because the HEC token is provided during chart installation.

The second issue is that some apps send a lot more throughput than others, and in some cases they might require a heavy forwarder. As the endpoint is set at the agent level, there is currently no way to override it without impacting all the applications. We currently can't support this either, as the endpoint is also provided during chart installation.

Describe the solution you'd like

Currently the HEC token is provided at the chart level. Would it be possible to override it at the pod level with these annotations:

splunk.com/hec-token: "An alternative token to the one provided during install"
splunk.com/hec-endpoint: "An alternative ingest endpoint to the one provided during install"

The Splunk OTel Collector would then use these values instead of the defaults provided at installation.
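
To make the proposal concrete, a pod spec might look like the sketch below. The two hec-* annotations do not exist today (they are the feature being requested), and the token, endpoint, and names are hypothetical placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: team-b-app
  annotations:
    splunk.com/index: "team-b-index"
    # Proposed, not currently supported: per-pod overrides of the chart-level defaults.
    splunk.com/hec-token: "00000000-0000-0000-0000-000000000000"
    splunk.com/hec-endpoint: "https://team-b-hf.example.com:8088/services/collector"
spec:
  containers:
    - name: app
      image: team-b/app:latest
```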

Describe alternatives you've considered

We have some tooling workloads for cluster autoscaling which run in their own node group. We have set up a separate splunk-otel-collector chart installation for that node group with different resources. This is not scalable, as we cannot set up a node group for each team.
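
For context, that second installation is roughly a values override like the sketch below. The key names are from memory and should be checked against the chart's values.yaml; the endpoint, token, index, and node label are placeholders:

```yaml
# values-tooling.yaml -- second splunk-otel-collector release for the tooling node group
splunkPlatform:
  endpoint: "https://tooling-hec.example.com:8088/services/collector"
  token: "00000000-0000-0000-0000-000000000000"
  index: "tooling"
# Schedule the agent DaemonSet only onto the tooling node group.
nodeSelector:
  node-group: "tooling"
```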

The new Splunk OTel Collector also provides the option to explicitly include logs via splunk.com/include: "true". This is how we currently collect logs to avoid unnecessary ingestion, and it lets us turn off logging on pods with incorrect indexes.
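
Assuming the chart's include-annotation mode is enabled, the opt-in pattern looks roughly like this (names are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: team-c-app
  annotations:
    # Explicitly opt this pod's logs in to collection.
    splunk.com/include: "true"
    splunk.com/index: "team-c-index"
spec:
  containers:
    - name: app
      image: team-c/app:latest
```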

Additional context

We provide a shared Kubernetes environment for a number of teams. This is a cost-effective and scalable pattern where each team benefits from shared tooling and a shared support model. However, when one pod can indirectly impact the logging of another pod simply by providing a bad index, it raises the noisy-neighbour problem. Logs are critical to teams' operations, and our best option right now is to address the issue as soon as it occurs.

atoulme commented 1 year ago

Please consider also opening a Splunk idea: https://ideas.splunk.com/

matthewmodestino commented 12 months ago

Just like a node group for each team is not scalable, an index or token per team usually isn't either. How many teams are we talking about here?

This would require a HEC exporter for every team and would likely result in overly complex agent config, an unnecessary increase in agent utilization, and token and index sprawl. What you're really describing is the "sidecar" approach, which, while valid, most folks opt to simplify with the node-agent approach we use.
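
For a sense of what "a HEC exporter for every team" means in practice, here is a rough agent-config sketch, not something the chart generates today. It assumes the target index is carried on a resource attribute (shown here as a hypothetical `com.splunk.index`) and uses the contrib routing processor; tokens, endpoints, and index names are placeholders:

```yaml
exporters:
  splunk_hec/team-a:
    token: "<team-a-token>"
    endpoint: "https://team-a-hec.example.com:8088/services/collector"
    index: "team-a-index"
  splunk_hec/team-b:
    token: "<team-b-token>"
    endpoint: "https://team-b-hec.example.com:8088/services/collector"
    index: "team-b-index"

processors:
  routing:
    attribute_source: resource
    from_attribute: com.splunk.index   # hypothetical attribute carrying the pod's index
    default_exporters: [splunk_hec/team-a]
    table:
      - value: team-a-index
        exporters: [splunk_hec/team-a]
      - value: team-b-index
        exporters: [splunk_hec/team-b]

service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [routing]
      exporters: [splunk_hec/team-a, splunk_hec/team-b]
```

Every additional team adds another exporter and routing entry, which is the sprawl described above.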

The HEC token really shouldn't be treated as an identifier, nor does it need to be mapped to an index. It is really only an auth method for the agents sending data. I know it sounds logical to separate these things, but I think with a closer look you'll see that the cluster level is usually the best way to decouple configs (a token per cluster, not per app team).

Even the index is not controlled by the token: the agent sets the index field, so setting the index on the token does nothing. Also, once received by Splunk, the index can even be overridden by the indexing pipeline, where further logic and manipulation can be implemented.

Most of our customers use the agent features to set indexes and let Splunk route accordingly (the annotation features exist because of this), and implement any further manipulation of indexes in the Splunk indexing pipeline.

The incorrect-index item can also be solved by not setting "allowed indexes" on the token, so if the index is incorrect the event gets sent to a "lastChanceIndex" instead of being rejected.
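
Roughly, on the Splunk side that looks like the sketch below. This is from memory, so verify the stanza and setting names against your Splunk version; the token and index names are placeholders:

```conf
# inputs.conf -- HEC token with no "indexes" allow-list, so events naming an
# unexpected index are not rejected outright.
[http://k8s-cluster-token]
token = 00000000-0000-0000-0000-000000000000
disabled = 0

# indexes.conf -- catch events addressed to a non-existent index instead of dropping them.
[default]
lastChanceIndex = lastchanceindex
```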

Thinking it might be best to have a chat with your account team, and we can try to help find a good middle ground. I'd be happy to help take a closer look with them; tell them to holler at me.

boriscosic commented 12 months ago

Thanks @matthewmodestino, that makes a lot more sense now. I will ping our account team and we'll discuss the above suggestion.