open-telemetry / opentelemetry-operator

Kubernetes Operator for OpenTelemetry Collector

Distributed collector configuration #1906

Open swiatekm opened 1 year ago

swiatekm commented 1 year ago

Note: This issue is intended to state the problem and collect use cases to anchor the design. It's neither a proposal nor even a high-level design doc.

Currently, configuration for a single Collector CR is monolithic. I'd like to explore the idea of allowing it to be defined in a distributed way, possibly by different users. It would be the operator's job to collect and assemble the disparate configuration CRs and create an equivalent collector configuration - much like how prometheus-operator creates a Prometheus configuration based on ServiceMonitors.

Prior art for similar solutions includes prometheus-operator with its Monitor CRs, and logging-operator.
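
For context, the prometheus-operator model lets each application team declare scrape targets in their own namespace via a ServiceMonitor, and the operator assembles these into a single Prometheus configuration. A minimal sketch (names, namespace, and labels are illustrative):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                # owned by the application team
  namespace: team-a
spec:
  selector:
    matchLabels:
      app: my-app             # select the team's Service by label
  endpoints:
    - port: metrics           # scrape the named port "metrics"
      interval: 30s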

Broadly speaking, the benefits of doing this could be:

Potential problems doing this that are unique to the otel operator:

Somewhat related issues regarding new CRs for collector configuration: #1477

I'd like to request that anyone who would be interested in this kind of feature post a comment in this issue describing their use case.

rupeshnemade commented 1 year ago

Based on our products, I feel this would be a much-needed feature.

Our setup currently has 30 Kubernetes clusters with more than 4000 nodes and 70K pods. We have multiple use cases that are difficult to implement today but would be easier if OTel supported distributed configuration:

  1. We need dynamic Kafka exporter configuration, but because the OTel config is purely static, it is very difficult to update it dynamically for different sets of Kafka brokers (see the sketch after this list).
  2. Right now the static OTel config is tightly coupled to a single set of config rules. If another team needs to add their own OTel rules in a different namespace, that is not possible, because OTel has no distributed config option like Prometheus' ServiceMonitor with service discovery.
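
As a rough illustration of the first point (broker addresses and topic are hypothetical), a Kafka exporter block in the collector config is a static list today, so changing brokers means editing and re-applying the whole config:

exporters:
  kafka:
    protocol_version: 2.0.0
    brokers:                          # static list; hypothetical broker addresses
      - kafka-broker-1.example:9092
      - kafka-broker-2.example:9092
    topic: otlp_spans                 # hypothetical topic name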

Our teams have a growing need to forward logs to their own destinations for analysis and reporting, and to filter out logs. They need to frequently add and remove destinations from the pipeline, so dynamic configuration is really required to enable this at scale.

wreed4 commented 1 year ago

This feature would be very advantageous to us. As we grow as a company, we want to move away from a central team needing to know about the many hundreds of services running on our clusters. Each team that writes a service is responsible for deploying it and exposing any custom metrics or logs they want pulled off-cluster. We want a central team to manage the pipeline through which those metrics and logs get pushed to our central observability platform, but we do not want the owner of that pipeline to have to know which endpoints, logs, or metrics should be forwarded off-cluster and which should not, or even which services exist in the first place. As stated in the initial problem statement of this issue, this is very similar to how the prometheus-operator works, and in fact that is what we use today. To move to an OTel-based solution and replace Prometheus as a forwarding agent, we really need this decentralization ability.

jaronoff97 commented 1 year ago

Thanks everyone for your feedback here. I've come around to this idea and think it would be beneficial to the community. @swiatekm-sumo, I'm going to self-assign and work on this after #1876 is complete. Do you want to collaborate on the design?

lsolovey commented 1 year ago

I totally support this initiative and agree with the use cases already mentioned above.

Another use case I'd like to add is the ability for developers to manage tail sampling configuration. We run hundreds of applications in the cluster, with all observability data collected into a centralized platform. We want application developers to be able to configure tail sampling policies for their applications without touching the OpenTelemetryCollector CRD (which contains a lot of infrastructure-related settings and is managed by the platform team).
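
For example, the kind of fragment a development team might want to own is a tail_sampling processor policy roughly like the following (a sketch; policy names and values are illustrative):

processors:
  tail_sampling:
    decision_wait: 10s              # wait before deciding on a trace
    policies:
      - name: keep-errors           # always keep traces containing errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-the-rest       # sample everything else at 10%
        type: probabilistic
        probabilistic:
          sampling_percentage: 10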

frzifus commented 1 year ago

@lsolovey Could you give an example of what kind of configuration you would expect? I am currently working on a proposal.

frzifus commented 1 year ago

In summary, a good first step would be to separate the configuration of exporters from the collector configuration. I had a conversation about this with @jaronoff97 yesterday. One possibility would be to start with a gateway and exporter CR. Here is an example of how these CRDs relate to each other.

graph TD;
    OpenTelemetryKafkaExporter-->OpenTelemetryExporter;
    OpenTelemetryOtlpExporter-->OpenTelemetryExporter;
    OpenTelemetryExporter-->OpenTelemetryGateway;
    OpenTelemetryExporter-->OpenTelemetryAgent;
    OpenTelemetryAgent-->OpenTelemetryCollector;
    OpenTelemetryGateway-->OpenTelemetryCollector;

Since all these CRDs are based on the OpenTelemetryCollector definition, supporting a native YAML configuration seems to me to be a prerequisite.

Once this is done, we can start prototyping the gateway and exporter CRD.
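
To make this concrete, a separated exporter CR might look roughly like the sketch below. Note that OpenTelemetryExporter does not exist today; the kind, fields, and the reference to a gateway are purely illustrative:

apiVersion: opentelemetry.io/v1alpha1     # hypothetical API version
kind: OpenTelemetryExporter               # hypothetical CRD, not implemented
metadata:
  name: vendor-otlp
spec:
  otlp:
    endpoint: otel-backend.example.com:4317
    tls:
      insecure: false
  gatewayRef:
    name: platform-gateway                # the operator would wire this exporter into the gateway's pipelines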

luolong commented 1 year ago

My attempts so far at setting up and configuring the OTel Collector Operator have led me to thoughts similar to those mentioned here and in #1477.

The Prometheus Operator has the correct idea here, I believe.

There are basically two or three concerns here that would be useful to separate:

pavolloffay commented 8 months ago

I would like to restart this thread with a very simple proposal. The foundation for distributed collector configuration is the config merging feature of the collector. However, the merge currently overrides arrays rather than appending to them; there is a proposal for an append-merge flag in https://github.com/open-telemetry/opentelemetry-collector/issues/8754.
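
As a minimal illustration of the current merge semantics (assuming two fragments passed to the collector via repeated --config flags): maps are merged key by key, but lists are replaced wholesale, so the second fragment below overrides rather than extends the processors list:

# fragment-a.yaml
service:
  pipelines:
    traces:
      processors: [memory_limiter]

# fragment-b.yaml
service:
  pipelines:
    traces:
      processors: [batch]

# merged result today: processors: [batch]   (the array is overridden, not appended)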

Merging configuration is order-dependent (e.g. the order of processors in a pipeline matters). Therefore the proposal is to introduce a new CRD, collectorgroup.opentelemetry.io. The CollectorGroup and collector CRs would initially need to be in the same namespace to play well with the k8s RBAC model.

apiVersion: opentelemetry.io/v1beta1
kind: CollectorGroup
metadata:
  name: simplest
spec:
  root: platform-collector
  collectors:
    - name: receivers
    - name: pii-remove-users
    - name: pii-remove-credit-cards
    - name: export-to-vendor
---
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: platform-collector
spec:
  collectorGroup: true
  config:

The operator could do some validation of the collector configs to make sure each config contains only unique components to avoid overrides.
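
Under this proposal, each member collector listed in the group would carry only its own fragment. A purely illustrative sketch of what the pii-remove-users member could contain (the transform statements are made up):

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: pii-remove-users
spec:
  config:
    processors:
      transform/pii-users:            # illustrative component name
        trace_statements:
          - context: span
            statements:
              - delete_key(attributes, "user.email")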

frzifus commented 8 months ago

I like the idea, but I've a few open questions / thoughts: