open-telemetry / opentelemetry-operator

Kubernetes Operator for OpenTelemetry Collector
Apache License 2.0

What are the best practices for sharing prod and staging OTel workloads in 1 Kubernetes cluster using OTel Collector? #2781

Open yaihui opened 5 months ago

yaihui commented 5 months ago

Hi,

I have 2 environments of my application, staging and production, running in 1 Kubernetes cluster, segregated by different namespaces.

I have already deployed an OTel Collector as a daemonset with the OTel Operator, and applied the autoinstrumentation CRD to the staging namespace, exporting to http://opentelemetry-collector.opentelemetry:4318. In my collector Helm chart, I have defined my otlphttp exporter endpoint to point to an ECE APM instance.

Wanted to seek advice - should production OpenTelemetry data share the same collector but push to a different export endpoint? Or should I create another collector to isolate the workloads in case staging has a high load?

I also tried applying the autoinstrumentation CRD to the production namespace, exporting to a different otlphttp endpoint. To support this, I deployed a second OTel Collector daemonset in another namespace so I could define a separate otlphttp exporter endpoint for production, but encountered the error below:

message: '0/6 nodes are available: 1 node(s) didn't have free ports for the
    requested pod ports, 1 node(s) didn't match Pod's node affinity/selector,
    1 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't
    tolerate, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the
    pod didn't tolerate.'

Hence, I would like to find out the best practices for using the OTel Collector to auto-instrument my application while sending each environment's data to a different exporter endpoint.

TylerHelmuth commented 5 months ago

There are a lot of questions here that essentially boil down to general k8s cluster management, but we'll try to help where we can.

As long as you're using daemonsets, you won't be able to deploy 2 collector instances that use the same ports. Daemonsets are designed to run a pod on every node, and only 1 pod on each node can claim port 4318. If you want to run 2 daemonset instances of the collector that both use the OTLP receiver, one can use 4318 but the other will need a different port.
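For instance, the second daemonset's OTLP receiver could be moved to non-default ports. A minimal sketch (the port numbers 14317/14318 are arbitrary examples, not a convention; any free ports work):

```yaml
# Second collector daemonset: move both OTLP endpoints off the defaults
# so this daemonset can bind alongside the first one on the same nodes.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: ${env:MY_POD_IP}:14317  # instead of the default 4317
      http:
        endpoint: ${env:MY_POD_IP}:14318  # instead of the default 4318
```

The autoinstrumentation for that environment would then need its exporter endpoint updated to match the new port.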

It is up to you whether you want to manage 1 daemonset for both production and staging or 2. Daemonsets are not cheap, as they consume resources on every node, and managing 2 deployments adds operational overhead. If you choose to use 1 daemonset, you can isolate the data/components using different pipelines within the collector, such as traces/prod and traces/staging, but you'll need to scale the collector to handle both environments' workloads simultaneously.
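One way a single daemonset could be split into per-environment pipelines is to give each environment its own receiver and pipeline. A sketch, assuming the staging/production exporter endpoints from this thread and an arbitrary second port (4319) for production traffic:

```yaml
receivers:
  otlp/staging:
    protocols:
      http:
        endpoint: ${env:MY_POD_IP}:4318
  otlp/prod:
    protocols:
      http:
        endpoint: ${env:MY_POD_IP}:4319  # example port; any free port works
exporters:
  otlphttp/staging:
    endpoint: https://example-staging.com:55681
  otlphttp/production:
    endpoint: https://example-production.com:55681
service:
  pipelines:
    traces/staging:
      receivers: [otlp/staging]
      exporters: [otlphttp/staging]
    traces/prod:
      receivers: [otlp/prod]
      exporters: [otlphttp/production]
```

Each environment's autoinstrumentation would then point at its own port, so the collector never has to guess which environment a span came from.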

yaihui commented 5 months ago

Thanks for the response!

Based on the collector, the OTLP receiver's default ports are 4317 (gRPC) and 4318 (HTTP):

  otlp:
    protocols:
      grpc:
        endpoint: ${env:MY_POD_IP}:4317
      http:
        endpoint: ${env:MY_POD_IP}:4318

Does this mean that if I have to deploy 2 collector instances, the production collector's OTLP receiver can only use 4317? Is there any significant difference between these 2 ports?

On the point

If you choose to use 1 daemonset you can isolate the data/components using different pipelines within the collector, such as traces/prod and traces/staging

How would you define in the collector which endpoint the telemetry data should go to for each environment, since both environments forward to the same endpoint and port: http://opentelemetry-collector.opentelemetry:4318?

Example of the configuration below:

config:
  exporters:
    otlphttp/staging:
      endpoint: https://example-staging.com:55681
    otlphttp/production:
      endpoint: https://example-production.com:55681
  service:
    pipelines:
      logs/staging:
        exporters:
          - otlphttp/staging
      logs/production:
        exporters:
          - otlphttp/production

How would the collector identify which data belongs to staging and which to production?

swiatekm commented 5 months ago

Even if you could run a separate DaemonSet for prod workloads, doing so wouldn't provide much meaningful isolation, as both would still run on the same Nodes, in the same failure domain. Something that people often do with this kind of setup is running production workloads on a separate group of Nodes, therefore physically separating them from other workloads in the cluster. Then you can have one DaemonSet per node group, and a decent amount of isolation.
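As a sketch of that node-pool approach: the operator's OpenTelemetryCollector resource accepts standard `nodeSelector` and `tolerations` fields, so a production-only collector could be pinned to a tainted production node pool (the `environment` label and taint names below are hypothetical):

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-prod
spec:
  mode: daemonset
  nodeSelector:
    environment: production      # hypothetical node label on the prod pool
  tolerations:
    - key: environment           # hypothetical NoSchedule taint on prod nodes
      value: production
      effect: NoSchedule
  config:
    receivers:
      otlp:
        protocols:
          http:
            endpoint: ${env:MY_POD_IP}:4318
    exporters:
      otlphttp:
        endpoint: https://example-production.com:55681
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [otlphttp]
```

Because the staging daemonset would not tolerate the prod taint, only one collector lands on each node, and both collectors can keep the default port 4318.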

If you're running a gateway Collector as a Deployment or StatefulSet, there's an argument for having a separate collector for production data there, for similar reasons.
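On the earlier question of telling prod and staging apart behind a single shared endpoint: one option, not discussed above and so an assumption, is the routing connector from opentelemetry-collector-contrib, keyed on a resource attribute such as `k8s.namespace.name` (which the operator's autoinstrumentation typically injects). A sketch using the exporter endpoints from this thread:

```yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: ${env:MY_POD_IP}:4318
connectors:
  routing:
    default_pipelines: [logs/staging]
    table:
      # Route by the namespace recorded in the resource attributes.
      - statement: route() where attributes["k8s.namespace.name"] == "production"
        pipelines: [logs/production]
exporters:
  otlphttp/staging:
    endpoint: https://example-staging.com:55681
  otlphttp/production:
    endpoint: https://example-production.com:55681
service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [routing]
    logs/staging:
      receivers: [routing]
      exporters: [otlphttp/staging]
    logs/production:
      receivers: [routing]
      exporters: [otlphttp/production]
```

This keeps a single receiver port for all workloads while still fanning data out to environment-specific backends.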