open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
3.13k stars 2.4k forks source link

[receiver_creator/k8s] Add support for generating receivers based on provided annotations' hints #34427

Open ChrsMark opened 3 months ago

ChrsMark commented 3 months ago

Component(s)

receiver/receivercreator

Prior related proposal: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/17418

Is your feature request related to a problem? Please describe.

This issue proposes to add support for a hints/suggestions based autodiscovery mechanism for the receivercreator. This was initially discussed at https://cloud-native.slack.com/archives/C01N5UCHTEH/p1718893771362289.

Describe the solution you'd like

The goal is to provide support to the users for automatically enabling receivers by annotating their Pods properly:

apiVersion: v1
kind: Pod
metadata:
  name: annotations-hints-demo
  annotations:
    otel.hints/receiver: redis/1
    otel.hints/endpoint: '`endpoint`:6379'
spec:
  containers:
  - name: redis
    image: redis:latest
    ports:
    - containerPort: 6379

Inspiration comes from Filebeat's/Metricbeat's and Fluentbit's respective features.

Describe alternatives you've considered

While such a dynamic workload monitoring approach is already supported by the receivercreator and the k8s_observer, it still requires the users to have already predefined the receivercreator's sub-configs. This means that if we want to later target and discover a different workload we need to extend the Collector's config and restart the Collector's instances.

Providing the option to achieve the same result seamlessly would ease the experience for both the application developers who deploy workloads on K8s but also for the admins of the K8s cluster and/or the OTel based observability stack.

Additional context

Feature feasibility

I have already tried that out and it seems that the receivercreator can indeed be extended to support such a feature.

There might be a need to distinguish the implementation for logs and metrics respectively, to best cover the default behaviors (i.e. we want to parse all logs from containers by default in a generic way unless we are hinted to use a technology specific "input". One thing to cover here would be to ensure that the feature would play well with a future templates provider)

A very dirty PoC can be found at https://github.com/open-telemetry/opentelemetry-collector-contrib/commit/4591ecfaaddb62c24f1bfe004d33ba4b4c0cb9cf. This is just to illustrate the idea for now. Here is how it works:

values.yaml for installing the Collector's Helm Chart:

mode: daemonset

image:
  repository: otelcontribcol-dev
  tag: "latest"
  pullPolicy: Never

clusterRole:
  create: true
  rules:
    - apiGroups:
        - ''
      resources:
        - 'pods'
      verbs:
        - 'get'
        - 'list'
        - 'watch'
config:
  extensions:
    k8s_observer:
      auth_type: serviceAccount
      node: ${env:K8S_NODE_NAME}
      observe_pods: true
  exporters:
    debug:
      verbosity: detailed
  receivers:
    receiver_creator:
      watch_observers: [ k8s_observer ]
      receivers:
  service:
    extensions: [health_check, k8s_observer]
    pipelines:
      traces:
        processors: [batch]
        exporters: [debug]
      metrics:
        receivers: [ receiver_creator]
        processors: [batch]
        exporters: [debug]
      logs:
        processors: [batch]
        exporters: [debug]

Target Redis Pod:

apiVersion: v1
kind: Pod
metadata:
  name: redis
  annotations:
    otel.hints/receiver: redis/1
    otel.hints/endpoint: '`endpoint`:6379'
  labels:
    app: redis
spec:
  containers:
  - image: redis
    imagePullPolicy: IfNotPresent
    name: redis
    ports:
    - name: redis
      containerPort: 6379
      protocol: TCP

Collector's output:

2024-08-06T13:23:17.509Z    info    receivercreator@v0.103.0/observerhandler.go:86  handling added endpoint {"kind": "receiver", "name": "receiver_creator", "data_type": "metrics", "env": {"annotations":{"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"v1\",\"kind\":\"Pod\",\"metadata\":{\"annotations\":{\"otel.hints/endpoint\":\"`endpoint`:6379\",\"otel.hints/receiver\":\"redis/1\"},\"labels\":{\"app\":\"redis\",\"k8s-app\":\"redis\"},\"name\":\"redis\",\"namespace\":\"default\"},\"spec\":{\"containers\":[{\"image\":\"redis\",\"imagePullPolicy\":\"IfNotPresent\",\"name\":\"redis\",\"ports\":[{\"containerPort\":6379,\"name\":\"redis\",\"protocol\":\"TCP\"}]}]}}\n","otel.hints/endpoint":"`endpoint`:6379","otel.hints/receiver":"redis/1"},"endpoint":"10.244.0.31","id":"k8s_observer/3176770e-5e14-4f7b-9146-0df6a445bb6c","labels":{"app":"redis","k8s-app":"redis"},"name":"redis","namespace":"default","type":"pod","uid":"3176770e-5e14-4f7b-9146-0df6a445bb6c"}}
2024-08-06T13:23:17.509Z    warn    receivercreator@v0.103.0/observerhandler.go:100 handling added hinted receiver  {"kind": "receiver", "name": "receiver_creator", "data_type": "metrics", "subreceiverKey": "redis/1"}
2024-08-06T13:23:17.509Z    info    receivercreator@v0.103.0/observerhandler.go:136 starting receiver   {"kind": "receiver", "name": "receiver_creator", "data_type": "metrics", "name": "redis/1", "endpoint": "10.244.0.31", "endpoint_id": "k8s_observer/3176770e-5e14-4f7b-9146-0df6a445bb6c", "config": {"collection_interval":"10s","endpoint":"`endpoint`:6379"}}
2024-08-06T13:23:18.572Z    info    MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 26, "data points": 31}
2024-08-06T13:23:18.572Z    info    ResourceMetrics #0
Resource SchemaURL: 
Resource attributes:
     -> redis.version: Str(7.4.0)
     -> k8s.namespace.name: Str(default)
     -> k8s.pod.name: Str(redis)
     -> k8s.pod.uid: Str(3176770e-5e14-4f7b-9146-0df6a445bb6c)
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope otelcol/redisreceiver 0.103.0-dev
Metric #0
Descriptor:
     -> Name: redis.clients.blocked
     -> Description: Number of clients pending on a blocking call
     -> Unit: {client}
     -> DataType: Sum
     -> IsMonotonic: false
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
StartTimestamp: 2024-08-06 13:23:16.512548474 +0000 UTC
Timestamp: 2024-08-06 13:23:18.512548474 +0000 UTC
Value: 0
Metric #1
Descriptor:
     -> Name: redis.clients.connected
     -> Description: Number of client connections (excluding connections from replicas)
     -> Unit: {client}
     -> DataType: Sum
     -> IsMonotonic: false
     -> AggregationTemporality: Cumulative
NumberDataPoints #0
....

Specification

Prerequisite: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/35544

In the context of this feature we can focus on 2 primary goals: 1) Collect Metrics from dynamically discovered workloads 2) Collect Logs from dynamically discovered workloads

For Pods with mutliple containers we need to ensure that we collect logs from each one of them and that we can scope the metrics collection accordingly per container/port.

Collect Metrics dynamically from workloads

⚠️ To support use cases where a Pod consists of multiple containers the configurations generated by metrics' hints will be of type: port.

Supported hints:

  1. io.opentelemetry.collector.receiver-creator.metrics/receiver: redis
  2. io.opentelemetry.collector.receiver-creator.metrics/endpoint: "`endpoint`:6379"
  3. io.opentelemetry.collector.receiver-creator.metrics/username: "admin"
  4. io.opentelemetry.collector.receiver-creator.metrics/password: "passpass"
  5. io.opentelemetry.collector.receiver-creator.metrics/collection_interval: 10s
  6. io.opentelemetry.collector.receiver-creator.metrics/timeout: 1m

With some more advanced options:

  1. io.opentelemetry.collector.receiver-creator.metrics/raw: '' To support users that want to provide more advanced configurations.

Extra Additionally we can support defining hints per Pod's container/port i.e.: io.opentelemetry.collector.receiver-creator.metrics.<port_name>/receiver: redis. To support use cases where a Pod consists of multiple containers and we want to specify a specific port

Collect Logs dynamically from workloads

⚠️ To support use cases where a Pod consists of multiple containers the configurations generated by logs' hints will be of type: pod.container (see https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/35544).

Supported hints:

  1. io.opentelemetry.collector.receiver-creator.logs/enabled: true -> This will enable a default filelog receiver for the specific path
  2. io.opentelemetry.collector.receiver-creator.logs/template: redis -> If/When we will have support for technology specific templates(https://github.com/open-telemetry/opentelemetry-collector/issues/8372)
  3. io.opentelemetry.collector.receiver-creator.logs/raw: ''

We could expose more settings of filelogreceiver through annotations. For example io.opentelemetry.collector.receiver-creator.logs/max_log_size, io.opentelemetry.collector.receiver-creator.logs/operators

Extra Additionally we can support defining hints per Pod's container/port i.e.: io.opentelemetry.collector.receiver-creator.logs.<container_name>/operators: redis

Next steps

I would like to get feedback from the community/SIG on this. Some pieces might be missing from this proposal, so any kind of feedback would be helpful here :). Eventually, in case we find this a good fit for the Collector I'd be happy to contribute on its implementation (and future support of it).

We could potentially break this up to 2 different scopes, one for the metrics use-case and one for the logs use-case since the requirements are a bit different in each one of them.

github-actions[bot] commented 3 months ago

Pinging code owners:

github-actions[bot] commented 3 months ago

Pinging code owners for extension/observer/k8sobserver: @rmfitzpatrick @dmitryax. See Adding Labels via Comments if you do not have permissions to add labels yourself.

rogercoll commented 3 months ago

That's a really interesting idea and I reckon it would be aligned with the auto instrumentation approach in OpenTelemetry. The K8s Operator is already available to inject auto-instrumentation for some SDKs, maybe we could reuse some of the annotations naming used there:

annotations:
    instrumentation.opentelemetry.io/inject-collector-receiver: redis/1
    instrumentation.opentelemetry.io/inject-collector-receiver-config: '`endpoint`:6379'

As I understood, the annotations will be on Pod spec level, if a Pod contains more than one container to be instrumented, how could we differentiate their different receiver's configurations? What do you think if we follow a similar approach as Metricbeat but the user needs to append a suffix to the annotation key?

annotations:
    instrumentation.opentelemetry.io/inject-collector-receiver: redis/1
    instrumentation.opentelemetry.io/inject-collector-receiver-config: '`endpoint`:6379'
     instrumentation.opentelemetry.io/inject-collector-receiver/2: redis/2
    instrumentation.opentelemetry.io/inject-collector-receiver-config/2: '`endpoint`:1234'

Sets of annotations can be provided with numeric or text prefixes. If there are annotations that don’t have a suffix, then they get grouped together into a single configuration. I think having a common prefix would ease the annotations detection and parsing.

ChrsMark commented 2 months ago

There is already an issue suggested this feature: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/17418. If we agree on this we can continue with one of these 2.

As I understood, the annotations will be on Pod spec level, if a Pod contains more than one container to be instrumented, how could we differentiate their different receiver's configurations? What do you think if we follow a similar approach as Metricbeat but the user needs to append a suffix to the annotation key?

@rogercoll I would expect this to be possible if we somehow handle the hints per container/port like type == "port" and port == xyz as it was described at https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/25866#issuecomment-1683004547. But good point anyways, and we should take into account in the implementation.

Update: For covering this for Logs we will need https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/35491

ChrsMark commented 2 months ago

Hey @dmitryax , following our offline discussion I have put together the part for how the hints' API would look like. It's the Specification part in the description which lists the hints we could support.

I would suggest we start with the metrics use-case.

ChrsMark commented 1 month ago

@rogercoll @dmitryax I tried my hand at implementing the proposal for metrics: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/35617 (logs can be added in a different Pr since it's anyways blocked on https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/35544). That's more or less how I imagine the implementation but I still want to play a bit around with some details. However if you could take a look and provide your feedback that would be great so to ensure that we are aligned on the direction.

rogercoll commented 1 month ago

@ChrsMark I have taken a look at the https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/35617, the code looks good to me, but I have a concern regarding the extensibility of the Supported metrics annotations.

The PR seems to be tied to specific configuration annotations, meaning that I can only "discover" the receivers that use those specific fields. The PR's supported annotations (receiver configuration) are endpoint, timeout, user, and password, which seem to be some of the configuration fields available in the nginx or redis receivers.

Do we plan to only support specific configuration annotations? Would it be feasible to just support the required ones for the k8sobserver logic's and use the other's as raw configuration? Something like:

// well-known annotations
io.opentelemetry.collector.receiver-creator.metrics/receiver: nginx
io.opentelemetry.collector.receiver-creator.metrics/endpoint: "http://`endpoint`/nginx_status"

// any other annotation will be used as raw configuration field
io.opentelemetry.collector.receiver-creator.metrics/collection_interval: '20s'
io.opentelemetry.collector.receiver-creator.metrics/initial_delay: '20s'
io.opentelemetry.collector.receiver-creator.metrics/read_buffer_size: '10'
io.opentelemetry.collector.receiver-creator.metrics/xyz: 'abc'

will be mapped to:

 nginx:
  endpoint: "http://`endpoint`/nginx_status"
  collection_interval: 20s
  initial_delay: 20s
  read_buffer_size: 10
  xyz: abc
ChrsMark commented 1 month ago

Do we plan to only support specific configuration annotations? Would it be feasible to just support the required ones for the k8sobserver logic's and use the other's as raw configuration?

My proposal would be to start supporting specific settings as annotations so as to have full control of what is allowed as a hint. This would serve as making it straight-forward for the users what specific annotations are supported as well as ensuring no security sensitive settings are exposed over this API. Every user with access to deploy Pods can potentially configure the Collector and take advantage of its privileges by just deploying a hinted Pod.

In this, I would only allow io.opentelemetry.collector.receiver-creator.metrics/raw: '' for advanced use-cases which should also be enabled-disabled (disabled by default) on Collector's startup.

Would love to hear @dmitryax's thoughts on this as well.

dmitryax commented 1 month ago

I have some concerns about the suggested naming convention. Let's brainstorm it more.

  1. Why does it have to be reverse domain notation?
  2. Kubernetes has 63 character limit on the keys. We need it to be shorter.
  3. I don't think we should include receiver-creator or receiver parts in it because we potentially can get rid of them, see https://github.com/open-telemetry/opentelemetry-collector/issues/11238. I would use scraper term instead.
  4. The config parts should not be mixed with root level keys like nginx to avoid potential conflicts.

What about something like this (I'm open to further discussion):

// once we have proper discovery, this would control particular pods turning it off/on
opentelemetry.io/discovery/enabled: "true"

// this one should be required as of right now, until we have the discovery
opentelemetry.io/discovery/scraper: nginx

// extra annotations to control scraper behavior
opentelemetry.io/discovery/signals: metrics
opentelemetry.io/discovery/config/endpoint: "http://`endpoint`/nginx_status"
opentelemetry.io/discovery/config/collection_interval: '20s'
opentelemetry.io/discovery/config/initial_delay: '20s'
opentelemetry.io/discovery/config/read_buffer_size: '10'
opentelemetry.io/discovery/config/xyz: 'abc'
ChrsMark commented 1 month ago

Thank's @dmitryax!

Regarding the 1st point, in case we want to use the same set of hints with docker it is suggested to use reverse dns naming convention: https://docs.docker.com/engine/manage-resources/labels/#key-format-recommendations

I'm fine with 2 and 3 points.

However K8s does not allow multiple /, so our annotations should only contain one.

I think your suggestion would cover the needs of this feature and is quite future compliant with the potential proper discovery.

2 questions: 1) Should we only expose specific settings through the annotations or allow everything? See https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/34427#issuecomment-2400208585 2) How about having an additional annotation category for container logs? opentelemetry.io.discovery.container_logs/enabled: "true". This would indicate that a filelog receiver instance will start collecting Pod's containers' logs.

// once we have proper discovery, this would control particular pods turning it off/on
opentelemetry.io.discovery/enabled: "true"

// this one should be required as of right now, until we have the discovery
opentelemetry.io.discovery/scraper: nginx

// extra annotations to control scraper behavior
opentelemetry.io.discovery/signals: "metrics,logs"
opentelemetry.io.discovery/config.endpoint: "http://`endpoint`/nginx_status"
opentelemetry.io.discovery/config.collection_interval: '20s'
opentelemetry.io.discovery/config.initial_delay: '20s'
opentelemetry.io.discovery/config.read_buffer_size: '10'
opentelemetry.io.discovery/config.xyz: 'abc'

// nested configs
opentelemetry.io/discovery/config.nested: |
  zyz: foo
  bar: baz

// enable container logs collection
opentelemetry.io.discovery.container_logs/enabled: "true"
opentelemetry.io.discovery.container_logs/config.max_log_size: "2MiB"
opentelemetry.io.discovery.container_logs/config.foobar: "xyz"
opentelemetry.io.discovery.container_logs/config.operators: |
  - type: regex_parser
    regex: '^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<sev>[A-Z]*) (?P<msg>.*)$'

We should also take into account the multiple containers case which would be covered by including the name of the port in the annotation for scraper:

opentelemetry.io.discovery.<container_port>/scraper: nginx
opentelemetry.io.discovery.<container_port>/config.endpoint: "http://`endpoint`/nginx_status"

and the name of the container for container_logs:

opentelemetry.io.discovery.container_logs.<container_name>/enabled: "true"
opentelemetry.io.discovery.container_logs.<container_name>/config.max_log_size: "2MiB"
ChrsMark commented 1 month ago

@dmitryax @rogercoll I have update the PR based on the above discussions. The approach looks legit, I'll keep testing more scenarios.

Let me know if there is anything that does not make sense or missing in the above proposal and/or feel free to check the PR directly.

dmitryax commented 1 month ago

Let's try not to restrict users from achieving particular configuration use cases but don't provide different ways to achieve the same results. Otherwise, it'll be much harder to maintain and introduce any changes to the interface if we ever have to do that.

We can stick with the reverse domain notation if we have the one / restriction and put the discovery-related parts before that.

What about starting with supporting the following:

io.opentelemetry.discovery.metrics/enabled: "true"
io.opentelemetry.discovery.logs/enabled: "true"
io.opentelemetry.discovery.metrics.8888/enabled: "false" // takes priority
io.opentelemetry.discovery.log.my-container/enabled: "false" // takes priority

io.opentelemetry.discovery.metrics/scraper: "nginx"
io.opentelemetry.discovery.metrics/config: |
  endpoint: "http://`endpoint`/nginx_status"
  collection_interval: '20s'
  initial_delay: '20s'
  read_buffer_size: '10'
  xyz: 'abc'

io.opentelemetry.discovery.metrics.8888/scraper: "prometheus" // takes priority

Is there any limit on size of the annotation values in k8s so we can provide the whole config easily?

ChrsMark commented 1 month ago

Thank's @dmitryax!

Why do we need to provide annotation per a config option AND the nested config? Can we have only one?

I thought that providing support for nested configurations was a nice to have extension in addition to the base option which is to provide one config option per line/annotation. From my perspective it's more user-friendly to provide "one-liners" like io.opentelemetry.discovery/config.endpoint: "http://`endpoint`/nginx_status" in case you only want to define one single setting. Then, in case there is need for multi-line config settings this should also be allowed.

However since it was also found confusing at https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/35617#discussion_r1810268608 I think we can proceed with one single approach and only support providing the whole config for now.

Is there any limit on size of the annotation values in k8s so we can provide the whole config easily?

The validation code only restricts the total size of annotations to 256kB which equals to 262144 characters. I guess we should be fine with that fair limit.

If we keep metrics and logs (I don't think we need container_ prefix) as a part of the key, let's make it the only way. We don't need the .../signals key in that case.

I'm fine with the io.opnetelemetry.discovery/signals key for tuning the scraper's signals. I think it's more tidy and makes the implementation simpler.

However, the logs signal of the scraper does not indicate necessarily that the container's logs will be collected from stdout/file. That's why I propose the io.opentelemetry.discovery.container_logs/enabled: "true" annotation which can be added extra to a Pod that has io.opentelemetry.discovery/signals: "metrics,logs". In this, additionally to the configured scraper (which can collect metrics and logs signals) a filelog receiver with the container_parser enabled will get started in order to collect the container's logs properly. Would that make sense?

As first step we can only support scraper annotations like the following:

io.opentelemetry.discovery/enabled
io.opentelemetry.discovery/signals
io.opentelemetry.discovery/scraper
io.opentelemetry.discovery/config

io.opentelemetry.discovery.80/scraper # takes priority

Then we can consider the additional opinionated hint/suggestion to collect container logs through the filelog receiver in follow up PR/issue. PR tuned accordingly @dmitryax, let me know what you think.

dmitryax commented 1 month ago

From my perspective it's more user-friendly to provide "one-liners" like io.opentelemetry.discovery/config.endpoint: "http://`endpoint`/nginx_status" in case you only want to define one single setting.

io.opentelemetry.discovery/config: "endpoint: http://`endpoint`/nginx_status" is still pretty user friendly

I'm fine with the io.opnetelemetry.discovery/signals key for tuning the scraper's signals. I think it's more tidy and makes the implementation simpler.

It's unlikely that logs (once we have it) and metrics will be collected by the same receiver/scraper. So we won't able to reuse io.opentelemetry.discovery/config when users want to configure collecting both signals. Also io.opentelemetry.discovery/scraper likely won't be applicable to logs collection. So I think we have to keep the <signal-type> as part of the annotation key and don't introduce io.opentelemetry.discovery/signals

ChrsMark commented 1 month ago

So @dmitryax from your previous proposal the assumption is that io.opentelemetry.discovery.logs/enabled: "true" spawns a filelog receiver with container parser enabled?

(with only metrics signal supporting the io.opentelemetry.discovery/scraper)

So in that case I believe it makes sense, yes, and we can conclude with the following:

io.opentelemetry.discovery.metrics/enabled: "true"
io.opentelemetry.discovery.logs/enabled: "true"
io.opentelemetry.discovery.metrics.8888/enabled: "false" // takes priority

io.opentelemetry.discovery.log.my-container/enabled: "false" // takes priority

io.opentelemetry.discovery.logs/config: | # additional filelog receiver configuration here
  max_log_size: "2MiB"
  operators:
  - type: regex_parser
    regex: '^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<sev>[A-Z]*) (?P<msg>.*)$'

io.opentelemetry.discovery.metrics/scraper: "nginx"
io.opentelemetry.discovery.metrics/config: |
  endpoint: "http://`endpoint`/nginx_status"
  collection_interval: '20s'
  initial_delay: '20s'
  read_buffer_size: '10'
  xyz: 'abc'

io.opentelemetry.discovery.metrics.8888/scraper: "prometheus" // takes priority
dmitryax commented 1 month ago

@ChrsMark, looks good to me! 👍

ChrsMark commented 1 month ago

Thank's @dmitryax @rogercoll for the feedback here!

I have updated the PR and the logic seems to work quite smooth so far.

I will keep testing (and possibly add more tests) but I would appreciate any reviews there already, mainly to validate the receiver_creator's configuration approach (how the feature is turned on/off and its defaults).

ChrsMark commented 3 days ago

Since the metrics part has been merged I have opened a PR that adds support for the logs part based on https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/34427#issuecomment-2436140051.