open-telemetry / opentelemetry-operator

Kubernetes Operator for OpenTelemetry Collector
Apache License 2.0

Target allocator should not track pods in completed phases #3269

Open diurnalist opened 1 month ago

diurnalist commented 1 month ago

Component(s)

target allocator

What happened?

Description

Not sure if this is technically a bug, but it's a bit of a footgun. Prometheus' discovery code always returns all pods regardless of their phase. Because K8s garbage collects completed pods only once some threshold of them is reached, targets for pods that no longer exist can keep being scraped for a while. In the worst case, the IP of a completed pod gets reused by a fresh pod; the target is then invalid, and scrape requests are issued against a pod that, e.g., did not advertise a scrape endpoint.

Steps to Reproduce

Create a pod that would otherwise match a scrape_configs item in your config, but set it up so that it terminates early. That pod will keep showing up in the list of targets tracked by the allocator.

Expected Result

We probably shouldn't be trying to scrape metrics from exited pods as they can't even be routed to.
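
To make the expected behavior concrete, here is a rough sketch of the kind of filter I have in mind. It is not the allocator's actual code: `Target`, `isCompleted`, and `dropCompleted` are hypothetical names made up for illustration, and the types are simplified stand-ins. The only thing taken from Kubernetes is the set of pod phase strings (`Pending`, `Running`, `Succeeded`, `Failed`, `Unknown`).

```go
package main

// PodPhase mirrors the phase strings Kubernetes reports in pod.status.phase.
type PodPhase string

const (
	PodPending   PodPhase = "Pending"
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
	PodUnknown   PodPhase = "Unknown"
)

// Target is a hypothetical, simplified stand-in for a discovered scrape target.
type Target struct {
	Address string   // host:port the collector would scrape
	Phase   PodPhase // phase of the pod backing this target
}

// isCompleted reports whether a pod has reached a terminal phase
// (Succeeded or Failed) and therefore can no longer serve metrics.
func isCompleted(p PodPhase) bool {
	return p == PodSucceeded || p == PodFailed
}

// dropCompleted returns only the targets whose pods are not in a terminal
// phase, preserving the original order. The input slice is not modified.
func dropCompleted(targets []Target) []Target {
	kept := make([]Target, 0, len(targets))
	for _, t := range targets {
		if !isCompleted(t.Phase) {
			kept = append(kept, t)
		}
	}
	return kept
}
```

With something like this, a list containing Running, Succeeded, Failed, and Pending pods would be reduced to just the Running and Pending ones. Whether Pending and Unknown pods should also be dropped is a separate question that this sketch leaves open.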

Actual Result

Targets for completed pods hang around until K8s itself cleans them up.

Kubernetes Version

1.21.0

Operator version

0.102.0

Collector version

latest

Environment information

No response

Log output

No response

Additional context

No response

jaronoff97 commented 1 month ago

looking into this rn, things we've discussed in the slack thread: