vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev

Wait for scrape in graceful shutdown logic of Prometheus Exporter Sink #19952


stevenmatthewt commented 6 months ago


Use Cases

Currently, the graceful shutdown implementation of the Prometheus Exporter Sink isn't particularly helpful. It waits for any in-flight HTTP requests to complete, but it doesn't wait for an additional scrape to ensure all metrics have been collected. If you're running Vector in a relatively ephemeral fashion (we run a Vector container in several of our app pods in Kubernetes), this can lead to a significant number of metric samples being dropped. When a pod begins terminating, we wait for our primary container (which emits metrics via StatsD) to shut down gracefully, then send a SIGTERM to the Vector container. We'd like that Vector container's graceful shutdown to wait until the next Prometheus scrape before exiting the process.

Attempted Solutions

For our specific example, we have Vector set up with a simple StatsD source and a Prometheus Exporter sink:

# ./vector.toml

[sources.statsd]
type = "statsd"
address = "0.0.0.0:9125"
mode = "udp"

[sinks.prometheus]
type = "prometheus_exporter"
inputs = [ "statsd" ]
address = "0.0.0.0:9090"

And I've run a Vector server with:

docker run -it \
    -p 9090:9090 \
    -p 9125:9125/udp \
    -e VECTOR_LOG=debug \
    -e "VECTOR_CONFIG_DIR=/etc/vector" \
    -v "$(pwd)/vector.toml:/etc/vector/vector.toml" \
    timberio/vector:0.30.0-distroless-libc

Then, I send some metrics via StatsD:

echo "example:1|c" | nc -w 1 -u 127.0.0.1 9125

And on terminating the Vector server, I get the following logs:

^C
2024-02-26T16:16:52.041221Z  INFO vector::signal: Signal received. signal="SIGINT"
2024-02-26T16:16:52.049671Z  INFO vector: Vector has stopped.
2024-02-26T16:16:52.058207Z DEBUG source{component_kind="source" component_id=statsd component_type=statsd}: vector::topology::builder: Source finished normally.
2024-02-26T16:16:52.059146Z DEBUG source{component_kind="source" component_id=statsd component_type=statsd}: vector::topology::builder: Source pump finished normally.
2024-02-26T16:16:52.059484Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="prometheus, statsd, print, dummy_logs, parse_logs" time_remaining="59 seconds left"
2024-02-26T16:16:52.062126Z DEBUG source{component_kind="source" component_id=statsd component_type=statsd}: vector::topology::builder: Source pump supervisor task finished normally.
2024-02-26T16:16:52.062883Z DEBUG sink{component_kind="sink" component_id=prometheus component_type=prometheus_exporter}: vector::topology::builder: Sink finished normally.
2024-02-26T16:16:52.062943Z DEBUG sink{component_kind="sink" component_id=prometheus component_type=prometheus_exporter}: hyper::server::shutdown: signal received, starting graceful shutdown
2024-02-26T16:16:52.638256Z DEBUG source{component_kind="source" component_id=dummy_logs component_type=demo_logs}: vector::topology::builder: Source finished normally.
2024-02-26T16:16:52.641751Z DEBUG source{component_kind="source" component_id=dummy_logs component_type=demo_logs}: vector::topology::builder: Source pump finished normally.
2024-02-26T16:16:52.643718Z DEBUG source{component_kind="source" component_id=dummy_logs component_type=demo_logs}: vector::topology::builder: Source pump supervisor task finished normally.
2024-02-26T16:16:52.645180Z DEBUG transform{component_kind="transform" component_id=parse_logs component_type=remap}: vector::utilization: utilization=0.01406477907200002
2024-02-26T16:16:52.653573Z DEBUG transform{component_kind="transform" component_id=parse_logs component_type=remap}: vector::topology::builder: Synchronous transform finished normally.
{"appname":"ahmadajmi","facility":"cron","hostname":"some.xn--80aswg","message":"Pretty pretty pretty good","msgid":"ID331","procid":8697,"severity":"debug","timestamp":"2024-02-26T16:16:52.626Z","version":2}
2024-02-26T16:16:52.656514Z DEBUG sink{component_kind="sink" component_id=print component_type=console}: vector::topology::builder: Sink finished normally.

The server begins the graceful shutdown process, but finishes it in under a second and doesn't wait for an additional scrape from our Prometheus server. This isn't wrong per se, but it's not particularly helpful, since you can lose up to one scrape interval of metrics.

Proposal

No response

References

https://github.com/vectordotdev/vector/issues/11405 is similar, but not specific to the Prometheus Exporter sink. I suspect the async nature of the prometheus_exporter sink is why it isn't fully covered by the fix there.

Version

vector 0.36.0 (aarch64-unknown-linux-gnu a5e48bb 2024-02-13 14:43:11.911392615)

jszwedko commented 6 months ago

Thanks @stevenmatthewt . I can see the request here, but what it means for data to be flushed in "pull-based sinks" like prometheus_exporter is a bit ambiguous. In your case, you are asking for one additional scrape, but there may be other cases where waiting for exactly one more scrape wouldn't be the right behavior.

For that reason, I don't think we'd always want to wait for just one additional scrape. Given that waiting for one more scrape doesn't seem like an uncommon case, though, I'd be open to seeing it added as an opt-in feature.
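For illustration only, such an opt-in option might be expressed on the sink roughly like this; the option names below are invented for the sake of the sketch and do not exist in Vector today:

[sinks.prometheus]
type = "prometheus_exporter"
inputs = [ "statsd" ]
address = "0.0.0.0:9090"

# Hypothetical, not implemented: on shutdown, keep serving until this many
# additional scrapes have been answered, bounded by a timeout so shutdown
# can't hang forever if no scraper ever arrives.
# shutdown_wait_for_scrapes = 1
# shutdown_wait_timeout_secs = 60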

stevenmatthewt commented 6 months ago

Yeah, I was trying to think through the various scenarios this could help with earlier and couldn't come up with a single clear solution that accounts for everything. Some sort of configuration to handle the most common scenarios would still be nice, though, leaving it open to improvement later. Honestly, even just a configurable shutdown delay in seconds would work; that's effectively what we've done already, but the Kubernetes configuration for it can be a bit of a pain.
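For context, the Kubernetes workaround mentioned above boils down to delaying the SIGTERM to the Vector sidecar with a preStop sleep. A minimal sketch, assuming a non-distroless image variant (the distroless images don't ship a sleep binary, which is part of why this is a pain) and roughly one scrape interval of delay:

# Pod spec excerpt: hold the Vector container open for about one more
# scrape interval before SIGTERM is delivered. terminationGracePeriodSeconds
# must cover the sleep plus Vector's own graceful shutdown.
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: vector
      image: timberio/vector:0.36.0-debian   # assumes a variant that includes /bin/sleep
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "30"]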