vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

`prometheus_scrape` source hangs indefinitely if the endpoint never responds #14132

Closed jszwedko closed 1 year ago

jszwedko commented 2 years ago

Problem

If the endpoint being scraped is slow to respond, Vector appears to wait for the response indefinitely. I would have expected an HTTP timeout, possibly a configurable one.

Or maybe it should time out using scrape_interval_secs?

Configuration

[sources.source0]
endpoints = ["http://localhost:8000/metrics"]
scrape_interval_secs = 5
type = "prometheus_scrape"

[sinks.sink0]
inputs = ["source0"]
target = "stdout"
type = "console"

[sinks.sink0.encoding]
codec = "json"

Version

vector 0.23.0

Debug Output

2022-08-26T21:23:46.786321Z  INFO vector::app: Log level is enabled. level="vector=info,codec=info,vrl=info,file_source=info,tower_limit=trace,rdkafka=info,buffers=info,kube=info"
2022-08-26T21:23:46.786497Z  INFO vector::app: Loading configs. paths=["/tmp/tmp.toml"]
2022-08-26T21:23:46.794367Z  INFO vector::topology::running: Running healthchecks.
2022-08-26T21:23:46.794696Z  INFO vector::topology::builder: Healthcheck: Passed.
2022-08-26T21:23:46.795324Z  INFO vector: Vector has started. debug="false" version="0.23.0" arch="aarch64" build_id="none"
2022-08-26T21:23:46.795615Z  INFO vector::app: API is disabled, enable by setting `api.enabled` to `true` and use commands like `vector top`.

Example Data

Test server:

import http.server
import socketserver
from time import sleep

PORT = 8000
SLEEP_TIME = 6000  # seconds to stall before answering each request

class SlowHandler(http.server.SimpleHTTPRequestHandler):
    def do_GET(self):
        # Stall so that, in practice, the scraper never sees a response.
        sleep(SLEEP_TIME)
        super().do_GET()

# Files are served from the current directory, so ./metrics answers /metrics.
with socketserver.TCPServer(("", PORT), SlowHandler) as httpd:
    print("serving at port", PORT)
    httpd.serve_forever()

With a dummy file at ./metrics to serve as the response.
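
For comparison, here is a minimal sketch of the behavior I expected, written as a standalone Python client rather than anything taken from Vector. The function name `scrape_once` and the parameter `timeout_secs` are made up for illustration; only `urllib.request.urlopen` and its `timeout` argument are real standard-library API.

```python
import socket
import urllib.error
import urllib.request
from typing import Optional


def scrape_once(url: str, timeout_secs: float) -> Optional[str]:
    """Fetch `url` once and return the body as text.

    Returns None if the request times out or fails with a URLError
    (which also covers HTTP error statuses, since HTTPError subclasses it).
    `scrape_once` and `timeout_secs` are hypothetical names for this sketch.
    """
    try:
        # `timeout` bounds each blocking socket operation (connect, read),
        # so a server that never answers cannot stall the caller forever.
        with urllib.request.urlopen(url, timeout=timeout_secs) as response:
            return response.read().decode("utf-8")
    except (urllib.error.URLError, socket.timeout):
        # Connect-phase failures arrive wrapped in URLError; a read that
        # stalls surfaces as socket.timeout (TimeoutError on Python 3.10+).
        return None
```

Pointed at the test server above, `scrape_once("http://localhost:8000/metrics", 5)` should give up after roughly five seconds and return `None` instead of blocking for the full `SLEEP_TIME`.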

zamazan4ik commented 2 years ago

Since Vector internally uses hyper as its HTTP client, this workaround can be used to implement the timeout: https://github.com/hyperium/hyper/issues/1097#issuecomment-287633760
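
The linked comment is not reproduced here. One common way to bound a request is to race it against a timer and drop the request if the timer wins. The sketch below shows that pattern in Python's asyncio as an analogy only; it is not hyper or Vector code, and `slow_scrape` and `scrape_with_timeout` are hypothetical names, with a sleep standing in for the real HTTP call.

```python
import asyncio
from typing import Optional


async def slow_scrape(delay_secs: float) -> str:
    """Stand-in for an HTTP request that takes `delay_secs` to answer."""
    await asyncio.sleep(delay_secs)
    return "metric_a 1\n"


async def scrape_with_timeout(delay_secs: float, timeout_secs: float) -> Optional[str]:
    """Return the scrape body, or None if it does not finish in time."""
    try:
        # wait_for cancels the inner coroutine once the deadline passes,
        # so an endpoint that never answers cannot hold the task open.
        return await asyncio.wait_for(slow_scrape(delay_secs), timeout=timeout_secs)
    except asyncio.TimeoutError:
        return None
```

For example, `asyncio.run(scrape_with_timeout(6000, 5))` would return `None` after about five seconds, while a fast endpoint returns its body unchanged.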

imrebuild commented 1 year ago

This also causes a long wait when stopping Vector.

ERROR vector_common::shutdown: Source 'xxx' failed to shutdown before deadline. Forcing shutdown.
ERROR vector::topology::running: Failed to gracefully shut down in time. Killing components. components="xxx, grafana_com_prometheus

Scraping also sometimes seems to stop working after the timeout. I didn't see any data even after the endpoint came back online.
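
To illustrate what I would expect instead, here is a rough Python sketch of a scrape loop in which one timed-out attempt does not prevent later attempts. It is not based on Vector's source; `run_scrapes`, `fetch`, `interval_secs`, and `rounds` are names invented for the example, and `fetch` is assumed to raise `TimeoutError` when the endpoint does not answer in time.

```python
import time
from typing import Callable, List, Optional


def run_scrapes(
    fetch: Callable[[], str],
    interval_secs: float,
    rounds: int,
) -> List[Optional[str]]:
    """Call `fetch` once per interval and collect the results.

    A timed-out attempt is recorded as None and the loop carries on,
    so data flows again as soon as the endpoint recovers.
    """
    results: List[Optional[str]] = []
    for _ in range(rounds):
        started = time.monotonic()
        try:
            results.append(fetch())
        except TimeoutError:
            # Record the miss, but keep the schedule alive.
            results.append(None)
        # Sleep only for whatever is left of this interval.
        remaining = interval_secs - (time.monotonic() - started)
        if remaining > 0:
            time.sleep(remaining)
    return results
```

With a `fetch` that times out twice and then succeeds, the result list has two `None` entries followed by the recovered body.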