stripe / veneur

A distributed, fault-tolerant pipeline for observability data

Veneur facing client timeout for large metric count #1054

Open datsabk opened 1 year ago

datsabk commented 1 year ago

Hello. Despite all our efforts, we are stuck with the error below whenever Veneur publishes a high metric count.

time="2023-04-06T14:59:52Z" level=debug msg="Worker count chosen" metric_sink=datadog workers=4
time="2023-04-06T14:59:52Z" level=debug msg="Chunk size chosen" chunkSize=22197 metric_sink=datadog
time="2023-04-06T14:59:56Z" level=debug msg="POSTed successfully" action=flush endpoint="http://10.30.28.159:8282/api/v1/series?api_key=xxx" metric_sink=datadog request_headers="map[Content-Encoding:[deflate] Content-Type:[application/json] Ot-Tracer-Sampled:[true] Ot-Tracer-Spanid:[3773ac420e1c926a] Ot-Tracer-Traceid:[7a7977bdb348250a]]" request_length=491754 response= response_headers="map[Content-Length:[0] Date:[Thu, 06 Apr 2023 14:59:56 GMT]]" status="200 OK"
time="2023-04-06T15:00:01Z" level=warning msg="Could not execute request" action=flush error="context deadline exceeded (Client.Timeout exceeded while awaiting headers)" host="10.30.28.159:8282" metric_sink=datadog path=/api/v1/series
time="2023-04-06T15:00:01Z" level=warning msg="Could not execute request" action=flush error="context deadline exceeded (Client.Timeout exceeded while awaiting headers)" host="10.30.28.159:8282" metric_sink=datadog path=/api/v1/series
time="2023-04-06T15:00:01Z" level=warning msg="Could not execute request" action=flush error="context deadline exceeded (Client.Timeout exceeded while awaiting headers)" host="10.30.28.159:8282" metric_sink=datadog path=/api/v1/series
time="2023-04-06T15:00:01Z" level=info msg=flushed metric_sink=datadog metrics=88786
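For reference, the numbers in this log are consistent with the flush being split evenly across workers: 88786 metrics over 4 workers gives a chunk size of 22197, and the log shows four requests in total (one successful POST and three timeouts). A minimal sketch of that arithmetic, assuming the chunk size is a ceiling division of the metric count by the worker count; `chunkSizeFor` is a hypothetical helper written for illustration, not a function taken from Veneur:

```go
package main

import "fmt"

// chunkSizeFor is a hypothetical helper (not Veneur's API). It returns the
// number of metrics each worker would send if the batch were split as evenly
// as possible, i.e. ceil(metrics / workers) using integer arithmetic.
func chunkSizeFor(metrics, workers int) int {
	if metrics <= 0 || workers <= 0 {
		return 0
	}
	return (metrics-1)/workers + 1
}

func main() {
	// Values taken from the log above: metrics=88786, workers=4.
	fmt.Println(chunkSizeFor(88786, 4)) // prints 22197, matching chunkSize in the log
}
```

If that assumption holds, each of the four requests carries roughly the same payload, so the request_length=491754 reported for the successful POST is probably representative of the three that timed out.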

We run a local Vector instance that consumes the Datadog metrics through Vector's datadog_agent source. With a smaller metric count, everything runs smoothly.

With a large metric count, the first few requests are sent successfully (how many depends on max_flush_body_size) and then the remaining requests start failing with the client timeout error shown above. I went through #560 and understand that no fix has been tried for this so far.

My question is simple: what is really causing this timeout, Veneur or the destination?