grobian / carbon-c-relay

Enhanced C implementation of Carbon relay, aggregator and rewriter
Apache License 2.0

Dropped metrics logic #436

Closed: Rohlik closed this issue 4 months ago

Rohlik commented 3 years ago

Hello @grobian, We have tens of servers with the same carbon-c-relay config and still some of them have problem and dropping metrics:

[2021-06-22 13:28:23] (MSG) warning: metrics queuing up for proxy10:2443: 25000 metrics (100% of queue size)
[2021-06-22 13:28:23] (MSG) warning: dropped 108 metrics
[2021-06-22 13:29:23] (MSG) warning: metrics queuing up for proxy10:2443: 25000 metrics (100% of queue size)
[2021-06-22 13:29:23] (MSG) warning: dropped 113 metrics
[2021-06-22 13:30:23] (MSG) warning: metrics queuing up for proxy10:2443: 25000 metrics (100% of queue size)
[2021-06-22 13:30:23] (MSG) warning: dropped 112 metrics
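For context, the 25000 figure in these warnings matches carbon-c-relay's default per-server send queue size, which should be tunable via the -q queuesize command-line option described in the relay's usage output. A minimal invocation sketch raising it (the binary name, config path, and the 100000 value are illustrative assumptions, not from this thread):

carbon-c-relay -f /etc/carbon-c-relay.conf -q 100000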

Logs during the carbon-c-relay restart:

[2021-06-22 13:31:17] (MSG) shutting down...
[2021-06-22 13:31:17] (MSG) closed listener for tcp 0.0.0.0:2003
[2021-06-22 13:31:18] (MSG) stopped collector
[2021-06-22 13:31:18] (MSG) stopped aggregator
[2021-06-22 13:31:18] (MSG) stopped worker 1 2 3 4 5
[2021-06-22 13:31:19] (MSG) any_of cluster pending 25000 metrics (with 0 failed nodes)
[2021-06-22 13:31:19] (MSG) any_of cluster pending 25000 metrics (with 0 failed nodes)
[2021-06-22 13:31:19] (MSG) any_of cluster pending 25000 metrics (with 0 failed nodes)
[2021-06-22 13:31:20] (MSG) any_of cluster pending 25000 metrics (with 0 failed nodes)
[2021-06-22 13:31:20] (MSG) any_of cluster pending 25000 metrics (with 0 failed nodes)
...

Client config:

cluster to_proxy
  any_of
    proxy07:2443 transport plain ssl
    proxy08:2443 transport plain ssl
    proxy09:2443 transport plain ssl
    proxy10:2443 transport plain ssl
  ;

We run carbon-c-relay on both the client side and the remote proxy servers.

I don't fully understand why the client carbon-c-relay doesn't try to send metrics to a different destination instead of dropping them. Can you explain it to me?

PS: An interesting detail is that client node XY has problems only with remote server proxy10 (the first log snippet), while client node XZ has problems only with proxy08.
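For reference, the cluster semantics documented for carbon-c-relay suggest one possible reading of this: an any_of cluster balances load by consistently hashing each metric name to one member, and it only redistributes a member's share when that member is marked as failed, so a destination that is up but draining too slowly keeps receiving its metrics and its queue can fill. A hypothetical variant of the same cluster using the documented failover type, which always prefers the first reachable member instead, would look like:

cluster to_proxy
  failover
    proxy07:2443 transport plain ssl
    proxy08:2443 transport plain ssl
    proxy09:2443 transport plain ssl
    proxy10:2443 transport plain ssl
  ;

This is a sketch of the documented syntax, not advice from this thread; failover trades the load distribution of any_of for strict member ordering.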

grobian commented 3 years ago

Do you have stats about queue sizes on the other targets?

Rohlik commented 3 years ago

Yes, I have. This is a graph from that day (2021-06-22), and it shows that during that time (~13:30) the other carbon-c-relays had queue sizes below the limit (25k). graph

grobian commented 3 years ago

Can you zoom in, e.g. from 13:20 to 13:40, to show the queue sizes? Not that it would prove anything more: what you're looking at are samples, and any peak in pressure could cause this. It feels, though, as if the other queues should have handled the excess.

Rohlik commented 3 years ago

Here we go: image

grobian commented 4 months ago

It seems the relay cannot push out metrics fast enough. It wouldn't surprise me if TLS adds some overhead here.

I wonder if the relay is disconnecting here, which would mean it keeps reconnecting. Is your workload batch-driven, or should there be a constant influx of metrics?
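To illustrate the distinction, a constant load here means a steady stream of lines in the Graphite plaintext protocol ("metric value timestamp"). A minimal Python sketch of such a feed, assuming a relay listening on the tcp 2003 port seen in the shutdown log (host, metric name, and rate are placeholders):

#!/usr/bin/env python3
# Minimal constant-rate feed using the Graphite plaintext protocol:
# one "metric value timestamp\n" line per second over a single TCP
# connection.
import socket
import time

HOST, PORT = "localhost", 2003  # assumed relay listener, not from this thread

with socket.create_connection((HOST, PORT)) as sock:
    while True:
        line = "test.constant.load 1 %d\n" % int(time.time())
        sock.sendall(line.encode("ascii"))
        time.sleep(1.0)

A batch-driven workload, by contrast, would open a connection, flush many lines at once, and disconnect, which matches the reconnect pattern asked about above.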

Rohlik commented 4 months ago

I don't have access to that environment anymore 😐, but I can say that the workload is constant for sure.

grobian commented 4 months ago

OK, thanks. I'll close this ticket then, since we can no longer reproduce or confirm the issue.