grobian / carbon-c-relay

Enhanced C implementation of Carbon relay, aggregator and rewriter
Apache License 2.0

Behavior of any_of doesn't seem in line with docs #446

Closed reyjrar closed 2 years ago

reyjrar commented 2 years ago

Potentially this is my misunderstanding, but here's what the docs say:

As any_of suggests, when any of the members become unreachable, the remaining available members will immediately receive the full input stream of metrics.

My Configuration

Version: 3.7.4

Local carbon-c-relay on every host listening on localhost:2003

cluster relays
    any_of
        collector1:2003
        collector2:2003
    ;

match * send to relays;

Those upstream relays handle getting the data down to go-carbon. I'm currently performing maintenance on those relays: I need to upgrade firmware, and they'll be down for a few hours. I'm seeing unexpected behavior from any_of on these local instances.

In this case, I'm shutting down collector1.

What I Expect

Hosts attempt to send metrics to collector1, notice it's no longer with us, and instead divert all metrics to collector2.

What Happens

Local relays queue metrics destined for collector1. Based on my observations, routing appears to be sticky by metric name, i.e. my.metric.name.whatever hashes to collector1, so the relay just adds that metric to collector1's queue. Those metrics are never re-routed to the up-and-alive collector2.

This means that some metrics are missing until collector1 comes back online and the queue is flushed, which leaves most graphs showing incomplete data while the maintenance is happening.

Again, this could be my misunderstanding, but I was hoping to get the benefit of failover with a sprinkling of load balancing. That doesn't appear to be the case. There are a few ways I could work around this (rough config sketches after this list):

  1. Use failover instead of any_of; the downside is that I'd be sending all data to a single node instead of load-balancing across multiple nodes for the 99.9999% of the time that both nodes are available.
  2. Use forward instead of any_of; the downside is that I'd need to rearchitect the rest of my carbon-c-relay instances, and it also doubles the network traffic.
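
For illustration, here's roughly what those two workarounds would look like as cluster definitions, reusing the collectors from my config above (sketches only, untested):

cluster relays
    failover
        collector1:2003
        collector2:2003
    ;

cluster relays
    forward
        collector1:2003
        collector2:2003
    ;

With failover, collector2 would only receive metrics while collector1 is down; with forward, every metric is duplicated to both collectors, which is where the doubled network traffic comes from.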

But I'm wondering whether this is a bug or my misunderstanding. FWIW, I just upgraded from v2.2 (yes, I know, I'm sorry) to v3.7.4, and the behavior is the same in both versions.

Also, hi @grobian ! :)

grobian commented 2 years ago

Hi @reyjrar (or can I just say 'Brad' :) )

This sounds like something going wrong here. any_of is basically the same as fnv1a_ch, but it should cope fine with a missing member. There is a provision, though, that it doesn't act immediately, e.g. so the queue can absorb a short interruption. It seems the code actually waits until its queue reaches a critical size, which is not quite what you want here. The wait time is 1.5s, which I assume is OK for your scenario, but after that it should push all metrics away. What is the queue load on this relay? Do you have the queues filled up when this happens?
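
For reference, an any_of cluster hashes much like an explicit fnv1a_ch cluster, so your config is roughly equivalent to the sketch below. The intended difference, per the docs you quoted, is that any_of redistributes the stream when a member drops, while fnv1a_ch keeps queueing for the failed member:

cluster relays
    fnv1a_ch
        collector1:2003
        collector2:2003
    ;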

Looks like we'd better stop sending metrics to any failed node too; I think this needs some work on my side.

reyjrar commented 2 years ago

OK, I dug deeper into the metrics, and it looks like the local relay is behaving. There was a rendering issue on my Grafana dashboard that made it look like the queue was only growing. I don't know why that happened, but after reading your explanation I dug in deeper, and you are correct. It looks like the local relay flushes the queues to the other collector roughly every 30s, so my problem is elsewhere :)

Sorry for the false positive. I'll trace things more in depth today. The upgrade to 3.7.4 gave me better visibility into the metrics thanks to the "send statistics to cluster" config, which is awesome.
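
For anyone else reading along, this is roughly the stanza I mean; the 60-second interval and the target cluster are just examples:

statistics
    submit every 60 seconds
    send to relays
    ;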

Thanks Fabian! :)

grobian commented 2 years ago

Good to hear that it eventually behaves like it should, but from the code review I did yesterday I think it can be improved somewhat.