Closed ecsumed closed 5 years ago
I don't quite get what that comment means. Or the behaviour that it controls.
c-relay does support consistent hashing + replication, in which case when a node goes down, points aren't lost. It costs more disk space, which is what the "expensive" remark refers to, I believe.
To be a bit more clear, I understand that the relay queues up the metrics if an endpoint is down. I also understand that such a feature defeats the purpose of consistent hashing. But dropped metrics after a queue reaches its capacity are a bigger problem than the metrics ending up on the wrong node.
If you want non-deliverable data to be delivered to a random node, then you just have to use a non-consistent hash, which fails over. any_of is such a cluster type: it uses the fnv1a hash (same as the fnv1a_ch hash distribution), but when a destination is unavailable, the contents are offloaded to other servers (thus, any of the servers).
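To make the "hash, then fail over" idea concrete, here is a minimal Python sketch. It is not carbon-c-relay's code: the 32-bit FNV-1a function is the standard algorithm, but the modulo placement, the `pick_server` helper, and the rule of trying the next server in list order are assumptions made purely for illustration.

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # standard 32-bit FNV-1a offset basis
FNV_PRIME = 0x01000193         # standard 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return h


def pick_server(metric, servers, available):
    """Hypothetical any_of-style selection (an assumption, not the relay's code).

    The hash picks a preferred server; unavailable servers are skipped
    in list order until a live one is found.
    """
    start = fnv1a_32(metric.encode()) % len(servers)
    for offset in range(len(servers)):
        candidate = servers[(start + offset) % len(servers)]
        if candidate in available:
            return candidate
    return None  # every server is down


servers = ["a", "b", "c"]
metric = "cg.host-1.metric-1"
preferred = pick_server(metric, servers, available={"a", "b", "c"})
fallback = pick_server(metric, servers, available={"a", "b", "c"} - {preferred})
print(preferred, fallback)
```

While all servers are up, a metric always lands on the same one; once the preferred server is marked unavailable, the same metric is handed to a different server instead of being queued.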
Aha, that's exactly what I needed. I didn't know any_of used the fnv1a hash (I guess I should have tested this). Anyways, tested this and it works exactly as I wanted. Thanks for the help.
It is (in not that many words) in the CLUSTERS section of the README. It mentions that it attempts to deliver consistently, but not that it uses fnv1a. I didn't document that because I wanted to keep the freedom to change this in the future. :)
Slight inconvenience with this, posting here for posterity.
While it does seem to use fnv1a, the order is out of sorts and, hence, does not play well with buckytools. So after a failover, once a node does come back up, a rebalance will be really difficult.
fyi @grobian
Yes, that's the exact reason for consistent hashing.
So, I might be missing something here. What exactly do you want to see happening on node failure, and how do you think that should be resolved afterwards?
@grobian Apologies for not being clear.
I perfectly understand the randomness after the failover. The problem is that even before the failover the metrics are out of order.
Consider this, 3 nodes, 3 metrics:
cluster graphite
fnv1a_ch replication 1
<IP-A>:2003=a
<IP-B>:2003=b
<IP-C>:2003=c
;
Results in,
node a
/mnt/whisper/cg/host-2/metric-2.wsp
/mnt/whisper/cg/host-3/metric-1.wsp
node b
/mnt/whisper/cg/host-2/metric-3.wsp
/mnt/whisper/cg/host-3/metric-3.wsp
/mnt/whisper/cg/host-1/metric-1.wsp
node c
/mnt/whisper/cg/host-2/metric-1.wsp
/mnt/whisper/cg/host-3/metric-2.wsp
/mnt/whisper/cg/host-1/metric-3.wsp
/mnt/whisper/cg/host-1/metric-2.wsp
But this:
cluster graphite
any_of
<IP-A>:2003=a
<IP-B>:2003=b
<IP-C>:2003=c
;
Results in
node a
/mnt/whisper/cg/host-2/metric-3.wsp
/mnt/whisper/cg/host-3/metric-2.wsp
/mnt/whisper/cg/host-1/metric-1.wsp
/mnt/whisper/cg/host-1/metric-2.wsp
node b
/mnt/whisper/cg/host-2/metric-1.wsp
/mnt/whisper/cg/host-3/metric-1.wsp
node c
/mnt/whisper/cg/host-2/metric-2.wsp
/mnt/whisper/cg/host-3/metric-3.wsp
/mnt/whisper/cg/host-1/metric-3.wsp
Buckytools complains about this second distribution:
<IP-B>:4242: cg.host-2.metric-1
<IP-B>:4242: cg.host-3.metric-1
<IP-A>:4242: cg.host-1.metric-1
<IP-A>:4242: cg.host-1.metric-2
<IP-A>:4242: cg.host-2.metric-3
<IP-A>:4242: cg.host-3.metric-2
<IP-C>:4242: cg.host-2.metric-2
<IP-C>:4242: cg.host-3.metric-3
Under ideal conditions (i.e. before any failover has taken place), shouldn't the metrics distribution match the fnv1a_ch cluster?
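The difference between the two listings above can be checked mechanically. The sketch below copies the placements from this thread and reports every metric that any_of put on a different node than fnv1a_ch did. The helper names `invert` and `misplaced` are hypothetical, and this is not how buckytools itself works.

```python
# Placements copied from the two listings in this thread.
fnv1a_ch = {
    "a": ["cg.host-2.metric-2", "cg.host-3.metric-1"],
    "b": ["cg.host-2.metric-3", "cg.host-3.metric-3", "cg.host-1.metric-1"],
    "c": ["cg.host-2.metric-1", "cg.host-3.metric-2",
          "cg.host-1.metric-3", "cg.host-1.metric-2"],
}
any_of = {
    "a": ["cg.host-2.metric-3", "cg.host-3.metric-2",
          "cg.host-1.metric-1", "cg.host-1.metric-2"],
    "b": ["cg.host-2.metric-1", "cg.host-3.metric-1"],
    "c": ["cg.host-2.metric-2", "cg.host-3.metric-3", "cg.host-1.metric-3"],
}


def invert(placement):
    """Turn {node: [metrics]} into {metric: node}."""
    return {m: node for node, metrics in placement.items() for m in metrics}


def misplaced(expected, actual):
    """Return sorted (metric, found_on, expected_on) for every mismatch."""
    want = invert(expected)
    return sorted(
        (metric, node, want[metric])
        for metric, node in invert(actual).items()
        if want[metric] != node
    )


for metric, found_on, expected_on in misplaced(fnv1a_ch, any_of):
    print(f"{metric}: on node {found_on}, fnv1a_ch would place it on {expected_on}")
```

With these listings, eight of the nine metrics sit on a different node, which matches the eight lines buckytools reports; only cg.host-1.metric-3 lands on node c in both layouts.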
I think we need a new cluster type or flag for this behaviour, because any_of doesn't take the alias part of targets or something, it uses a different strategy based on the order in which the hosts are added.
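The distinction between placement keyed on the alias and placement keyed on list position can be sketched as follows. This is a generic consistent-hash ring, not carbon-c-relay's actual fnv1a_ch ring layout: the number of points per node, the `"<i>-<label>"` point-key format, and all helper names are assumptions for illustration.

```python
import bisect


def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a hash."""
    h = 0x811C9DC5
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


def build_ring(labels, points_per_node=100):
    """Ring positions depend only on each node's label (assumed key format)."""
    return sorted(
        (fnv1a_32(f"{i}-{label}".encode()), label)
        for label in labels
        for i in range(points_per_node)
    )


def ring_lookup(ring, metric):
    """Pick the first ring point at or after the metric's hash, wrapping around."""
    idx = bisect.bisect_left(ring, (fnv1a_32(metric.encode()), ""))
    return ring[idx % len(ring)][1]


def modulo_lookup(servers, metric):
    """Position-based placement: the result depends on list order."""
    return servers[fnv1a_32(metric.encode()) % len(servers)]


metrics = [f"cg.host-{h}.metric-{m}" for h in (1, 2, 3) for m in (1, 2, 3)]
ordered = ["a", "b", "c"]
reordered = ["c", "a", "b"]  # same nodes, listed in a different order

ring_same = all(
    ring_lookup(build_ring(ordered), m) == ring_lookup(build_ring(reordered), m)
    for m in metrics
)
modulo_same = all(
    modulo_lookup(ordered, m) == modulo_lookup(reordered, m) for m in metrics
)
print("ring placement unchanged by reordering:", ring_same)
print("modulo placement unchanged by reordering:", modulo_same)
```

The ring gives the same answer regardless of how the nodes are listed, because only the labels feed into it. The modulo variant changes as soon as the list order changes, which is the kind of mismatch a tool that assumes label-based placement would flag.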
Yes, in the logs, the relay would strip away the node alias/label. This
cluster graphite
any_of
<IP-A>:2003=a
<IP-B>:2003=b
<IP-C>:2003=c
;
would become
cluster graphite
any_of
<IP-A>:2003
<IP-B>:2003
<IP-C>:2003
;
This would be great to have. Can this be marked as an "enhancement"?
is there a way you could test the above commit?
@grobian Yes! Thanks. I'll test it and let you know.
@grobian The fallback doesn't work.
I get a "(ERR) failed to resolve dynamic:2003, server unavailable" error every time metrics are sent to the relay.
This is my conf
cluster graphite
fnv1a_ch dynamic
<IP-A>:2003=a
<IP-B>:2003=b
<IP-C>:2003=c
;
which turns into
cluster graphite
fnv1a_ch replication 1
dynamic:2003
<IP-A>:2003=a
<IP-B>:2003=b
<IP-C>:2003=c
;
These are the logs
[2019-06-18 15:56:08] (MSG) listening on tcp4 0.0.0.0 port 2003
[2019-06-18 15:56:08] (MSG) listening on tcp6 :: port 2003
[2019-06-18 15:56:08] (MSG) listening on udp4 0.0.0.0 port 2003
[2019-06-18 15:56:08] (MSG) listening on udp6 :: port 2003
[2019-06-18 15:56:08] (MSG) listening on UNIX socket /tmp/.s.carbon-c-relay.2003
[2019-06-18 15:56:08] (MSG) starting 1 workers
[2019-06-18 15:56:08] (MSG) starting statistics collector
[2019-06-18 15:56:08] (MSG) starting servers
[2019-06-18 15:56:08] (MSG) startup sequence complete
[2019-06-18 15:58:38] (ERR) failed to resolve dynamic:2003, server unavailable
...
[2019-06-18 15:59:41] (ERR) failed to resolve dynamic:2003, server unavailable
[2019-06-18 16:07:21] (ERR) failed to connect() for <IP-C>:2003: Connection refused
[2019-06-18 16:07:25] (ERR) failed to resolve dynamic:2003, server unavailable
...
It's treating the dynamic flag as a cluster host.
I have the impression you didn't get the latest changes or something.
cluster foo fnv1a_ch dynamic bar bla;
turns into
cluster foo
fnv1a_ch replication 1 dynamic
bar:2003
bla:2003
;
here. As a matter of fact, your example:
cluster graphite
fnv1a_ch replication 1 dynamic
ip-a:2003=a
ip-b:2003=b
ip-c:2003=c
;
% ./relay -v
carbon-c-relay v3.5 (dccf8e)
enabled support for: gzip
regular expressions library: libc
Huh, how embarrassing. Anyways, I've tested it with the latest commit now and it works as intended. It fails over to another server until the problematic node comes back up. And I can buckytools away the inconsistencies.
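The behaviour reported here (a metric moves to another server only while its own node is down, and returns afterwards) can be sketched with a ring walk. This is a generic illustration under stated assumptions, not the relay's implementation of the dynamic flag: the ring layout, the `"<i>-<label>"` point keys, and the `lookup_with_fallback` helper are all hypothetical.

```python
import bisect


def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a hash."""
    h = 0x811C9DC5
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


def build_ring(labels, points_per_node=100):
    """Generic consistent-hash ring keyed on node labels (assumed layout)."""
    return sorted(
        (fnv1a_32(f"{i}-{label}".encode()), label)
        for label in labels
        for i in range(points_per_node)
    )


def lookup_with_fallback(ring, metric, down=frozenset()):
    """Walk the ring from the metric's position, skipping nodes that are down."""
    start = bisect.bisect_left(ring, (fnv1a_32(metric.encode()), ""))
    for step in range(len(ring)):
        label = ring[(start + step) % len(ring)][1]
        if label not in down:
            return label
    return None  # all nodes are down


ring = build_ring(["a", "b", "c"])
metrics = [f"cg.host-{h}.metric-{m}" for h in (1, 2, 3) for m in (1, 2, 3)]

healthy = {m: lookup_with_fallback(ring, m) for m in metrics}
degraded = {m: lookup_with_fallback(ring, m, down={"b"}) for m in metrics}
recovered = {m: lookup_with_fallback(ring, m) for m in metrics}

moved = sorted(m for m in metrics if healthy[m] != degraded[m])
print("moved while b was down:", moved)
print("placement restored after recovery:", recovered == healthy)
```

Only metrics whose home node is down get rerouted, and the original placement is back as soon as the node returns. The data written elsewhere in the meantime is what a rebalancing tool then has to move back.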
@grobian Thanks for pushing this through. I owe you a cold one.
let me know if you find issues with this, if not I'll close this issue in a couple of days
I've a need to use a dynamic router for the exact same reason as stated above from the python version of the relay's config. Is there a way to achieve this functionality with c-relay? I think it's not possible, but just wanted to be sure.