grobian / carbon-c-relay

Enhanced C implementation of Carbon relay, aggregator and rewriter
Apache License 2.0

Consistent hash balance with short instance names #419

Closed fubarwrangler closed 4 years ago

fubarwrangler commented 4 years ago

I'm finding some issues with balancing using fnv1a_ch. I have a relay running locally that forwards to several carbon-cache.py daemons, configured as follows (old config):

cluster default-local
  fnv1a_ch
    127.0.0.1:2103=a
    127.0.0.1:2105=b
    127.0.0.1:2107=c
    127.0.0.1:2109=d
    127.0.0.1:2111=e;
match
  *
  send to default-local;
listen
  type linemode
    3003 proto tcp
    3003 proto udp;

When I look at the number of metrics forwarded to each cache (a-e), they are very unbalanced. See the left region of the relay-balance graph, where the first cache (port 2103) gets double the metrics of the others.

After changing the instance tags to random strings (new config):

    127.0.0.1:2103=AF26iXLpC2IG
    127.0.0.1:2105=Zly0sKvWSfbv
    127.0.0.1:2107=0usDLUQ5BF9R
    127.0.0.1:2109=9FWxa6UKNBrl
    127.0.0.1:2111=DiIHdgOh7AtM

Things balance much better, as seen on the right side of the image above.

Is this expected behavior? (I'm running 3.7.2 on RHEL 7)

fubarwrangler commented 4 years ago

One other (possibly relevant) point: simply changing to any_of instead of fnv1a_ch, even with the short tags (a-e), makes the behavior balanced again (like the right side of the image above).

grobian commented 4 years ago

fnv1a_ch takes the part after = (the tag) as the hash key, so yes, a longer, random string is going to give you better balancing than single-character strings.

any_of doesn't use the tags and is basically the same as fnv1a_ch without tags, but it doesn't care so much about unavailable targets. For your usage scenario, you want to use any_of, not a consistent hash (_ch).
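
To illustrate why the tag matters, here is a minimal Python sketch of a consistent-hash ring whose node positions are derived only from the tag. It is not the relay's actual code: the `"<replica>-<tag>"` key format, the 100 replicas per node, and the 16-bit ring are assumptions made for this illustration, and `fnv1a_16` and `ring_shares` are hypothetical helper names.

```python
FNV_OFFSET = 0x811C9DC5   # standard 32-bit FNV-1a offset basis
FNV_PRIME = 0x01000193    # standard 32-bit FNV-1a prime
RING_SIZE = 1 << 16       # assumption: positions live on a 16-bit ring


def fnv1a_16(key: str) -> int:
    """32-bit FNV-1a of key, XOR-folded down to 16 bits."""
    h = FNV_OFFSET
    for byte in key.encode("ascii"):
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF
    return (h >> 16) ^ (h & 0xFFFF)


def ring_shares(tags, replicas=100):
    """Return the fraction of the ring owned by each tag.

    Each tag is placed on the ring `replicas` times; a point owns the
    arc between the previous point and itself.
    """
    points = sorted(
        (fnv1a_16(f"{i}-{tag}"), tag)   # assumed key format
        for tag in tags
        for i in range(replicas)
    )
    owned = {tag: 0 for tag in tags}
    prev = points[-1][0] - RING_SIZE    # wrap around from the last point
    for pos, tag in points:
        owned[tag] += pos - prev
        prev = pos
    return {tag: owned[tag] / RING_SIZE for tag in tags}


if __name__ == "__main__":
    short = ["a", "b", "c", "d", "e"]
    long_ = ["AF26iXLpC2IG", "Zly0sKvWSfbv", "0usDLUQ5BF9R",
             "9FWxa6UKNBrl", "DiIHdgOh7AtM"]
    for name, tags in (("short", short), ("long", long_)):
        print(name, {t: round(s, 3) for t, s in ring_shares(tags).items()})
```

Running it prints the share of the ring each tag would own under these assumptions, which gives a rough way to compare candidate tag sets before deploying them. The real relay may place nodes differently, so treat the numbers as indicative only.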

fubarwrangler commented 4 years ago

In graphite-web's local_settings I have CARBONLINK_HOSTS set to the cache-query ports of each carbon instance, with the tags in place (matching the tags above), and CARBONLINK_HASHING_TYPE=fnv1a_ch. If I switch to any_of without tags, won't graphite-web lose the ability to find metrics in the non-flushed cache of each carbon daemon (since the port numbers differ, I assume the hashing will not be correct)?
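
For reference, the setup described above might look roughly like the following local_settings.py fragment. The cache-query port numbers are hypothetical (the thread does not state them), and `TAGS` and `CACHE_QUERY_PORTS` are helper names introduced only for this sketch; the tags are the ones from the new relay config.

```python
# local_settings.py (graphite-web) -- illustrative sketch only

# Instance tags, in the same order as the relay's cluster definition.
TAGS = [
    "AF26iXLpC2IG",
    "Zly0sKvWSfbv",
    "0usDLUQ5BF9R",
    "9FWxa6UKNBrl",
    "DiIHdgOh7AtM",
]

# Hypothetical cache-query ports, one per carbon-cache instance.
CACHE_QUERY_PORTS = [7002, 7102, 7202, 7302, 7402]

# Entries use the "host:port:instance" form; the instance part must match
# the tag after "=" in the relay config for lookups to hash the same way.
CARBONLINK_HOSTS = [
    f"127.0.0.1:{port}:{tag}"
    for port, tag in zip(CACHE_QUERY_PORTS, TAGS)
]

CARBONLINK_HASHING_TYPE = "fnv1a_ch"
```

The point of the sketch is that the instance names, not the ports, tie graphite-web's lookups to the relay's routing, so both sides need the same tags.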

PS: Thanks for taking the time, it's much appreciated!

grobian commented 4 years ago

Hmm, that may very well be the case (that entries in the non-flushed cache won't be found), but then you ought to use longer, more unique tags.

fubarwrangler commented 4 years ago

Right, that's what I was thinking. I'm just using the first 6 chars of the md5 sum of a..z instead. I was confused because I saw simple labels used in the examples and thought the hash behavior would depend on both the incoming metric and the instance string, which I would naively expect to distribute better...

Anyway, if this is expected behavior then feel free to close this, but perhaps consider adding a warning in the docs (like in #420).
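
One possible reading of the md5-based tags mentioned above, as a small Python sketch: hash each single-letter label and keep the first 6 hex characters. The helper name `md5_tag` is made up for this example, and the exact way the tags were generated in the thread is not stated, so this is an assumption.

```python
import hashlib
import string


def md5_tag(label: str, length: int = 6) -> str:
    """Return the first `length` hex chars of the MD5 digest of label."""
    return hashlib.md5(label.encode("ascii")).hexdigest()[:length]


# Map each simple label (a..z) to a longer, less regular instance tag.
tags = {letter: md5_tag(letter) for letter in string.ascii_lowercase}

if __name__ == "__main__":
    for letter in "abcde":
        print(f"{letter} -> {tags[letter]}")
```

Deriving the tags from the simple labels keeps them reproducible, so the same values can be regenerated for both the relay config and CARBONLINK_HOSTS.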

grobian commented 4 years ago

merged your pr, thanks!