grobian / carbon-c-relay

Enhanced C implementation of Carbon relay, aggregator and rewriter
Apache License 2.0

Graphite cluster and Changing IPs #413

Closed (ra-dft closed 4 years ago)

ra-dft commented 4 years ago

Hello,

We have a graphite cluster with the following config:

cluster dc-new1 fnv1a_ch replication 2 <a.a.a.a>:2004 proto tcp <a.a.a.b>:2004 proto tcp <a.a.a.c>:2004 proto tcp <a.a.a.d>:2004 proto tcp <a.a.a.e>:2004 proto tcp <a.a.a.f>:2004 proto tcp

We didn't implement "instance" hashing when we built the cluster because the README seemed ambiguous to us at the time. We now need to change the IPs of our cluster nodes and, as we understand it, fnv1a_ch currently hashes on <a.a.a.X>:2004. We would like to take this opportunity to start using the "instance" feature of fnv1a_ch hashing, without data gaps or data loss and without reshuffling whisper files across the db nodes.

One thought, based on our understanding, is to create a new cluster definition with the new IPs and an "instance" value for each node, in case we ever have to change the IPs of our cluster nodes again. The plan is to define a second cluster in the same conf file and then cut our "match" rules over to send data to the new cluster definition. My main question: if we set the "instance" string to the old IP and port, would the consistent hashing keep the same destination hosts for the data, so that we don't end up duplicating whisper DB files across the cluster? For instance:

cluster dc-new2 fnv1a_ch replication 2 <x.x.x.a>:2004=<a.a.a.a>:2004 proto tcp <x.x.x.b>:2004=<a.a.a.b>:2004 proto tcp <x.x.x.c>:2004=<a.a.a.c>:2004 proto tcp <x.x.x.d>:2004=<a.a.a.d>:2004 proto tcp <x.x.x.e>:2004=<a.a.a.e>:2004 proto tcp <x.x.x.f>:2004=<a.a.a.f>:2004 proto tcp

I hope my question makes sense.
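
For illustration, a definition like the one proposed above could be generated from an old-to-new address map instead of being typed by hand. This is only a minimal Python sketch: render_cluster and its parameters are hypothetical helper names, not part of carbon-c-relay, and the IPs are example values. It writes each member in the address=instance form from the proposal, with the old address used as the instance.

```python
def render_cluster(name, old_to_new, port=2004, replication=2):
    """Render a cluster block whose members keep the old address as instance.

    old_to_new maps each old IP to the new IP of the same node.
    (Hypothetical helper, for illustration only.)
    """
    lines = [f"cluster {name}", f"    fnv1a_ch replication {replication}"]
    for old_ip, new_ip in old_to_new.items():
        # new address on the left, old address as the instance on the right
        lines.append(f"        {new_ip}:{port}={old_ip}:{port}")
    lines.append("    ;")
    return "\n".join(lines)


if __name__ == "__main__":
    # Example values only; substitute the real old and new node IPs.
    example = {f"1.1.1.{n}": f"24.24.24.{n}" for n in range(1, 7)}
    print(render_cluster("new-dc", example))
```

The generated text should still be checked with the relay's own test mode before it is deployed.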

grobian commented 4 years ago

Using a = 1, b = 2, etc., with 1.1.1.N as the old IPs and 24.24.24.N as the new ones:

% ./relay -t -f change_ips.conf
...
cluster old-dc
    fnv1a_ch replication 2
        <1.1.1.1>:2004
        <1.1.1.2>:2004
        <1.1.1.3>:2004
        <1.1.1.4>:2004
        <1.1.1.5>:2004
        <1.1.1.6>:2004
    ;
cluster new-dc
    fnv1a_ch replication 2
        <24.24.24.1>:2004=<1.1.1.1>:2004
        <24.24.24.2>:2004=<1.1.1.2>:2004
        <24.24.24.3>:2004=<1.1.1.3>:2004
        <24.24.24.4>:2004=<1.1.1.4>:2004
        <24.24.24.5>:2004=<1.1.1.5>:2004
        <24.24.24.6>:2004=<1.1.1.6>:2004
    ;

match *
    send to
        old-dc
        new-dc
    ;

foo.bar.bleh
match
    * -> foo.bar.bleh
    fnv1a_ch(old-dc)
        <1.1.1.2>:2004
        <1.1.1.3>:2004
    fnv1a_ch(new-dc)
        <24.24.24.2>:2004
        <24.24.24.3>:2004

blahalashsakjksa
match
    * -> blahalashsakjksa
    fnv1a_ch(old-dc)
        <1.1.1.5>:2004
        <1.1.1.6>:2004
    fnv1a_ch(new-dc)
        <24.24.24.5>:2004
        <24.24.24.6>:2004

...

You can also run the relay with -t -d to dump the hash rings and compare those. But yes, you've correctly understood the theory behind the instance for fnv1a_ch. As you can see, each metric is routed to the corresponding nodes in both clusters (1.1.1.2 and 1.1.1.3 in old-dc, 24.24.24.2 and 24.24.24.3 in new-dc, and so on), which would not have happened without the instance.
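
The idea behind the instance can be sketched with a toy consistent-hash ring in Python. This is not carbon-c-relay's actual ring construction: the number of points per node, the "i-key" point labels, and the names build_ring and route are assumptions made for illustration. The one property it is meant to show is that when ring positions are derived from the instance string rather than from the address, swapping the address leaves the placement unchanged.

```python
import bisect


def fnv1a_32(data: bytes) -> int:
    """Standard 32-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = 0x811C9DC5
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


def build_ring(nodes, points_per_node=100):
    """nodes is a list of (address, instance or None) pairs.

    Ring positions are hashed from the instance when one is given,
    otherwise from the address. (Toy layout, assumed for illustration.)
    """
    ring = []
    for address, instance in nodes:
        key = instance if instance is not None else address
        for i in range(points_per_node):
            ring.append((fnv1a_32(f"{i}-{key}".encode()), key, address))
    ring.sort()
    return ring


def route(ring, metric, replication=2):
    """Walk clockwise from the metric's position, collecting distinct nodes."""
    start = bisect.bisect_left(ring, (fnv1a_32(metric.encode()),))
    chosen = []
    for offset in range(len(ring)):
        address = ring[(start + offset) % len(ring)][2]
        if address not in chosen:
            chosen.append(address)
            if len(chosen) == replication:
                break
    return chosen


# Old cluster: no instance, so the address itself is hashed.
old_addrs = [f"1.1.1.{n}:2004" for n in range(1, 7)]
new_addrs = [f"24.24.24.{n}:2004" for n in range(1, 7)]
old_to_new = dict(zip(old_addrs, new_addrs))

old_ring = build_ring([(a, None) for a in old_addrs])
# New cluster: new address, old address reused as the instance.
new_ring = build_ring([(old_to_new[a], a) for a in old_addrs])
# New cluster without instances, for comparison: positions come from new IPs.
bare_ring = build_ring([(a, None) for a in new_addrs])

if __name__ == "__main__":
    for metric in ("foo.bar.bleh", "blahalashsakjksa"):
        print(metric)
        print("  old-dc:          ", route(old_ring, metric))
        print("  new-dc:          ", route(new_ring, metric))
        print("  new, no instance:", route(bare_ring, metric))
```

In this sketch the old-dc and new-dc lines always name corresponding nodes, because both rings are built from the same key strings. The ring built from the bare new addresses has different positions, so its placement will generally differ. The node numbers it prints are not expected to match the relay output above, since the real ring layout is different.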

ra-dft commented 4 years ago

Fantastic and thank you for the hint on how to test. This is much appreciated.