grobian / carbon-c-relay

Enhanced C implementation of Carbon relay, aggregator and rewriter
Apache License 2.0

How to add a new backend host to a fnv1a_ch cluster with a replication factor of 1? #283

Closed mwtzzz-zz closed 7 years ago

mwtzzz-zz commented 7 years ago

Hi, my understanding (admittedly without spending much time thinking about it) was that I could simply add an additional host to the fnv1a_ch cluster definition (replication factor 1) and (a) existing metrics would still be routed to the same hosts as before, while (b) brand-new metrics would start to make their way to the new host. I assumed that the hashing mechanism would continue to route existing metrics as before.

But this isn't the case. I added a new host, and it immediately started receiving an even share of the metrics that were previously going to the other hosts. Of course, this resulted in our frontend showing unpredictable and incorrect graphs.

My question is: what is the proper way to add a new backend host?

I'm using instance IDs. My cluster definition looks like this:

cluster radar122
  fnv1a_ch
    radar122-a:1905=a
    radar122-b:1905=b
    radar122-c:1905=c
    radar122-d:1905=d
    ....
    ;
mwtzzz-zz commented 7 years ago

looking at the man page, the following jumps out at me:

When using the fnv1a_ch cluster, this instance overrides the hash key in use.

Is this the reason for the behavior I observed?

deniszh commented 7 years ago

Hello @mwtzzz, unfortunately your initial assumption is wrong. With any consistent hashing (carbon_ch, fnv1a_ch or jump_fnv1a_ch, it doesn't matter), adding a host to a cluster of N nodes holding K metrics changes the routing of roughly K/N metrics. That's still good: with normal ("non-consistent") hashing, adding a host changes the routing of nearly all K metrics (see wiki).
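To illustrate the difference, here is a minimal Python sketch that counts how many metrics change destination when a fifth node joins a four-node cluster. It is only a toy: the MD5-based hash, the 100 ring points per node, and the node and metric names are all assumptions made for the example, and it is not the ring layout that carbon-c-relay actually builds for fnv1a_ch.

```python
import bisect
import hashlib


def h(value: str) -> int:
    """Stable 64-bit hash (MD5-based stand-in, NOT carbon-c-relay's FNV-1a)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def modulo_route(metric: str, nodes: list[str]) -> str:
    """Plain hashing: the node index is the hash modulo the node count."""
    return nodes[h(metric) % len(nodes)]


def build_ring(nodes: list[str], replicas: int = 100) -> list[tuple[int, str]]:
    """Place `replicas` points per node on a ring sorted by position."""
    return sorted((h(f"{node}:{i}"), node) for node in nodes for i in range(replicas))


def ring_route(metric: str, ring, positions) -> str:
    """Consistent hashing: pick the first ring point at or after the metric's hash."""
    idx = bisect.bisect_left(positions, h(metric)) % len(ring)
    return ring[idx][1]


def moved_fraction(route_before, route_after, metrics) -> float:
    """Fraction of metrics whose destination differs between two routing functions."""
    moved = sum(1 for m in metrics if route_before(m) != route_after(m))
    return moved / len(metrics)


# Hypothetical metric and node names, for illustration only.
metrics = [f"servers.host{i}.cpu.load" for i in range(20000)]
old_nodes = [f"node-{c}" for c in "abcd"]
new_nodes = old_nodes + ["node-e"]

old_ring, new_ring = build_ring(old_nodes), build_ring(new_nodes)
old_pos = [p for p, _ in old_ring]
new_pos = [p for p, _ in new_ring]

modulo_moved = moved_fraction(
    lambda m: modulo_route(m, old_nodes),
    lambda m: modulo_route(m, new_nodes),
    metrics,
)
ring_moved = moved_fraction(
    lambda m: ring_route(m, old_ring, old_pos),
    lambda m: ring_route(m, new_ring, new_pos),
    metrics,
)

print(f"modulo hashing moved {modulo_moved:.0%} of metrics")
print(f"consistent hashing moved {ring_moved:.0%} of metrics")
```

In this sketch the modulo scheme reroutes most metrics, while the ring reroutes only about the new node's share, and every rerouted metric lands on the new node. That moved share is the data a rebalance has to copy.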

So, you should rebalance your cluster after adding a node. You can use carbonate (but it supports only carbon hashing) or buckytools (supports carbon_ch, fnv1a_ch or jump_fnv1a_ch). Please also note that modern Graphite (0.9.15/16 or 1.0.x) can "merge" metrics on the fly (set REMOTE_STORE_MERGE_RESULTS=True), so you should get consistent graphs even if you don't rebalance your cluster (but only for the carbon_ch and fnv1a_ch hashes, not the jump one).
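For reference, the merge setting lives in graphite-web's local_settings.py, which is itself a Python file. A sketch of the relevant fragment could look like the following; the backend hostnames and ports are placeholders I made up, so substitute your own.

```python
# local_settings.py on the graphite-web frontend (sketch only).

# Backends that graphite-web queries for data.
# These hostnames and ports are hypothetical placeholders.
CLUSTER_SERVERS = [
    "backend-a.example.com:80",
    "backend-b.example.com:80",
]

# Merge the series that several backends return for the same metric,
# rather than picking just one of them.
REMOTE_STORE_MERGE_RESULTS = True
```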

mwtzzz-zz commented 7 years ago

@deniszh Thanks for your quick response and for providing these two suggestions. I think I'm going to look at the Merge option. If I go with Merge, should I still rebalance?

deniszh commented 7 years ago

It depends. I think too much merging will make rendering slow at some point. Also, note that whisper file size is fixed (if you are not using sparse files), so adding a new host will cause the creation of K/N new whisper files, which will consume disk space.
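To put a rough number on that disk cost, here is a small Python sketch that estimates the size of one whisper file from its retention and multiplies it by the number of re-created files. The byte sizes follow the whisper on-disk layout (16-byte metadata, 12 bytes per archive header, 12 bytes per data point); the retention, the metric count and the node count are made-up example values.

```python
METADATA_SIZE = 16      # aggregation type, max retention, xFilesFactor, archive count
ARCHIVE_INFO_SIZE = 12  # offset, seconds per point, number of points
POINT_SIZE = 12         # 4-byte timestamp + 8-byte double value


def whisper_file_size(archives: list[tuple[int, int]]) -> int:
    """Size in bytes of a non-sparse whisper file.

    `archives` holds (seconds_per_point, points) pairs, one per archive.
    """
    header = METADATA_SIZE + ARCHIVE_INFO_SIZE * len(archives)
    return header + sum(points * POINT_SIZE for _, points in archives)


def extra_bytes_after_adding_host(total_metrics: int, node_count: int, archives) -> int:
    """Rough extra disk use if about total_metrics / node_count files are re-created."""
    return (total_metrics // node_count) * whisper_file_size(archives)


# Hypothetical retention: 60s resolution for 1 day, 5min resolution for 30 days.
archives = [(60, 1440), (300, 8640)]
size = whisper_file_size(archives)
extra = extra_bytes_after_adding_host(1_000_000, 13, archives)

print(f"one whisper file: {size} bytes")
print(f"extra space across the cluster: {extra / 1e9:.1f} GB")
```

The old copies stay on their original hosts until you rebalance or clean up, so the extra space is used on top of what is already stored.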

mwtzzz-zz commented 7 years ago

I'm testing buckytools, but it says it doesn't support fnv1a_ch:

setuidgid uuu ./buckyd -node ec2-xxx.compute-1.amazonaws.com -hash fnv1a_ch
2017/06/25 20:16:19 Invalide hash type.  Supported types: [carbon jump_fnv1a]

Is there another way to rebalance the cluster?

grobian commented 7 years ago

@jjneely: fnv1a_ch indeed seems unimplemented, would you accept a patch adding it? Looks as if the change would mostly be adding code to hashing.go, I could try adding it.

mwtzzz-zz commented 7 years ago

@grobian Thanks for working on it. If you make a patch, I'll test it on my cluster of 12 hosts, each of which has about 600GB of metric data.

mwtzzz-zz commented 7 years ago

Hi @grobian any success with a patch?

grobian commented 7 years ago

haven't got the spare cycles to look into it yet, sorry

deniszh commented 7 years ago

@mwtzzz - you can migrate to jump_fnv1a_ch - you will need to move more data, ofc, but only once

mwtzzz-zz commented 7 years ago

@deniszh What are the steps to migrate from fnv1a_ch to jump_fnv1a_ch ?

mwtzzz-zz commented 7 years ago

@grobian Thanks for working on it. Do you have a patch I can test out?

deniszh commented 7 years ago

@mwtzzz: sorry, disregard my advice - graphite-web doesn't support jump_fnv1a_ch, so you'll need to either migrate to carbon_ch or use something like go-carbon + carbonzipper

grobian commented 7 years ago

I can't seem to build buckytools (my Go is too new or something?), so no patch. Seems like it's not necessary either if you don't use carbonzipper.

mwtzzz-zz commented 7 years ago

Thanks for working on it, I'll think about what my next steps will be.

grobian commented 7 years ago

Adding it to buckytools is not that trivial, because the port is currently ignored, and the fnv1a_ch hash type needs it.

deniszh commented 7 years ago

Yep, I tried to add it to buckytools too, but got lost. I added support to the latest carbonate, but it will need support from the latest carbon too. So maybe carbonzipper will be the best option for you.

grobian commented 7 years ago

I came up with this https://github.com/grobian/buckytools/commit/50a706a9300ee0d8ac82656ed06f806b1d0514ea

mwtzzz-zz commented 7 years ago

Ok, I'll test it out soon.

mwtzzz-zz commented 7 years ago

I installed your patch and am running buckyd on each of our 12 hosts as follows:

./buckyd -node radar122-X.mgmt -p /media/ephemeral0/carbon/storage/whisper/ -hash fnv1a radar122-{a..l}.mgmt

But it doesn't seem to be working:

[root@ec2-xxx radar122 bin]$ ./bucky inconsistent
2017/07/16 06:23:28 Results from radar122-a.mgmt:4242 not available. Sleeping.
2017/07/16 06:23:28 Results from radar122-i.mgmt:4242 not available. Sleeping.
...
[root@ec2-xxx radar122 bin]$  ./bucky list -r '^carbon\.'                                    
2017/07/16 06:24:58 Results from radar122-i.mgmt:4242 not available. Sleeping.
2017/07/16 06:24:58 Results from radar122-g.mgmt:4242 not available. Sleeping.
...

Am I running it correctly?

deniszh commented 7 years ago

I think radar122-{a..l}.mgmt will not work. You need to enter all hosts, space-separated, like:

radar122-a.mgmt radar122-b.mgmt radar122-c.mgmt radar122-d.mgmt radar122-e.mgmt radar122-f.mgmt radar122-g.mgmt radar122-h.mgmt radar122-i.mgmt radar122-j.mgmt radar122-k.mgmt radar122-l.mgmt

deniszh commented 7 years ago

Also please note that if you're using a non-2003 port and/or instance names, they also need to be included, like radar122-a.mgmt:2103:a radar122-b.mgmt:2103:a radar122-c.mgmt:2103:a ... But it depends on how it's configured in relay.conf, ofc.
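If typing out twelve members by hand is error-prone, a small helper can generate the list. The sketch below assumes the relay config uses the `host:port=instance` form shown in the cluster definition earlier in this thread and turns each member into the `host:port:instance` form described above; the function name is my own invention, not part of buckytools.

```python
def relay_member_to_bucky(member: str) -> str:
    """Turn 'host:port=instance' (relay.conf style) into 'host:port:instance'.

    Members without an '=instance' suffix are returned unchanged.
    """
    server, sep, instance = member.partition("=")
    return f"{server}:{instance}" if sep else server


# Hypothetical 12-node cluster, named like the one discussed in this thread.
members = [f"radar122-{letter}.mgmt:1905={letter}" for letter in "abcdefghijkl"]

# Print one space-separated line to paste after the buckyd options.
print(" ".join(relay_member_to_bucky(m) for m in members))
```

Run it once and use the same member list, in the same order, on every host.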

mwtzzz-zz commented 7 years ago

my relay config looks like this:

cluster radar122
  fnv1a_ch 
    radar122-a.mgmt:1905=a 
    radar122-b.mgmt:1905=b 
    radar122-c.mgmt:1905=c 
  ...

I took your suggestion and tried running buckyd like this:

/tmp/buckyd -node radar122-b.mgmt -p /media/ephemeral0/carbon/storage/whisper/ -hash fnv1a radar122-a.mgmt:1905:a radar122-b.mgmt:1905:b radar122-c.mgmt:1905:c

Port 4242 is reachable from all the hosts, but I still see the following messages:

[root@ec2- radar122 bin]$ ./bucky list -h radar122-b.mgmt:4242                    
2017/07/16 20:19:53 Results from radar122-c.mgmt:4242 not available. Sleeping.
2017/07/16 20:19:53 Results from radar122-a.mgmt:4242 not available. Sleeping.
deniszh commented 7 years ago

What do the buckyd logs on stdout/stderr say?

mwtzzz-zz commented 7 years ago

Here are the buckyd stdout/stderr logs:

2017/07/18 20:02:17 Starting server on 0.0.0.0:4242
2017/07/18 20:04:24 172.17.35.131:58099 - - GET /hashring
2017/07/18 20:04:24 172.17.35.131:58099 - - GET /metrics?force=true
2017/07/18 20:04:24 Scaning /media/ephemeral0/carbon/storage/whisper/ for metrics...
2017/07/18 20:04:24 172.17.35.131:58129 - - GET /metrics?force=true
2017/07/18 20:04:25 172.17.35.131:58149 - - GET /metrics?force=true
2017/07/18 20:04:26 172.17.35.131:58191 - - GET /metrics?force=true
2017/07/18 20:04:38 172.17.35.131:58203 - - GET /hashring
2017/07/18 20:04:39 172.17.35.131:58203 - - GET /metrics?force=true
2017/07/18 20:04:39 172.17.35.131:58237 - - GET /metrics?force=true
2017/07/18 20:04:40 172.17.35.131:58259 - - GET /metrics?force=true
2017/07/18 20:04:41 172.17.35.131:58277 - - GET /metrics?force=true
2017/07/18 20:04:43 172.17.35.131:58355 - - GET /metrics?force=true
2017/07/18 20:04:49 172.17.35.131:58403 - - GET /hashring
2017/07/18 20:04:49 172.17.35.131:58403 - - GET /metrics?force=true
2017/07/18 20:04:49 172.17.35.131:58439 - - GET /metrics?force=true
2017/07/18 20:04:50 172.17.35.131:58463 - - GET /metrics?force=true
2017/07/18 20:04:51 172.17.35.131:58493 - - GET /metrics?force=true
2017/07/18 20:04:53 172.17.35.131:58519 - - GET /metrics?force=true
2017/07/18 20:04:56 172.17.35.131:58539 - - GET /metrics?force=true
2017/07/18 20:05:01 172.17.35.131:58571 - - GET /metrics?force=true
2017/07/18 20:05:09 172.17.35.131:58625 - - GET /metrics?force=true
grobian commented 7 years ago

let's move this to the buckytools issue.

mwtzzz-zz commented 7 years ago

If we can get buckytools working with fnv1a_ch, that would be fantastic.

grobian commented 7 years ago

https://github.com/jjneely/buckytools/issues/17

mwtzzz-zz commented 7 years ago

@grobian I'll try your patch again this weekend. Maybe I'm not running buckyd correctly. I'll experiment with different ways of specifying the members of the ring on the command line.

grobian commented 7 years ago

I'm no expert on buckytools. If I find some cycles, I'll try it myself.

mwtzzz-zz commented 7 years ago

That would be great

mwtzzz-zz commented 7 years ago

Hi @grobian, I'm just getting back to looking at this issue. I got pulled away on other things at work, but now I need to take a look again.

Have you had a chance to try making a patch?

grobian commented 7 years ago

I thought we concluded in https://github.com/jjneely/buckytools/issues/17 :)

mwtzzz-zz commented 7 years ago

Oh wow, I missed that! Excellent, let me try it out today. Thanks!

mwtzzz-zz commented 7 years ago

@grobian I'm having issues with version 0.40. Would you mind taking a look at my comment in https://github.com/jjneely/buckytools/issues/17 ?