jjneely / buckytools

Go implementation of useful tools for dealing with Graphite's Whisper DBs and Carbon hashing

Trouble rebalancing a go-carbon cluster #38

Closed · zerosoul13 closed this issue 2 years ago

zerosoul13 commented 2 years ago

Hello,

We've been experimenting with this tool to rebalance a go-carbon cluster with a large number of metrics, but we have been having trouble finding the right configuration for it.

Our Graphite cluster runs the go-graphite stack (carbon-relay-ng -> go-carbon) on Kubernetes. We would like to match carbon-relay-ng's consistent hashing to buckyd's.

Carbon-relay-ng

Our carbon-relay-ng destinations are as follows:

['go-carbon-0.go-carbon.graphite:2004 spool=true pickle=true','go-carbon-1.go-carbon.graphite:2004 spool=true pickle=true','go-carbon-2.go-carbon.graphite:2004 spool=true pickle=true','go-carbon-3.go-carbon.graphite:2004 spool=true pickle=true','go-carbon-4.go-carbon.graphite:2004 spool=true pickle=true','go-carbon-5.go-carbon.graphite:2004 spool=true pickle=true']

BuckyD

NOTE: buckyd is running as a sidecar container. Both go-carbon and the sidecar mount the same volume to read and write the whisper files.

/usr/sbin/buckyd --node go-carbon-5.go-carbon.graphite -b 0.0.0.0:4242 -hash carbon -p /mnt/var/lib/graphite/whisper/ -t /tmp -timeout 7200 go-carbon-0.go-carbon.graphite:2004 go-carbon-1.go-carbon.graphite:2004 go-carbon-2.go-carbon.graphite:2004 go-carbon-3.go-carbon.graphite:2004 go-carbon-4.go-carbon.graphite:2004 go-carbon-5.go-carbon.graphite:2004

Bucky

Here's where I start to have trouble

/usr/sbin/bucky servers -h go-carbon-0.go-carbon.graphite:4242
2021/11/10 01:01:57 Error retrieving URL: Get "http://go-carbon-0.go-carbon.graphite:2004/hashring": read tcp 172.16.14.79:45902->172.16.76.119:2004: read: connection reset by peer
2021/11/10 01:01:57 Cluster unhealthy: go-carbon-0.go-carbon.graphite:2004: Get "http://go-carbon-0.go-carbon.graphite:2004/hashring": read tcp 172.16.14.79:45902->172.16.76.119:2004: read: connection reset by peer

2021/11/10 01:01:57 Error retrieving URL: Get "http://go-carbon-1.go-carbon.graphite:2004/hashring": read tcp 172.16.14.79:47772->172.16.27.82:2004: read: connection reset by peer
2021/11/10 01:01:57 Cluster unhealthy: go-carbon-1.go-carbon.graphite:2004: Get "http://go-carbon-1.go-carbon.graphite:2004/hashring": read tcp 172.16.14.79:47772->172.16.27.82:2004: read: connection reset by peer

2021/11/10 01:01:57 Error retrieving URL: Get "http://go-carbon-2.go-carbon.graphite:2004/hashring": read tcp 172.16.14.79:40946->172.16.58.32:2004: read: connection reset by peer
2021/11/10 01:01:57 Cluster unhealthy: go-carbon-2.go-carbon.graphite:2004: Get "http://go-carbon-2.go-carbon.graphite:2004/hashring": read tcp 172.16.14.79:40946->172.16.58.32:2004: read: connection reset by peer

2021/11/10 01:01:57 Error retrieving URL: Get "http://go-carbon-3.go-carbon.graphite:2004/hashring": read tcp 172.16.14.79:43570->172.16.127.44:2004: read: connection reset by peer
2021/11/10 01:01:57 Cluster unhealthy: go-carbon-3.go-carbon.graphite:2004: Get "http://go-carbon-3.go-carbon.graphite:2004/hashring": read tcp 172.16.14.79:43570->172.16.127.44:2004: read: connection reset by peer

2021/11/10 01:01:57 Error retrieving URL: Get "http://go-carbon-4.go-carbon.graphite:2004/hashring": read tcp 172.16.14.79:51674->172.16.27.91:2004: read: connection reset by peer
2021/11/10 01:01:57 Cluster unhealthy: go-carbon-4.go-carbon.graphite:2004: Get "http://go-carbon-4.go-carbon.graphite:2004/hashring": read tcp 172.16.14.79:51674->172.16.27.91:2004: read: connection reset by peer

2021/11/10 01:01:57 Error retrieving URL: Get "http://go-carbon-5.go-carbon.graphite:2004/hashring": read tcp 172.16.14.79:56712->172.16.14.79:2004: read: connection reset by peer
2021/11/10 01:01:57 Cluster unhealthy: go-carbon-5.go-carbon.graphite:2004: Get "http://go-carbon-5.go-carbon.graphite:2004/hashring": read tcp 172.16.14.79:56712->172.16.14.79:2004: read: connection reset by peer

Buckd daemons are using port: 4242
Hashing algorithm: [carbon: 6 nodes, 100 replicas, 600 ring members go-carbon-0.go-carbon.graphite:2004=None go-carbon-1.go-carbon.graphite:2004=None go-carbon-2.go-carbon.graphite:2004=None go-carbon-3.go-carbon.graphite:2004=None go-carbon-4.go-carbon.graphite:2004=None go-carbon-5.go-carbon.graphite:2004=None]
Number of replicas: 100
Found these servers:
        go-carbon-0.go-carbon.graphite:2004
        go-carbon-1.go-carbon.graphite:2004
        go-carbon-2.go-carbon.graphite:2004
        go-carbon-3.go-carbon.graphite:2004
        go-carbon-4.go-carbon.graphite:2004
        go-carbon-5.go-carbon.graphite:2004

Is cluster healthy: false
2021/11/10 01:01:57 Cluster is inconsistent.

The only reason I use the hostname:port notation is to match my carbon-relay-ng destinations, but adding the port doesn't really help. If I remove the port and only provide the hostname, buckytools does report the cluster as healthy.
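
For reference, the hostname-only invocation that reports the cluster as healthy is the same buckyd command as above, just without the :2004 suffix on the ring members:

/usr/sbin/buckyd --node go-carbon-5.go-carbon.graphite -b 0.0.0.0:4242 -hash carbon -p /mnt/var/lib/graphite/whisper/ -t /tmp -timeout 7200 go-carbon-0.go-carbon.graphite go-carbon-1.go-carbon.graphite go-carbon-2.go-carbon.graphite go-carbon-3.go-carbon.graphite go-carbon-4.go-carbon.graphite go-carbon-5.go-carbon.graphite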

With the cluster reported as healthy I can trigger the rebalance, but a couple of hours later I start to see new metrics reported as inconsistent once again.

Help

  1. Could someone please provide an example of how I should be running buckyd and the bucky tools so that they match my current carbon-relay-ng destinations? I've already rebalanced the cluster 3 times.

  2. I see that carbon-relay-ng offers an undocumented option to add an instance label to a destination. Should I use it, with the SERVER:PORT:INSTANCE notation, so that the hash ring matches on both sides?

zerosoul13 commented 2 years ago

I've been checking the code and found that https://github.com/jjneely/buckytools/blob/master/cmd/bucky/cluster.go#L79 sets the port to the value the user provided to buckyd at startup (port 2004 in our case).

I've created a fork of the code and updated https://github.com/go-graphite/buckytools/blob/master/cmd/bucky/cluster.go#L88: instead of v.Port, I've set it to 4242:

        Cluster.Servers = append(Cluster.Servers, fmt.Sprintf("%s:%d", v.Server, 4242))
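
For comparison, and going by the description above rather than a verbatim copy of the upstream source, the original line would have been along these lines:

        Cluster.Servers = append(Cluster.Servers, fmt.Sprintf("%s:%d", v.Server, v.Port))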

The output has now changed

/ # /usr/sbin/bucky servers -h go-carbon-5.go-carbon.graphite:4242
Buckd daemons are using port: 4242
Hashing algorithm: [carbon: 6 nodes, 100 replicas, 600 ring members go-carbon-0.go-carbon.graphite:2004=None go-carbon-1.go-carbon.graphite:2004=None go-carbon-2.go-carbon.graphite:2004=None go-carbon-3.go-carbon.graphite:2004=None go-carbon-4.go-carbon.graphite:2004=None go-carbon-5.go-carbon.graphite:2004=None]
Number of replicas: 100
Found these servers:
        go-carbon-0.go-carbon.graphite:4242
        go-carbon-1.go-carbon.graphite:4242
        go-carbon-2.go-carbon.graphite:4242
        go-carbon-3.go-carbon.graphite:4242
        go-carbon-4.go-carbon.graphite:4242
        go-carbon-5.go-carbon.graphite:4242

Is cluster healthy: true

The question here is: would this update have any negative side effects while rebalancing?

jjneely commented 2 years ago

It should not have a negative side effect. If this is the "carbon" hashing method, the port (or middle value in the triple) isn't used to inform the hashring, just the hostname and the optional instance.
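
To illustrate the point, here is a simplified sketch of carbon-style ring positioning (illustration only, not the exact buckytools implementation; the real key format differs in details). The position is derived from the server name plus the optional instance, so the port never affects where a metric lands:

        package main

        import (
            "crypto/md5"
            "fmt"
        )

        // ringPosition sketches carbon-style positioning: the first 16 bits of the
        // MD5 digest of a node key. Details are simplified for illustration only.
        func ringPosition(key string) uint16 {
            sum := md5.Sum([]byte(key))
            return uint16(sum[0])<<8 | uint16(sum[1])
        }

        func main() {
            // The node key is built from the server name and the optional instance;
            // the port is not part of it, so :2004 vs :4242 makes no difference.
            // The "('server', None)" key format here is an assumption for the sketch.
            server := "go-carbon-0.go-carbon.graphite"
            key := fmt.Sprintf("('%s', None)", server) // no instance configured
            fmt.Println("ring position for", server, "=", ringPosition(key))
        }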

When configuring buckyd, you want it to find the other buckyd servers on your other nodes. Therefore, you want to use, in your case, port 4242, which is the default port. Port 2004 is go-carbon and, obviously, doesn't support bucky's little API.

You are seeing metrics re-appear in their old places after a re-balance. Why are you attempting to re-balance here? Are you adding additional nodes/storage? If the graphite cluster is unchanged, then I suspect there may be a hashing algorithm difference between your carbon-relay-ng and buckyd. I know there are several options for hash ring algorithms here with various tradeoffs.

zerosoul13 commented 2 years ago

We are rebalancing the cluster because we added a new node.

I suspect there may be a hashing algorithm difference between your carbon-relay-ng and buckyd

Would you have an example of how to set them up in such a way that they agree on the hashring?

deniszh commented 2 years ago

@zerosoul13 : could you please post your config for carbon-relay-ng somewhere (e.g. on https://gist.github.com/ or https://pastebin.com/) and post the link here?

zerosoul13 commented 2 years ago

@deniszh Here's my carbon-relay-ng configuration

https://gist.github.com/zerosoul13/04f6d390fb0af7303029d9d737c6a6b4

zerosoul13 commented 2 years ago

I think I know why there's an issue:

I'm raising the ticket on the wrong repo. I think at one point I switched from jjneely/buckytools to go-graphite/buckytools and didn't realize it.

@jjneely in your code, this is true:

It should not have a negative side effect. If this is the "carbon" hashing method, the port (or middle value in the triple) isn't used to inform the hashring, just the hostname and the optional instance.

and can be seen here: https://github.com/jjneely/buckytools/blob/master/cmd/bucky/inconsistent.go#L65

In the go-graphite version (which I'm using) this is not true: the port does play a role, since it compares the two ports (4242 and 2004) and tries to match them. Metrics are reported as inconsistent because of this.

The number of metrics reported as inconsistent is almost the same as the total number of metrics the cluster holds, which seems very odd.
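
Conceptually, what I think is happening is something like this (a simplified sketch of the comparison, not the actual go-graphite code):

        package main

        import "fmt"

        func main() {
            // The hashring members carry the relay port (:2004), while the buckyd
            // servers are identified on :4242, so the string comparison never
            // matches and every metric ends up flagged as inconsistent.
            owner := "go-carbon-5.go-carbon.graphite:2004" // what the ring reports for a metric
            local := "go-carbon-5.go-carbon.graphite:4242" // how the local buckyd is identified

            metrics := []string{"carbon.agents.a.cpuUsage", "carbon.agents.b.cpuUsage"}
            var inconsistent []string
            for _, m := range metrics {
                if owner != local { // ports differ, so this is always true
                    inconsistent = append(inconsistent, m)
                }
            }
            fmt.Printf("%d of %d metrics flagged as inconsistent\n", len(inconsistent), len(metrics))
        }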

I will close this issue and move it to the right repo if the issue continues. Thank you both for your help.

deniszh commented 2 years ago

@zerosoul13 : if you're using a full hostname like go-carbon-1.go-carbon.gs-bg-graas in the relay config, you should use the same full hostname in the bucky config. You can use IPs, short hostnames, or long hostnames, but the hashing algorithm uses that part to calculate the metric distribution, so it has to match between the relay and buckyd.
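
For example (using the hostnames from earlier in this thread), a relay destination and its buckyd ring member should share the same full hostname:

        # carbon-relay-ng destination:
        go-carbon-1.go-carbon.graphite:2004 spool=true pickle=true

        # corresponding buckyd ring member (same full hostname):
        go-carbon-1.go-carbon.graphite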

zerosoul13 commented 2 years ago

That was bad editing on my part. Here's the updated carbon-relay-ng configuration: https://gist.github.com/zerosoul13/d92aa97c3c7d72d39dbfb1f942de11e1