graphite-project / carbonate

Utilities for managing graphite clusters
MIT License

Different "consistent hashing" results from Carbon and Carbonate #62

Closed: sw0x2A closed this issue 7 years ago

sw0x2A commented 8 years ago

Carbon-sieve reports that some metrics belong to another node but these metrics are actually used and updated by carbon.

DESTINATIONS is defined identically in carbon.conf and carbonate.conf. All daemons have been restarted and are using this configuration.

Roughly 1/8 of all metrics are reported to belong to another host, the rest is fine.

The cluster has 6 nodes. One runs haproxy, which load-balances requests across 8 carbon-relays on the same machine. These carbon-relays use consistent hashing with DESTINATIONS = 172.22.5.14:2014:relay, 172.22.5.106:2014:relay, 172.22.5.107:2014:relay, 172.22.5.234:2014:relay, 172.22.5.235:2014:relay. Each entry in DESTINATIONS is a carbon-relay on one of the other 5 servers, which in turn has DESTINATIONS = 127.0.0.1:2004:cache, 127.0.0.1:2104:b, i.e. two carbon-cache instances sharing the same whisper storage path.

When I was using carbon-sieve to clean up some stuff, I noticed that its results are wrong.

Example: rg.community.pagespeed.publicationDetail.loggedOut.connect.median

File exists on host with IP 172.22.5.14:

$ stat /data/graphite/whisper/rg/community/pagespeed/publicationDetail/loggedOut/connect/median.wsp
  File: ‘/data/graphite/whisper/rg/community/pagespeed/publicationDetail/loggedOut/connect/median.wsp’
  Size: 325816      Blocks: 640        IO Block: 4096   regular file
[...]
Access: 2016-03-06 08:50:09.246530787 +0000
Modify: 2016-05-01 12:58:52.698880921 +0000
Change: 2016-05-01 12:58:52.698880921 +0000
 Birth: -

But carbon-sieve wants it on 172.22.5.106:

$ echo "rg.community.pagespeed.publicationDetail.loggedOut.connect.median" | carbon-sieve -C main -n 172.22.5.106
rg.community.pagespeed.publicationDetail.loggedOut.connect.median

I have been troubleshooting this for hours but cannot find the cause. I also noticed that carbon-sieve uses the hashing method from the carbon library. Approximately 493,000 of 4.11 million metrics are misplaced, between 9.5% and 13.5% of the metrics per node. This is close to 12.5% (1/8), which matches the 8 carbon-relays behind haproxy.
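As a quick sanity check of that ratio, here is a small sketch that uses only the figures quoted above (the variable names are just for illustration):

```python
# Figures quoted in this comment.
misplaced = 493000   # metrics reported on the "wrong" node
total = 4110000      # all metrics in the cluster
relays = 8           # carbon-relays behind haproxy

share = misplaced / float(total)
one_relay_share = 1.0 / relays

print("misplaced share:   {:.1%}".format(share))
print("share of 1 relay:  {:.1%}".format(one_relay_share))
```

The overall share comes out at roughly 12%, which is what makes the 1/8 comparison tempting, although it does not by itself show that one relay is at fault.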

Any hints are highly appreciated. If you need more information, please do not hesitate to ask.

sw0x2A commented 8 years ago

Forgot to mention the environment. All servers run: Ubuntu 14.04.2 LTS (trusty), Python 2.7.6, carbon==0.9.15, carbonate==0.2.2

deniszh commented 8 years ago

Hi @sw0x2A, did you change anything in the default carbon.conf? What is DIVERSE_REPLICAS set to? Or better, publish the config somewhere.

sw0x2A commented 8 years ago

Hi @deniszh,

DIVERSE_REPLICAS is not set in my carbon.conf, so I assume it defaults to False. Please find the full carbon.conf below.

carbon.conf used on relay host
carbon.conf used on cache hosts

deniszh commented 8 years ago

Very strange then. As you can see in the code - https://github.com/graphite-project/carbonate/blob/master/carbonate/cluster.py#L9 - carbonate does not contain its own hashing code; it uses the Graphite code from /opt/graphite/lib/carbon/routers.py. Maybe you have a different version of carbon installed there?

sw0x2A commented 8 years ago

Checked that already. Same version of Python and carbon and carbonate on all servers.

BTW carbonate.conf for completeness:

[main]
DESTINATIONS = 172.22.5.14:2014:relay, 172.22.5.106:2014:relay, 172.22.5.107:2014:relay, 172.22.5.234:2014:relay, 172.22.5.235:2014:relay
REPLICATION_FACTOR = 1
SSH_USER = root

[old]
DESTINATIONS = 172.22.5.14:2014:relay, 172.22.5.106:2014:relay, 172.22.5.107:2014:relay
REPLICATION_FACTOR = 1
SSH_USER = root

sw0x2A commented 8 years ago

Maybe worth mentioning: this does not only happen with carbon-sieve. The consistent hashing results of carbon-sieve and carbon-lookup agree with each other, but for around 12% of the metrics they differ from where the carbon-relays actually send the data.

deniszh commented 8 years ago

Sorry, @sw0x2A, I have no more ideas. I need to test it myself, but unfortunately have no time right now. BTW, if you have RF=1 you can try bucky tools - https://github.com/jjneely/buckytools - and check whether you have this problem there...

sw0x2A commented 8 years ago

Hi @deniszh , thanks for the link to buckytools. They look quite useful and at least the results of the metrics I tested are the same on carbon-lookup and bucky.

Whisper file is updated on 172.22.5.14 but carbon-lookup and bucky want it on 172.22.5.106. Guess this means something on carbon-relays is wrong...

$ ./bucky locate rg.community.pagespeed.publicationDetail.loggedOut.connect.median
rg.community.pagespeed.publicationDetail.loggedOut.connect.median => 172.22.5.106
$ carbon-lookup rg.community.pagespeed.publicationDetail.loggedOut.connect.median
172.22.5.106:2014:relay

deniszh commented 8 years ago

Yep, quite strange. I'm using https://github.com/grobian/carbon-c-relay as relay now. Could you please maybe check that too?

sw0x2A commented 8 years ago

This is nothing that I can change now, but I will keep it in mind. BTW I added a value for that metric using carbon-client.py. It was sent to 172.22.5.106 and created a new Whisper file there.

mthssdrbrg commented 7 years ago

Just ran into the same issue and after a slew of debugging it turned out that we were sending metrics that looked something like prefix-part-1.prefix-part-2..metric (notice the ..), but for carbonate it's impossible to know that a metric contains double dots due to the filesystem being "nice" and just disregarding them when Graphite is writing them to disk.
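For anyone landing here later, a minimal sketch of why this is invisible on disk. It assumes the Whisper path is derived by replacing dots with path separators and that the relay hashes the raw metric name with MD5; `whisper_path` and `ring_key_digest` are hypothetical helpers for illustration, not carbon's actual API, and the storage root is the one from the `stat` example earlier in the thread.

```python
import hashlib
import posixpath

WHISPER_ROOT = "/data/graphite/whisper"  # storage root taken from the stat example above

def whisper_path(metric):
    """Hypothetical helper: turn dots into slashes and append .wsp."""
    relative = metric.replace(".", "/").lstrip("/") + ".wsp"
    # normpath collapses the empty segment ("a//b" -> "a/b"),
    # which is effectively what the filesystem does on write.
    return posixpath.normpath(posixpath.join(WHISPER_ROOT, relative))

def ring_key_digest(metric):
    """Hypothetical stand-in for the hash input: digest of the raw, unnormalised name."""
    return hashlib.md5(metric.encode("utf-8")).hexdigest()

sent = "prefix-part-1.prefix-part-2..metric"     # what the client actually sent
on_disk = "prefix-part-1.prefix-part-2.metric"   # what a tool reconstructs from the file path

print(whisper_path(sent) == whisper_path(on_disk))        # both names land in the same file
print(ring_key_digest(sent) == ring_key_digest(on_disk))  # but the hashed strings differ
```

Under these assumptions the relay hashes the name with the double dot, while any tool that walks the whisper tree can only hash the single-dot name, so the two can legitimately disagree about the destination node.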

@sw0x2A, guessing you've already moved on from this (or worked it out somehow), or else you might want to check the same.

sw0x2A commented 7 years ago

@mthssdrbrg It is the same issue here too. Metrics contain .. which is really hard to find when you only check how the metrics are distributed and written to the filesystem. Thanks a lot for your comment!
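One way to spot such names before they reach the relays is to check for empty segments on the sending side. This is only a sketch: `has_empty_segment` and `normalise` are hypothetical helper names, and the sample list just reuses the two metric names mentioned in this thread.

```python
def has_empty_segment(metric):
    """Return True if the name has a leading, trailing, or doubled dot."""
    return "" in metric.split(".")

def normalise(metric):
    """Drop empty segments so the name matches the path that ends up on disk."""
    return ".".join(part for part in metric.split(".") if part)

incoming = [
    "rg.community.pagespeed.publicationDetail.loggedOut.connect.median",
    "prefix-part-1.prefix-part-2..metric",
]

suspects = [m for m in incoming if has_empty_segment(m)]
for metric in suspects:
    print("{} -> {}".format(metric, normalise(metric)))
```

Whether to reject such names or normalise them at the client is a judgment call; either way the name that gets hashed would then match the name that can be recovered from the filesystem.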