Open github-vincent-miszczak opened 5 years ago
the issue where that python code is introduced is https://github.com/graphite-project/carbon/pull/196#issuecomment-30669907 it describes 2 different hosts hashing to the same position. but here you're showing the same ip hashing to two different positions (which seems like normal behavior) and a position not existing. how is this the same problem?
Here I'm showing there is a drift between the current implementation and Python one making the two implementations incompatible.
I just picked a range that illustrates the drift. With the provided host list, the range 59417-59482
should have 3 values according to reference implementation.
As you stated, there are only 2 with this one.
I'm not solving https://github.com/graphite-project/carbon/pull/196. I'm solving compatibility issues with the reference implementation.
but in your PR which fixes this issue, you claim to introduce the equivalent for https://github.com/graphite-project/carbon/commit/024f9e67ca47619438951c59154c0dec0b0518c7#diff-1486787206e06af358b8d935577e76f5 which is the solution to https://github.com/graphite-project/carbon/pull/196
so how it is possible that we're not aiming to solve https://github.com/graphite-project/carbon/pull/196 here ?
Because we aim to solve https://github.com/graphite-ng/carbon-relay-ng/issues/335 does not mean we aim to solve https://github.com/graphite-project/carbon/pull/196. It may do, but that's a non goal. I never claimed the issues are the same.
I don't know why https://github.com/graphite-project/carbon/pull/196 appeared in this conversation. I just pointed the commit on carbon
side that results in a drift, not the whole discussion. While the Python patch came from the discussion, I'm not sure this is relevant to this issue. Python discussion is about muting the hash-ring and replication side effects. carbon-relay-ng
uses immutable hash-ring and has not replication support.
The issue we are discussing here is incompatible algorithm between carbon-relay-ng
and carbon
.
I have multiple carbon-caches receiving metrics from a relay-ng route in consitentHashing mode. Thanks to the parameter CARBONLINK_HOSTS, the Graphite webapp may query the caches for data that has not yet been persisted. Without the vmiszczak-teads's fix, the webapp looks up data in the wrong cache because the hashing algorithm is not compatible.
so @hamelg you can independently confirm that the behavior from this patch is the right one? i will bump this on my priority list then.
my bad ! No, i don't confirm. I double checked and I made a mistake. Here, the current consistent hashing in relay-ng works fine and gives the same result than the graphite stack. Forget my last comment. I have tried to reproduce the issue with several test cases without success.
so @hamelg to be clear you're saying then the code in master works fine and we should not merge this PR?
If that can help here is some code that illustrate the issue: https://gist.github.com/vmiszczak-teads/23399bfe47f7bddfbd5f7d9556e9e796 https://gist.github.com/vmiszczak-teads/16aed8bf3dd0f2879ab14712673c143f https://gist.github.com/vmiszczak-teads/944e57266b161c9b40ff51b2628e5e9f
Finally, I have succeeded in reproducing the issue with the specific setup provided by vmiszczak-teads.
carbon-relay-ng.conf
[[route]]
key = 'default'
type = 'consistentHashing'
destinations = [
'10.20.34.114:12003 spool=false pickle=false',
'10.20.39.104:12003 spool=false pickle=false',
'10.20.40.161:12003 spool=false pickle=false',
'10.20.35.158:12003 spool=false pickle=false',
'10.20.37.70:12003 spool=false pickle=false',
'10.20.40.126:12003 spool=false pickle=false',
'10.20.33.78:12003 spool=false pickle=false',
'10.20.39.19:12003 spool=false pickle=false',
'10.20.42.66:12003 spool=false pickle=false',
'10.20.34.131:12003 spool=false pickle=false',
'10.20.38.55:12003 spool=false pickle=false',
'10.20.41.75:12003 spool=false pickle=false',
'10.20.32.8:12003 spool=false pickle=false',
'10.20.37.165:12003 spool=false pickle=false'
]
carbonate.conf
[main]
DESTINATIONS = 10.20.34.114:12003,10.20.39.104:12003,10.20.40.161:12003,10.20.35.158:12003,10.20.37.70:12003,10.20.40.126:12003,10.20.33.78:12003,10.20.39.19:12003,10.20.42.66:12003,10.20.34.131:12003,10.20.38.55:12003,10.20.41.75:12003,10.20.32.8:12003,10.20.37.165:12003
REPLICATION_FACTOR = 1
HASHING_TYPE = carbon_ch
Without his patch, the hashing result is different on these nodes : 10.20.32.8 10.20.34.131 10.20.35.158 10.20.39.104 10.20.40.126 10.20.42.66
After applying the patch, carbonate and relay-ng give the same result.
I found out that Carbon in Python has different behavior than this project
carbon_ch
.This is a severe issue because a tool like
carbonate
does not sees the same world thancarbon-relay-ng
. I almost messed a 5TB cluster wanting to purge metrics after a rebalance.Using server list :
gives ring positions including:
Python gives:
Notice
(59418, ('10.20.34.131', None))
exists in Python and not in this implementation.Having a metric that hashes to
59418
gives destination10.20.40.126
with carbon-relay-ng and10.20.34.131
with Python/Carbonate. If you filter out metrics withcarbon-sieve
to delete after a rebalance, you may delete metrics that you should not.