grafana / carbon-relay-ng

Fast carbon relay+aggregator with admin interfaces for making changes online - production ready
Other
467 stars 151 forks source link

Consistent hashing is different than Python #335

Open github-vincent-miszczak opened 5 years ago

github-vincent-miszczak commented 5 years ago

I found out that Carbon in Python has different behavior than this project carbon_ch.

This is a severe issue because a tool like carbonate does not sees the same world than carbon-relay-ng. I almost messed a 5TB cluster wanting to purge metrics after a rebalance.

Using server list :

"10.20.34.114:12003", "10.20.39.104:12003", "10.20.40.161:12003", "10.20.35.158:12003", "10.20.37.70:12003", "10.20.40.126:12003", "10.20.33.78:12003", "10.20.39.19:12003", "10.20.42.66:12003", "10.20.34.131:12003", "10.20.38.55:12003", "10.20.41.75:12003", "10.20.32.8:12003", "10.20.37.165:12003"

gives ring positions including:

Position: (uint16) 59417,
Hostname: (string) (len=12) "10.20.40.126",
Instance: (string) "",
DestinationIndex: (int) 5
}
(route.hashRingEntry) {
Position: (uint16) 59482,
Hostname: (string) (len=12) "10.20.40.126",
Instance: (string) "",
DestinationIndex: (int) 5
}

Python gives:

(59417, ('10.20.40.126', None)),
(59418, ('10.20.34.131', None)),
(59482, ('10.20.40.126', None)),

Notice (59418, ('10.20.34.131', None)) exists in Python and not in this implementation.

Having a metric that hashes to 59418 gives destination 10.20.40.126 with carbon-relay-ng and 10.20.34.131 with Python/Carbonate. If you filter out metrics with carbon-sieve to delete after a rebalance, you may delete metrics that you should not.

Dieterbe commented 5 years ago

the issue where that python code is introduced is https://github.com/graphite-project/carbon/pull/196#issuecomment-30669907 it describes 2 different hosts hashing to the same position. but here you're showing the same ip hashing to two different positions (which seems like normal behavior) and a position not existing. how is this the same problem?

github-vincent-miszczak commented 5 years ago

Here I'm showing there is a drift between the current implementation and Python one making the two implementations incompatible. I just picked a range that illustrates the drift. With the provided host list, the range 59417-59482 should have 3 values according to reference implementation. As you stated, there are only 2 with this one. I'm not solving https://github.com/graphite-project/carbon/pull/196. I'm solving compatibility issues with the reference implementation.

Dieterbe commented 5 years ago

but in your PR which fixes this issue, you claim to introduce the equivalent for https://github.com/graphite-project/carbon/commit/024f9e67ca47619438951c59154c0dec0b0518c7#diff-1486787206e06af358b8d935577e76f5 which is the solution to https://github.com/graphite-project/carbon/pull/196

so how it is possible that we're not aiming to solve https://github.com/graphite-project/carbon/pull/196 here ?

github-vincent-miszczak commented 5 years ago

Because we aim to solve https://github.com/graphite-ng/carbon-relay-ng/issues/335 does not mean we aim to solve https://github.com/graphite-project/carbon/pull/196. It may do, but that's a non goal. I never claimed the issues are the same.

I don't know why https://github.com/graphite-project/carbon/pull/196 appeared in this conversation. I just pointed the commit on carbon side that results in a drift, not the whole discussion. While the Python patch came from the discussion, I'm not sure this is relevant to this issue. Python discussion is about muting the hash-ring and replication side effects. carbon-relay-ng uses immutable hash-ring and has not replication support.

The issue we are discussing here is incompatible algorithm between carbon-relay-ng and carbon.

hamelg commented 5 years ago

I have multiple carbon-caches receiving metrics from a relay-ng route in consitentHashing mode. Thanks to the parameter CARBONLINK_HOSTS, the Graphite webapp may query the caches for data that has not yet been persisted. Without the vmiszczak-teads's fix, the webapp looks up data in the wrong cache because the hashing algorithm is not compatible.

Dieterbe commented 5 years ago

so @hamelg you can independently confirm that the behavior from this patch is the right one? i will bump this on my priority list then.

hamelg commented 5 years ago

my bad ! No, i don't confirm. I double checked and I made a mistake. Here, the current consistent hashing in relay-ng works fine and gives the same result than the graphite stack. Forget my last comment. I have tried to reproduce the issue with several test cases without success.

Dieterbe commented 5 years ago

so @hamelg to be clear you're saying then the code in master works fine and we should not merge this PR?

github-vincent-miszczak commented 5 years ago

If that can help here is some code that illustrate the issue: https://gist.github.com/vmiszczak-teads/23399bfe47f7bddfbd5f7d9556e9e796 https://gist.github.com/vmiszczak-teads/16aed8bf3dd0f2879ab14712673c143f https://gist.github.com/vmiszczak-teads/944e57266b161c9b40ff51b2628e5e9f

hamelg commented 5 years ago

Finally, I have succeeded in reproducing the issue with the specific setup provided by vmiszczak-teads.

carbon-relay-ng.conf

[[route]]
key = 'default'
type = 'consistentHashing'
destinations = [
'10.20.34.114:12003 spool=false pickle=false',
'10.20.39.104:12003 spool=false pickle=false',
'10.20.40.161:12003 spool=false pickle=false',
'10.20.35.158:12003 spool=false pickle=false',
'10.20.37.70:12003 spool=false pickle=false',
'10.20.40.126:12003 spool=false pickle=false',
'10.20.33.78:12003 spool=false pickle=false',
'10.20.39.19:12003 spool=false pickle=false',
'10.20.42.66:12003 spool=false pickle=false',
'10.20.34.131:12003 spool=false pickle=false',
'10.20.38.55:12003 spool=false pickle=false',
'10.20.41.75:12003 spool=false pickle=false',
'10.20.32.8:12003 spool=false pickle=false',
'10.20.37.165:12003 spool=false pickle=false'
]

carbonate.conf

[main]
DESTINATIONS = 10.20.34.114:12003,10.20.39.104:12003,10.20.40.161:12003,10.20.35.158:12003,10.20.37.70:12003,10.20.40.126:12003,10.20.33.78:12003,10.20.39.19:12003,10.20.42.66:12003,10.20.34.131:12003,10.20.38.55:12003,10.20.41.75:12003,10.20.32.8:12003,10.20.37.165:12003
REPLICATION_FACTOR = 1
HASHING_TYPE = carbon_ch

Without his patch, the hashing result is different on these nodes : 10.20.32.8 10.20.34.131 10.20.35.158 10.20.39.104 10.20.40.126 10.20.42.66

After applying the patch, carbonate and relay-ng give the same result.