jjneely / buckytools

Go implementation of useful tools for dealing with Graphite's Whisper DBs and Carbon hashing

Bucky rebalance loses points if a metric is failover'ed on multiple nodes. #34

Open ecsumed opened 5 years ago

ecsumed commented 5 years ago

In a failover scenario where metrics meant for one node end up on several other nodes, rebalance loses some data points, seemingly at random. This happens only when the failed-over data for a metric is spread across multiple nodes instead of a single server.

daemon commands:

buckyd -node node1 -sparse -hash fnv1a node1 node2 node3 # on node1
buckyd -node node2 -sparse -hash fnv1a node1 node2 node3 # on node2
buckyd -node node3 -sparse -hash fnv1a node1 node2 node3 # on node3

Now suppose node 2 goes down for 5 minutes. Here is the data for a specific metric:

node 2 --- BEFORE rebalance
1561724280 114.000000
1561724340 1000.000000
1561724400 378.000000
1561724460 None
1561724520 None
1561724580 None
1561724640 None
1561724700 None
1561724760 None
1561724820 95.000000
1561724880 465.000000

node 1 --- failover'ed data on node 1
1561724280 None
1561724340 None
1561724400 None
1561724460 394.000000
1561724520 794.000000
1561724580 None
1561724640 686.000000
1561724700 None
1561724760 None
1561724820 None
1561724880 None

node 3 -- failover'ed data on node 3
1561724280 None
1561724340 None
1561724400 None
1561724460 None
1561724520 None
1561724580 35.000000
1561724640 None
1561724700 863.000000
1561724760 858.000000
1561724820 None
1561724880 None
1561724940 None
1561725000 None

node 2 -- AFTER rebalance
1561724280 114.000000
1561724340 1000.000000
1561724400 378.000000
1561724460 394.000000
1561724520 794.000000
1561724580 None      <------- This is lost, even though it exists on node 3
1561724640 686.000000
1561724700 863.000000
1561724760 858.000000
1561724820 95.000000
1561724880 465.000000

Here's another example

node 2 -- BEFORE rebalance
1561730820 417.000000
1561730880 654.000000
1561730940 559.000000
1561731000 None
1561731060 None
1561731120 None
1561731180 None
1561731240 None
1561731300 None
1561731360 None
1561731420 None
1561731480 670.000000
1561731540 99.000000
1561731600 202.000000
1561731660 304.000000
1561731720 502.000000

node 1 -- failover'ed data
1561730820 None
1561730880 None
1561730940 None
1561731000 None
1561731060 366.000000
1561731120 766.000000
1561731180 None
1561731240 887.000000
1561731300 296.000000
1561731360 None
1561731420 681.000000
1561731480 None
1561731540 None
1561731600 None
1561731660 None
1561731720 None

node 3 -- failover'ed data
1561730820 None
1561730880 None
1561730940 None
1561731000 853.000000
1561731060 None
1561731120 None
1561731180 3.000000
1561731240 None
1561731300 None
1561731360 247.000000
1561731420 None
1561731480 None
1561731540 None
1561731600 None
1561731660 None
1561731720 None

node 2 -- AFTER rebalance
1561730820 417.000000
1561730880 654.000000
1561730940 559.000000
1561731000 None        <---- point lost even though it exists on node 3
1561731060 366.000000
1561731120 766.000000
1561731180 None        <---- point lost even though it exists on node 3
1561731240 887.000000
1561731300 296.000000
1561731360 None         <---- point lost even though it exists on node 3
1561731420 681.000000
1561731480 670.000000
1561731540 99.000000
1561731600 202.000000
1561731660 304.000000
1561731720 502.000000

rebalance command: bucky rebalance -f
bucky version: 0.4.1

Expected results: Points should not be lost

jjneely commented 5 years ago

There is #19 as well. I don't think that is what you are hitting, but it might be useful for context and patches. I used to spend a lot of time with Graphite, less so now. (I usually help people migrate off of it instead!)

The algorithm that rebalance uses is (or at least was meant to be) a direct, bug-for-bug port of what whisper-fill does. That algorithm does end up dropping data points in certain boundary cases, and the Go version I wrote duplicates them. I believe this is what you are hitting with this example. You might be interested in fill_test.go, which has some tests and demonstrations of exactly this. Run it with go test in the fill/ directory.