openark / orchestrator

MySQL replication topology management and HA
Apache License 2.0

submit-masters-to-kv-stores fails to consistently submit clusters to zookeeper #741

Open flopex opened 6 years ago

flopex commented 6 years ago

Running orchestrator-client -c submit-masters-to-kv-stores, or calling /api/submit-masters-to-kv-stores, fails with "zk: node already exists":

[martini] Started GET /api/submit-masters-to-kv-stores for 10.255.53.176
2018-11-28 14:29:48 DEBUG orchestrator/raft: applying command 22939: put-key-value
2018-11-28 14:29:48 DEBUG creating: /databases/mysql/master/cluster1/ipv6
2018-11-28 14:29:48 INFO Connected to zk_host:22181
2018-11-28 14:29:48 INFO Authenticated: id=459409973546798301, timeout=4000
2018-11-28 14:29:48 INFO Re-submitting `0` credentials after reconnect
2018-11-28 14:29:48 DEBUG create status for /databases/mysql/master/cluster1/ipv6: , zk: node already exists
2018-11-28 14:29:48 DEBUG creating: /databases/mysql/master/cluster1
2018-11-28 14:29:48 DEBUG create status for /databases/mysql/master/cluster1: , zk: node already exists
2018-11-28 14:29:48 DEBUG creating: /databases/mysql/master
2018-11-28 14:29:48 DEBUG create status for /databases/mysql/master: , zk: node already exists
2018-11-28 14:29:48 DEBUG creating: /databases/mysql
2018-11-28 14:29:48 DEBUG create status for /databases/mysql: , zk: node already exists
2018-11-28 14:29:48 DEBUG creating: /databases
2018-11-28 14:29:48 DEBUG create status for /databases: , zk: node already exists
2018-11-28 14:29:48 DEBUG create status for /databases: , zk: node already exists
2018-11-28 14:29:48 DEBUG create status for /databases/mysql: , zk: node already exists
2018-11-28 14:29:48 DEBUG create status for /databases/mysql/master: , zk: node already exists
2018-11-28 14:29:49 DEBUG create status for /databases/mysql/master/cluster1: , zk: node already exists
2018-11-28 14:29:49 DEBUG create status for /databases/mysql/master/cluster1/ipv6: , zk: node already exists
2018-11-28 14:29:49 INFO Recv loop terminated: err=EOF
2018-11-28 14:29:49 INFO Send loop terminated: err=<nil>
[martini] Completed 500 Internal Server Error in 1.22661914s

On the first run it submitted our first cluster to ZooKeeper, but it failed to submit the remaining clusters.

Our zk tree looks like this (we expected more clusters to show up)

Getting children stored at node /databases/mysql/master/
/databases/mysql/master
├── cluster1

We ran submit-masters-to-kv-stores multiple times and were eventually able to submit all clusters, but this is sub-optimal.

I see that this was a previously reported issue (#619) and it was fixed in (#620), but we are still seeing it on the latest version (3.0.13)

I also tested this in a docker playground with the same results (using the latest SHA):

$ docker-compose up
$ docker-compose exec orchestrator1 resources/bin/orchestrator-client -debug -c discover -i db1
$ docker-compose exec orchestrator1 resources/bin/orchestrator-client -debug -c submit-masters-to-kv-stores
zk: node already exists

docker_orchestrator_submit_masters_to_kv_stores_error.txt

shlomi-noach commented 6 years ago

Thank you. Not sure why #620 does not solve it on your side, but regardless, SubmitMastersToKvStores is wrong to bail out on the first error. https://github.com/github/orchestrator/pull/549/commits/6e5c0f563a8ca45ede486f3b7c9eb25dc9615ffa (piggybacking on a related PR) lets the iteration proceed even in the face of errors.
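
A minimal sketch of the behaviour described above, for illustration only: it is not orchestrator's actual code. The KVStore interface, memStore, putKey, submitAll, the key names and the host values are all hypothetical. It shows two ideas: treating "node already exists" as a cue to overwrite rather than as a failure, and continuing the loop after an error so that one bad cluster does not block the rest.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNodeExists is a hypothetical stand-in for "zk: node already exists".
var ErrNodeExists = errors.New("zk: node already exists")

// KVStore is a hypothetical minimal interface, not orchestrator's real one.
type KVStore interface {
	Create(key, value string) error
	Set(key, value string) error
}

// KVPair is one cluster-to-master mapping to publish.
type KVPair struct{ Key, Value string }

// memStore is an in-memory store that exists only to make the sketch runnable.
// Create fails with a simulated error for the key named in failOn.
type memStore struct {
	data   map[string]string
	failOn string
}

func newMemStore() *memStore { return &memStore{data: map[string]string{}} }

func (m *memStore) Create(key, value string) error {
	if key == m.failOn {
		return errors.New("simulated connection loss")
	}
	if _, ok := m.data[key]; ok {
		return ErrNodeExists
	}
	m.data[key] = value
	return nil
}

func (m *memStore) Set(key, value string) error {
	m.data[key] = value
	return nil
}

// putKey creates the key and falls back to an overwrite if it already exists.
func putKey(store KVStore, key, value string) error {
	err := store.Create(key, value)
	if errors.Is(err, ErrNodeExists) {
		return store.Set(key, value)
	}
	return err
}

// submitAll attempts every pair. It remembers the first error but keeps
// iterating, and returns how many pairs were actually submitted.
func submitAll(store KVStore, pairs []KVPair) (int, error) {
	var firstErr error
	submitted := 0
	for _, p := range pairs {
		if err := putKey(store, p.Key, p.Value); err != nil {
			if firstErr == nil {
				firstErr = fmt.Errorf("%s: %w", p.Key, err)
			}
			continue // do not bail out; try the remaining clusters
		}
		submitted++
	}
	return submitted, firstErr
}

func main() {
	store := newMemStore()
	store.failOn = "mysql/master/cluster1"
	pairs := []KVPair{
		{"mysql/master/cluster1", "db1:3306"},
		{"mysql/master/cluster2", "db2:3306"},
		{"mysql/master/cluster3", "db3:3306"},
	}
	n, err := submitAll(store, pairs)
	fmt.Println("submitted:", n, "first error:", err)
}
```

Under these assumptions the caller still receives an error when anything fails, but clusters after the failing one are written on the same run instead of needing repeated invocations.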