quantcast / qfs

Quantcast File System
https://quantcast.atlassian.net
Apache License 2.0

VR doesn't select new primary #223

Closed MrStupnikov closed 7 years ago

MrStupnikov commented 7 years ago

VR doesn't select new primary

Steps to reproduce

1. Build a QFS cluster with 3 metaservers (172.16.40.3, 172.16.40.4, 172.16.40.5; configs are attached)

2. Configure VR according to the instructions in /master/conf/ChunkServer.prp

3. Shut down the metaserver on the primary node

4. Try to get the VR status
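Steps 3–4 can be sketched as a loop over the metaserver nodes: after the primary is stopped, try each node in turn until one answers the status query. In this sketch `query_status` is a hypothetical stub standing in for the real `qfsadmin ... vr_get_status` invocation (shown later in this thread), so the sketch runs without a cluster; it pretends 172.16.40.3 (the stopped primary) is down.

```shell
#!/bin/sh
# query_status is a stub standing in for "qfsadmin ... vr_get_status";
# it pretends the stopped primary (172.16.40.3) is unreachable.
query_status() {
    [ "$1" = "172.16.40.3" ] && return 1
    echo "vr.status from $1"
}

# Try each metaserver until one answers.
answered=""
for node in 172.16.40.3 172.16.40.4 172.16.40.5; do
    if query_status "$node" > /dev/null; then
        answered=$node
        break
    fi
done
echo "got VR status from: ${answered:-no node answered}"
```

With working VR, one of the surviving nodes should answer; the report here is that after the primary shutdown no node returns usable status.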

Debug data

I have attached configuration files (MetaServer*.txt)

Here are links to the verbose output of the vr_get_status command before and after the primary node shutdown:

Success https://pastebin.com/TwAarDv5
Failure https://pastebin.com/LLwPpZDg

MetaServer3.txt MetaServer4.txt MetaServer5.txt

mikeov commented 7 years ago

The following suggests that the key steps of issuing vr_reconfiguration commands were not completed successfully:

vr.epoch: 0

Epoch 0 means that the system is in the initial / bootstrap configuration. In this configuration only the node with ID 0 can be primary, and it must be able to communicate with all other nodes until all nodes are activated.
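This check can be scripted: pull vr.epoch out of the vr_get_status output and flag the bootstrap configuration. A minimal sketch; `sample_status` is illustrative stand-in data (real output would come from `qfsadmin ... vr_get_status` and contains many more fields).

```shell
#!/bin/sh
# Illustrative stand-in for vr_get_status output; only the vr.epoch
# line matters for this check.
sample_status='vr.nodeId: 0
vr.epoch: 0
vr.view: 1'

# Extract the value after "vr.epoch: ".
epoch=$(printf '%s\n' "$sample_status" | sed -n 's/^vr\.epoch: *//p')
if [ "${epoch:-0}" -eq 0 ]; then
    echo "vr.epoch is 0: still in the initial/bootstrap configuration"
else
    echo "vr.epoch is $epoch: reconfiguration has completed"
fi
```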

All 4 steps mentioned in the admin guide VR example, including node activation, need to be completed successfully in order for meta server replication and automatic failover to work: https://github.com/quantcast/qfs/wiki/Administrator's-Guide. After successful completion of these steps the VR epoch should advance to 1.

Meta server replication can be reconfigured with no FS downtime: meta server nodes can be added, removed, and replaced, and their IPs / ports changed. Typically each successful reconfiguration that involves node activation and/or inactivation increments the VR epoch.

The QFS “endurance” test script can be used to set up and experiment with VR.

For example, the following will build the binaries, set up a test FS with 3 VR nodes and 9 chunk servers running on the local host, kill the primary meta server and query status, do an ls, and then shut down the FS.

```shell
make && \
cd build/release && \
../../src/test-scripts/run_endurance_mc.sh -mc-only -no-err-sim -auth 0 -vr 3 -test-dirs-prefix ./mytest/data && \
../src/cc/tools/qfsadmin -f mytest/data3/*/test/meta/qfsadmin.prp -s 127.0.0.1 -p 20000 vr_get_status && \
../src/cc/tools/qfs -cfg mytest/data3/*/test/cli/client.prp -fs qfs://127.0.0.1 -ls / && \
xargs kill < ./mytest/data3/*/test/meta/metaserver.pid && \
sleep 15 && \
../src/cc/tools/qfsadmin -f mytest/data3/*/test/meta/qfsadmin.prp -s 127.0.0.1 -p 20000 vr_get_status && \
../src/cc/tools/qfs -cfg mytest/data3/*/test/cli/client.prp -fs qfs://127.0.0.1 -ls / && \
../../src/test-scripts/run_endurance_mc.sh -mc-only -no-err-sim -auth 0 -vr 3 -test-dirs-prefix ./mytest/data -stop
```
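The fixed `sleep 15` before re-querying can also be replaced by polling until the status query succeeds again. A hedged sketch; `check_status` is a hypothetical stub standing in for the `qfsadmin ... vr_get_status` call (here it succeeds on the 3rd attempt so the sketch runs standalone).

```shell
#!/bin/sh
# Stub standing in for "qfsadmin ... vr_get_status"; succeeds on the
# 3rd call so this sketch runs without a cluster.
attempt=0
check_status() {
    attempt=$((attempt + 1))
    [ "$attempt" -ge 3 ]
}

# Poll until the status query succeeds, with an overall timeout.
tries=0
until check_status; do
    tries=$((tries + 1))
    if [ "$tries" -ge 30 ]; then
        echo "timed out waiting for a new primary"
        exit 1
    fi
    sleep 1
done
echo "VR status available again after $attempt attempts"
```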

— Mike.

On Oct 22, 2017, at 10:44 AM, Alexey Stupnikov notifications@github.com wrote:


Expected result: the cluster's VR status is displayed.
Actual result: qfs is unable to find the primary node and fails to get any information.


Success https://pastebin.com/TwAarDv5
Failure https://pastebin.com/LLwPpZDg
MetaServer3.txt https://github.com/quantcast/qfs/files/1405164/MetaServer3.txt
MetaServer4.txt https://github.com/quantcast/qfs/files/1405165/MetaServer4.txt
MetaServer5.txt https://github.com/quantcast/qfs/files/1405166/MetaServer5.txt

MrStupnikov commented 7 years ago

I have compared the lab's config with my environment and found a configuration issue. Thanks for your answer Mike, it was very helpful.