Closed MrStupnikov closed 7 years ago
The following suggests that the key steps of issuing vr_reconfiguration commands were not completed successfully:
vr.epoch: 0
Epoch 0 means that the system has the initial (bootstrap) configuration. In this configuration only the node with ID 0 can be primary, and it must be able to communicate with all other nodes until all nodes are activated.
All 4 steps mentioned in the admin guide VR example (https://github.com/quantcast/qfs/wiki/Administrator's-Guide), including node activation, would need to be completed successfully for meta server replication and automatic failover to work. After these steps complete successfully, the VR epoch should advance to 1.
Meta server replication can be reconfigured with no FS downtime: meta server nodes can be added, removed, and replaced, and their IPs / ports changed. Typically, each successful reconfiguration that involves node activation and/or inactivation increments the VR epoch.
The QFS “endurance” test script can be used to set up and experiment with VR.
For example, the following will build the binaries, set up a test FS with 3 VR nodes and 9 chunk servers running on the local host, query VR status and do an ls, then kill the primary meta server, query status and do an ls again, and finally shut down the FS.
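As a rough illustration of the epoch check described above, the sketch below reads status text on standard input and reports whether the cluster is still in the bootstrap configuration. It assumes the status output consists of `key: value` lines such as the `vr.epoch: 0` line quoted earlier; the function names `vr_epoch` and `vr_check_bootstrap` are hypothetical helpers, not part of QFS.

```shell
# Hypothetical helpers, not part of QFS. Assumes "key: value" lines on stdin,
# like the "vr.epoch: 0" line quoted above.

# Print the value of the first vr.epoch line; fail if there is none.
vr_epoch() {
    awk -F': *' '
        $1 == "vr.epoch" { print $2; found = 1; exit }
        END { if (!found) exit 1 }
    '
}

# Return 0 if the epoch is past bootstrap, 1 if it is still 0,
# and 2 if the epoch is missing or not a number.
vr_check_bootstrap() {
    epoch=$(vr_epoch) || {
        echo "vr.epoch not found in status output" >&2
        return 2
    }
    case $epoch in
        ''|*[!0-9]*)
            echo "unexpected vr.epoch value: $epoch" >&2
            return 2
            ;;
    esac
    if [ "$epoch" -eq 0 ]; then
        echo "vr.epoch is 0: still in the initial / bootstrap configuration"
        return 1
    fi
    echo "vr.epoch is $epoch: past the bootstrap configuration"
}
```

It could be fed from the status command, for example `qfsadmin -f qfsadmin.prp -s 127.0.0.1 -p 20000 vr_get_status | vr_check_bootstrap`, assuming the status is written to standard output.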
make && \
cd build/release && \
.../../src/test-scripts/run_endurance_mc.sh -mc-only -no-err-sim -auth 0 -vr 3 -test-dirs-prefix ./mytest/data && \
../src/cc/tools/qfsadmin -f mytest/data3//test/meta/qfsadmin.prp -s 127.0.0.1 -p 20000 vr_get_status && \
../src/cc/tools/qfs -cfg mytest/data3//test/cli/client.prp -fs qfs://127.0.0.1 -ls / && \
xargs kill < ./mytest/data3//test/meta/metaserver.pid && \
sleep 15 && \
../src/cc/tools/qfsadmin -f mytest/data3//test/meta/qfsadmin.prp -s 127.0.0.1 -p 20000 vr_get_status && \
../src/cc/tools/qfs -cfg mytest/data3/*/test/cli/client.prp -fs qfs://127.0.0.1 -ls / && \
.../../src/test-scripts/run_endurance_mc.sh -mc-only -no-err-sim -auth 0 -vr 3 -test-dirs-prefix ./mytest/data -stop
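The fixed `sleep 15` in the sequence above gives the remaining nodes time to elect a new primary. A polling loop is one possible alternative; the sketch below is a generic retry helper (`retry_until` is a hypothetical name, not a QFS tool). It assumes the command being polled exits non-zero while no primary is reachable, which has not been verified for `qfsadmin`.

```shell
# Hypothetical helper, not part of QFS.
# Usage: retry_until MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or MAX_TRIES attempts have been made.
# Assumption: COMMAND signals "not ready yet" through a non-zero exit status.
retry_until() {
    max_tries=$1
    delay=$2
    shift 2
    try=1
    while [ "$try" -le "$max_tries" ]; do
        # Output is discarded; only the exit status matters here.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        try=$((try + 1))
        # Sleep only if another attempt will follow.
        if [ "$try" -le "$max_tries" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example (not executed here): poll once per second for up to 30 seconds.
# retry_until 30 1 ../src/cc/tools/qfsadmin -f mytest/data3//test/meta/qfsadmin.prp \
#     -s 127.0.0.1 -p 20000 vr_get_status
```

Polling returns as soon as the command succeeds and still gives up after a bounded time, so a cluster that never elects a primary is reported as a failure rather than hanging.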
— Mike.
On Oct 22, 2017, at 10:44 AM, Alexey Stupnikov notifications@github.com wrote:
VR doesn't select new primary
Steps to reproduce
1. Build a QFS cluster with 3 metaservers (172.16.40.3, 172.16.40.4, 172.16.40.5; configs are attached)
2. Configure VR according to the instructions in /master/conf/ChunkServer.prp
3. Shut down the metaserver on the primary node
4. Try to get VR status
Expected result: the cluster's VR status is displayed.
Actual result: qfs is unable to get the primary node and fails to get any information.
Debug data
I have attached configuration files (MetaServer*.txt)
Here are links to the verbose output of the vr_get_status command before and after the primary node shutdown:
Success: https://pastebin.com/TwAarDv5
Failure: https://pastebin.com/LLwPpZDg
MetaServer3.txt: https://github.com/quantcast/qfs/files/1405164/MetaServer3.txt
MetaServer4.txt: https://github.com/quantcast/qfs/files/1405165/MetaServer4.txt
MetaServer5.txt: https://github.com/quantcast/qfs/files/1405166/MetaServer5.txt
I compared the lab's config with my environment and found a configuration issue. Thanks for your answer, Mike; it was very helpful.