Closed windkit closed 6 years ago
Would pinging the node (using `path_to_binary ping`) and checking the return value for an error be sufficient? Or is the system only partially online when it starts replying pong to the ping messages?
Something like
#!/usr/bin/env bash
# Start every node found under ./package and block until each one answers ping.
declare -a part=("leo_manager" "leo_storage" "leo_gateway")
for i in "${part[@]}"
do
    for j in ./package/"$i"*
    do
        k="$j/bin/$i"
        echo "Starting $k"
        "$k" start
        # we can redirect stdout to /dev/null (nothing on stderr)
        # keep pinging, one attempt per second, until we get the ok signal
        until "$k" ping
        do
            sleep 1
        done
    done
done
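The loop above waits forever if a node never comes up. A minimal sketch of a bounded variant follows; the helper name `wait_until` and its arguments are made up for illustration and are not part of LeoFS.

```shell
#!/usr/bin/env bash
# wait_until MAX_RETRIES COMMAND [ARGS...]   (hypothetical helper, not a LeoFS tool)
# Runs COMMAND until it exits 0, sleeping one second after each failure.
# Gives up and returns 1 once MAX_RETRIES retries have been used.
wait_until() {
    local max_retries="$1"
    shift
    local retries=0
    until "$@" > /dev/null 2>&1
    do
        if [ "$retries" -ge "$max_retries" ]; then
            return 1
        fi
        sleep 1
        retries=$((retries + 1))
    done
    return 0
}
```

In the script above, the ping loop could then become `wait_until 30 "$k" ping || exit 1`, where 30 is an arbitrary limit chosen for the example.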
@kunaltyagi Thanks for the suggestion! I hadn't thought of that before!
I think the node would respond to `ping` once `net_adm` is up.
We can extend the idea further and check whether the `leo_manager` / `leo_storage` / `leo_gateway` application is up.
What do you think? @mocchira @yosukehara
@windkit Actually, regarding Kunal's suggestion, I proposed that approach to Kunal last night.
@windkit Thanks for shedding light on this problem.
It is difficult to run an integration script, as we can only rely on a fixed-time wait.
On systems with systemd enabled, @vstax and I are working on detecting when a node is up as precisely as possible by adopting notify-type services (https://github.com/leo-project/leofs/issues/840). This will also let us start each node while taking dependencies into account (e.g. the slave starts after the master, then the storage node(s), then the gateway(s)), so life will become easier and more efficient once it has landed.
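Until the systemd work lands, the same ordering can be approximated in a plain script. The sketch below is only an illustration: the function name `start_in_order` is hypothetical, it assumes each node binary accepts `start` and `ping` as in the script earlier in this thread, and it waits indefinitely for each node.

```shell
#!/usr/bin/env bash
# start_in_order NODE_BIN...   (hypothetical helper, not a LeoFS tool)
# Starts the given node binaries one by one, in the order given, and moves on
# to the next node only after the current one answers ping.
start_in_order() {
    local node_bin
    for node_bin in "$@"
    do
        echo "Starting $node_bin"
        # Abort if the start command itself reports a failure.
        "$node_bin" start || return 1
        # Poll once per second until the node replies to ping.
        until "$node_bin" ping > /dev/null 2>&1
        do
            sleep 1
        done
    done
}
```

The caller passes the binaries in dependency order: master manager first, then the slave manager, the storage node(s), and the gateway(s). The actual paths depend on how the packages are laid out locally.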
The slave manager fails when the master manager is up but mnesia is not yet ready. The slave manager would keep failing afterwards, and there is no way to fix it.
Yes, this has been a problem for us, so last year I filed an issue here: https://github.com/leo-project/leofs/issues/562. That said, there are two options
`bin/leo_{manager/storage/gateway}/` returns `ok` after signaling the startup, and there is no reliable way to detect when a node is ready (maybe log parsing, but that depends on the log level too)
It is difficult to run an integration script, as we can only rely on a fixed-time wait.
The cluster fails to start, and there seems to be no way to fix it after a failed startup
The slave manager fails when the master manager is up but mnesia is not yet ready. The slave manager would keep failing afterwards, and there is no way to fix it.
Adding a `timer:sleep()` in `leo_manager_sup:create_mnesia_table_1` easily reproduces the situation.