leo-project / leofs

The LeoFS Storage System
https://leo-project.net/leofs/
Apache License 2.0

[all] Reliable way to detect when a node is ready #852

Closed windkit closed 6 years ago

windkit commented 7 years ago

bin/leo_{manager/storage/gateway}/ returns ok as soon as startup has been signaled, and there is no reliable way to detect when a node is actually ready (log parsing might work, but it depends on the log level too).

It is difficult to run integration scripts because we can only rely on fixed-time waits.

The cluster fails to start, and there seems to be no way to fix it after a failed startup

The slave manager fails when the master manager is up but mnesia is not yet ready. The slave manager continues to fail afterwards, and there is no way to fix it.

Adding a timer:sleep() call in leo_manager_sup:create_mnesia_table_1 easily reproduces the situation.

kunaltyagi commented 7 years ago

Would pinging the node (using path_to_binary ping) and checking for an error return value be sufficient? Or is the system only partially online when it starts replying pong to the ping messages? Something like:

#!/usr/bin/env bash

# Node types, started in this order.
declare -a part=("leo_manager" "leo_storage" "leo_gateway")

for i in "${part[@]}"
do
    # Every package directory whose name starts with the node type is one node.
    for j in ./package/"$i"*
    do
        k="$j/bin/$i"
        echo "Starting $k"
        "$k" start
        # we can redirect stdout to /dev/null (Nothing on stderr)
        # keep pinging till we get the ok signal (exit status 0)
        until "$k" ping
        do
            sleep 1
        done
    done
done

windkit commented 7 years ago

@kunaltyagi Thanks for the suggestion! I hadn't thought of that before!

I think a node responds to ping once net_adm is up. We can extend the idea further and check whether the leo_manager / leo_storage / leo_gateway application is up or not.

What do you think? @mocchira @yosukehara
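
The ping loop in the script above waits forever if a node never comes up, which is awkward in an integration script. A minimal sketch of a bounded wait follows. The helper name `wait_for`, its arguments and the example path are illustrative assumptions and not part of LeoFS; only the `ping` subcommand comes from the script above. Checking that the leo_* application itself is running would need an additional check that is not shown here.

```shell
#!/usr/bin/env bash

# wait_for MAX_TRIES COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND up to MAX_TRIES times, sleeping one second
# between attempts. Returns 0 as soon as COMMAND succeeds, 1 if it never does.
wait_for() {
    max_tries="$1"
    shift
    tries=0
    until "$@" > /dev/null 2>&1
    do
        tries=$((tries + 1))
        if [ "$tries" -ge "$max_tries" ]; then
            return 1
        fi
        sleep 1
    done
    return 0
}

# Example usage (the path is an assumption about the local package layout):
# wait_for 60 ./package/leo_manager_0/bin/leo_manager ping || exit 1
```

Passing the check as a command keeps the helper independent of what "ready" means, so a stricter check can replace `ping` later without changing the loop.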

yosukehara commented 7 years ago

@windkit Actually, regarding Kunal's suggestion, I proposed this approach to Kunal last night.

mocchira commented 7 years ago

@windkit Thanks for shedding light on this problem.

> It is difficult to run integration scripts because we can only rely on fixed-time waits.

On systems with systemd enabled, @vstax and I are working on detecting when a node is up as precisely as possible by adopting notify-type services (https://github.com/leo-project/leofs/issues/840). This will also enable us to start each node while taking the dependencies into account (e.g. the slave starts after the master, followed by the storage node(s) and then the gateway(s)), so life will become easier and more efficient once it has landed.
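
For readers unfamiliar with notify-type services: the service tells systemd that it is ready by sending READY=1, which a shell wrapper can do with `systemd-notify --ready`. The sketch below only illustrates that mechanism and is not the implementation discussed in issue #840. The function name `notify_ready_when` and the idea of using a polled check command are assumptions. A wrapper like this would need `Type=notify` in the unit, and probably `NotifyAccess=all`, because the notification is sent by a child process rather than the main one.

```shell
#!/usr/bin/env bash

# notify_ready_when COMMAND [ARGS...]
# Hypothetical helper: polls COMMAND once per second until it succeeds,
# then reports readiness to systemd if a notification socket is available.
notify_ready_when() {
    until "$@" > /dev/null 2>&1
    do
        sleep 1
    done
    # systemd exports NOTIFY_SOCKET to services that may send notifications.
    if [ -n "${NOTIFY_SOCKET:-}" ]; then
        systemd-notify --ready
    else
        echo "ready (not running under systemd, nothing to notify)"
    fi
}

# Example usage (the path is an assumption about the local package layout):
# notify_ready_when ./package/leo_manager_0/bin/leo_manager ping
```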

> The slave manager fails when the master manager is up but mnesia is not yet ready. The slave manager continues to fail afterwards, and there is no way to fix it.

Yes, this has been problematic for us, so last year I filed an issue here: https://github.com/leo-project/leofs/issues/562. That said, there are two options