
[bug]: lnd not resigning from leader role when disconnected from cluster #8913

Closed · Filiprogrammer closed this 1 month ago

Filiprogrammer commented 1 month ago

Background

In an lnd cluster, when the leader lnd node and its etcd endpoint lose their network connection to the rest of the cluster, another node takes over. So far so good. However, when the disconnected lnd node and its etcd endpoint are reconnected to the network, the old lnd node continues to run. The result is two lnd nodes running in the cluster at the same time, both accessing the same database simultaneously, which could potentially lead to a loss of funds.

Your environment

Steps to reproduce

Set up a bitcoind regtest node with the following configuration:

regtest=1

[regtest]
server=1
txindex=1
disablewallet=0
peerbloomfilters=0
rpcuser=user
rpcpassword=password
rpcport=8332
rpcallowip=0.0.0.0/0
rpcbind=0.0.0.0:8332
zmqpubhashblock=tcp://0.0.0.0:29000
zmqpubrawtx=tcp://0.0.0.0:29001
zmqpubrawblock=tcp://0.0.0.0:29002
listen=1

Create 3 Debian containers named "lndetcd1", "lndetcd2" and "lndetcd3".
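
For example, with LXD (any container runtime that can run systemd inside the containers will do; the image alias here is just an illustration):

lxc launch images:debian/12 lndetcd1
lxc launch images:debian/12 lndetcd2
lxc launch images:debian/12 lndetcd3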

Install etcd on each of the containers:

apt install etcd-server etcd-client

Add the following to the /etc/default/etcd file on each container:

ETCD_NAME=etcd1
ETCD_INITIAL_CLUSTER="etcd1=http://${IP_OF_LNDETCD1}:2380,etcd2=http://${IP_OF_LNDETCD2}:2380,etcd3=http://${IP_OF_LNDETCD3}:2380"
ETCD_ADVERTISE_CLIENT_URLS=http://${OWN_IP}:2379
ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379
ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2380
ETCD_INITIAL_ADVERTISE_PEER_URLS=http://${OWN_IP}:2380
ETCD_MAX_TXN_OPS=16384

Change ETCD_NAME according to the respective container name, and replace the ${IP_OF_LNDETCD1}, ${IP_OF_LNDETCD2}, ${IP_OF_LNDETCD3} and ${OWN_IP} placeholders with the actual container IP addresses.

Then start etcd on each container:

systemctl start etcd.service

Check that the cluster is healthy:

etcdctl endpoint health --cluster
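
On a healthy three-node cluster, each endpoint should report something along these lines (the addresses and timings shown are illustrative):

http://10.0.0.1:2379 is healthy: successfully committed proposal: took = 1.9ms
http://10.0.0.2:2379 is healthy: successfully committed proposal: took = 2.2ms
http://10.0.0.3:2379 is healthy: successfully committed proposal: took = 2.6ms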

Now install lnd on each of the containers with the following configuration, changing cluster.id accordingly:

[Application Options]
listen=0.0.0.0:9735
alias=lndetcd

[Bitcoin]
bitcoin.regtest=1
bitcoin.node=bitcoind

[Bitcoind]
bitcoind.rpcuser=user
bitcoind.rpcpass=password
bitcoind.rpchost=${IP_OF_BITCOIND}:8332
bitcoind.zmqpubrawtx=tcp://${IP_OF_BITCOIND}:29001
bitcoind.zmqpubrawblock=tcp://${IP_OF_BITCOIND}:29002

[db]
db.backend=etcd

[etcd]
db.etcd.host=127.0.0.1:2379
db.etcd.disabletls=1

[cluster]
cluster.enable-leader-election=1
cluster.leader-elector=etcd
cluster.etcd-election-prefix=cluster-leader
cluster.id=lndetcd1

Start lnd on one of the containers first and set up a wallet, then start lnd on the other containers.
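
The wallet on the first node can be created with lncli, for example (assuming default lnd paths; add flags like --rpcserver if your setup differs):

lncli --network=regtest create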

Now that everything is set up, disconnect the container running the current lnd leader node from the network.
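
One way to simulate the disconnect, assuming the container's network interface is named eth0, is to bring the interface down from inside the container:

ip link set eth0 down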

Expected behaviour

The lnd node should lose its leader status and crash or shut down.

Actual behaviour

The leader keeps running, unable to access the database, since its local etcd endpoint can no longer reach a majority of the cluster.

After about 1 minute, a different lnd node in the cluster takes over.

When the disconnected container is reconnected to the network, there are two active lnd nodes accessing the same database at the same time, which could potentially lead to a loss of funds.
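
The split brain can be observed in etcd itself: reading the election prefix from any connected member shows the election key, whose value should be the current leader's cluster.id:

etcdctl get --prefix cluster-leader --print-value-only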

bhandras commented 1 month ago

Currently the leader LND resigns only if it is shutting down. You can also follow this in the logs: https://github.com/lightningnetwork/lnd/blob/fdd28c8d888792ea8fde3c557ba9f2594e0a6ec8/lnd.go#L407

IIUC you're expecting that we detect the network disconnect and resign? Could you simply shut down the container?

Filiprogrammer commented 1 month ago

> IIUC you're expecting that we detect the network disconnect and resign?

Yes, that would be the desired effect. To be more precise: I expect lnd to detect when the leader lease expires and shut itself down, so that it can be automatically restarted by e.g. systemd and go back to the "waiting to become leader" state.

> Could you simply shut down the container?

Then I would need some script running in the background all the time, checking that the container still has a network connection. And even then it would not be 100% reliable, since there could be some unexpected network problem (e.g. a duplicate IP address, a misconfigured firewall, ARP cache issues, routing issues...). In such a case the container would be connected to the network but still unable to communicate with the majority of the cluster, which could still result in two leaders.

Filiprogrammer commented 1 month ago

I hacked together a script that works around this issue.

The following sh script has to be started before lnd is started:

#!/bin/sh

LND_CONF_FILE="${HOME}/.lnd/lnd.conf"

# Extract cluster.id and the election prefix from lnd.conf.
lnd_cluster_id=$(awk -F "=" '/cluster.id/ {print $2}' "$LND_CONF_FILE")
election_prefix=$(awk -F "=" '/cluster.etcd-election-prefix/ {print $2}' "$LND_CONF_FILE")

echo "cluster.id: $lnd_cluster_id"
echo "cluster.etcd-election-prefix: $election_prefix"

fifo=$(mktemp -u)
mkfifo "$fifo"

# Watch the election prefix until this node's cluster.id appears as the value
# of an election key; the key itself embeds the leader's lease ID.
etcdctl watch --prefix "$election_prefix" --write-out=json > "$fifo" &
cmd_pid=$!

while IFS= read -r line <&3; do
    key_base64=$(echo "$line" | jq -re .Events[].kv.key)
    has_key=$?
    value_base64=$(echo "$line" | jq -re .Events[].kv.value)
    has_value=$?

    if [ "$has_key" -eq 0 ] && [ "$has_value" -eq 0 ]; then
        key=$(echo "$key_base64" | base64 --decode)
        value=$(echo "$value_base64" | base64 --decode)

        if [ "$value" = "$lnd_cluster_id" ]; then
            kill "$cmd_pid"
            lease_id="${key#*/}"
            echo "Found lease_id=$lease_id for $lnd_cluster_id"
        fi
    fi
done 3< "$fifo"

wait "$cmd_pid"

# Poll every 5 seconds whether the leader lease is still alive. etcd reports
# a TTL of -1 once the lease has expired.
while true; do
    sleep 5

    # Query the remaining TTL of the leader lease. stderr is redirected into
    # the fifo as well, so any etcdctl error makes the jq parse below fail
    # and leaves has_ttl non-zero.
    etcdctl lease timetolive "$lease_id" --write-out=json > "$fifo" 2>&1 &
    cmd_pid=$!

    has_ttl=1
    while IFS= read -r line <&3; do
        ttl=$(echo "$line" | jq -re .ttl)
        has_ttl=$?
        kill "$cmd_pid" 2> /dev/null
    done 3< "$fifo"

    if [ "$has_ttl" -ne 0 ]; then
        echo "Failed to get ttl"
        break
    fi

    if [ "$ttl" = "-1" ]; then
        echo "Lease expired"
        break
    fi

    echo "Lease ttl=$ttl"
done

rm "$fifo"

# The lease expired or could no longer be queried: kill lnd immediately so
# that it cannot keep acting as a second leader.
kill -9 "$(systemctl show --property MainPID --value lnd.service)"
echo "Killed lnd"

Every 5 seconds the script checks whether the lease has expired. If it has, or if the TTL can no longer be queried, the script kills the lnd process. This ensures that there is never more than one lnd leader node running at the same time.
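
With the script running on the leader node, losing the lease shows up in its output roughly like this (TTL values depend on timing; if the local etcd cannot answer at all, the script prints "Failed to get ttl" instead before killing lnd):

Lease ttl=57
Lease ttl=52
Lease expired
Killed lnd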

To ensure that lnd is always started and stopped together with this script, I used a combination of Requisite=, After= and StopPropagatedFrom= when configuring the systemd services.

lnd.service:

[Unit]
Description=LND Lightning Daemon
Requisite=etcd.service
Requisite=lelem.service
After=etcd.service
After=lelem.service
StopPropagatedFrom=etcd.service
StopPropagatedFrom=lelem.service

[Service]
ExecStart=/usr/bin/lnd

User=bitcoin
Group=bitcoin

Restart=always
RestartSec=30

Type=notify

TimeoutStartSec=infinity
TimeoutStopSec=1800

ProtectSystem=full
NoNewPrivileges=true
PrivateDevices=true
MemoryDenyWriteExecute=true

[Install]
WantedBy=multi-user.target

lelem.service:

[Unit]
Description=LND ETCD Lease Expiry Monitor
StopPropagatedFrom=lnd.service
Upholds=lnd.service

[Service]
ExecStart=/usr/local/bin/lelem.sh

User=bitcoin
Group=bitcoin

Restart=always

Type=exec

ProtectSystem=full
NoNewPrivileges=true
PrivateDevices=true
MemoryDenyWriteExecute=true

[Install]
WantedBy=multi-user.target
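
Both services then need to be enabled so they come up together at boot (assuming the script above is installed as /usr/local/bin/lelem.sh, as referenced by the unit file):

systemctl enable --now lelem.service lnd.service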

This actually works. But ideally we shouldn't need this hack; lnd should handle this internally.

Roasbeef commented 1 month ago

Discussed this issue earlier today with @bhandras and we have a solution in mind, thanks for bringing this to our attention!