lightningnetwork / lnd

Lightning Network Daemon ⚡️
MIT License

Native instance lock when using Postgres #6894

Open joostjager opened 2 years ago

joostjager commented 2 years ago

With bbolt, it isn't possible to run two lnd instances against the same database simultaneously; bbolt guards against this with an exclusive file lock. With postgres, however, there is no such mechanism in place by default. Running two instances with postgres in parallel could have bad consequences.

In a k8s setup, this could happen more easily than one would think. Rolling deployments for example bring up the new instance before terminating the old one.

Leader election is the current solution that lnd offers, but it requires an etcd instance. See also https://github.com/lightningnetwork/lnd/issues/6887

An alternative could be to use a postgres-based lock using for example https://github.com/cirello-io/pglock.
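To make the proposal concrete, here is a minimal sketch of what such a startup guard could look like. The `distLocker` interface and `guardStartup` function are hypothetical names for illustration; the real pglock API is richer (lease durations, heartbeats), and the in-memory locker below only stands in for a postgres-backed one so the sketch runs without a database.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// distLocker is a minimal, hypothetical interface for a postgres-backed
// lock such as cirello.io/pglock. Only acquisition is modeled here.
type distLocker interface {
	// Acquire attempts to take the named lock, failing if another
	// instance already holds it.
	Acquire(name string) (release func(), err error)
}

// memLocker is an in-memory stand-in so the sketch runs without a
// database. In production this would be backed by a postgres table
// (pglock) or an advisory lock.
type memLocker struct {
	mu   sync.Mutex
	held map[string]bool
}

func newMemLocker() *memLocker {
	return &memLocker{held: make(map[string]bool)}
}

func (m *memLocker) Acquire(name string) (func(), error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.held[name] {
		return nil, errors.New("lock already held by another instance")
	}
	m.held[name] = true
	release := func() {
		m.mu.Lock()
		defer m.mu.Unlock()
		delete(m.held, name)
	}
	return release, nil
}

// guardStartup refuses to start a second instance against the same
// database: the first caller wins, later callers get an error.
func guardStartup(l distLocker, instance string) (func(), error) {
	release, err := l.Acquire("lnd-instance-lock")
	if err != nil {
		return nil, fmt.Errorf("%s: refusing to start: %w", instance, err)
	}
	return release, nil
}

func main() {
	l := newMemLocker()

	release, err := guardStartup(l, "lnd-1")
	fmt.Println("lnd-1 started:", err == nil)

	// A second instance must not come up while the first holds the lock.
	_, err = guardStartup(l, "lnd-2")
	fmt.Println("lnd-2 blocked:", err != nil)

	// After the first instance shuts down cleanly, a new one may start.
	release()
	_, err = guardStartup(l, "lnd-2")
	fmt.Println("lnd-2 started:", err == nil)
}
```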

bhandras commented 2 years ago

Another alternative to locking would be to implement native k8s leader election support in lnd.

joostjager commented 2 years ago

Yes, I considered that too. I don't really like the fact that by default, there is no safety mechanism.

Roasbeef commented 2 years ago

Leader election is abstracted away, so one could use this interface to add w/e other options they want: https://github.com/lightningnetwork/lnd/blob/master/cluster/interface.go#L13-L26

Then ideally we'd also promote this to the top-level config. This would let people instantiate a new version of lnd w/ a custom leader elector that could do arbitrary things like wait for a text message to know who's the leader with some telecom API.

FWIW, the etcd distributed lease API has undergone some pretty significant evaluation: https://jepsen.io/analyses/etcd-3.4.3. That analysis led to a larger overhaul two years ago, with a lot of fundamental fixes to ensure the API's correctness/safety: https://etcd.io/blog/2020/jepsen-343-results/. After all, this is the very same leader election API that k8s uses under the hood for its own synchronization.

The library you linked appears to do something somewhat custom, and doesn't use postgres' advisory lock feature. There are also some pitfalls in rolling a distributed lease on top of that base feature: https://rclayton.silvrback.com/distributed-locking-with-postgres-advisory-locks.
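For reference, postgres' advisory lock functions key on a 64-bit integer, so the usual pattern is to derive a key from a lock name. A small sketch, with FNV-1a as an arbitrary hash choice and `lnd-instance-lock` as a hypothetical lock name (the advisory lock functions `pg_try_advisory_lock`/`pg_advisory_unlock` are real postgres builtins):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// advisoryKey derives the signed 64-bit key that postgres advisory
// lock functions expect from a human-readable lock name. FNV-1a is an
// arbitrary choice; any stable hash works as long as every lnd
// instance derives the same key for the same name.
func advisoryKey(name string) int64 {
	h := fnv.New64a()
	h.Write([]byte(name))
	return int64(h.Sum64())
}

func main() {
	key := advisoryKey("lnd-instance-lock")

	// Session-level advisory locks are held until released or until
	// the session (connection) ends -- which is exactly the pitfall
	// the linked article discusses: a dropped connection silently
	// frees the lock.
	fmt.Printf("SELECT pg_try_advisory_lock(%d);\n", key)
	fmt.Printf("SELECT pg_advisory_unlock(%d);\n", key)
}
```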

etcd doesn't excel as a general-purpose database, but it is pretty good at its original purpose: loosely coupled distributed system synchronization and service discovery. That said, I do agree it would be nice to offer leader election to those running just postgres (though if they're running k8s, they already have the etcd API available).

> I don't really like the fact that by default, there is no safety mechanism.

What do you mean by this?

joostjager commented 2 years ago

Yes, I'm not totally sure about that postgres library either. Two bigger projects (a few thousand stars each) use it, so that's something. The one thing I do like about it is that it doesn't require an additional system beyond postgres itself, which lnd may already be using.

I don't think advisory locks cut it in all circumstances. As soon as the db connection is lost, the lock is released. If there is additional processing happening, such as sending messages to peers or other non-db external activity, that work can still run in parallel with a new instance that has already obtained the lock.
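One common mitigation for exactly this hazard is fencing tokens: each lock acquisition hands out a monotonically increasing epoch, and the database rejects writes carrying a stale epoch. A minimal in-memory sketch (all names illustrative, not lnd APIs); note this only fences db writes, not external side effects like messages already sent to peers:

```go
package main

import (
	"errors"
	"fmt"
)

// fencedDB simulates a store that tracks the newest lock epoch and
// rejects writes from holders of an older, superseded lock.
type fencedDB struct {
	epoch  uint64 // highest epoch handed out; lower writes are stale
	record string
}

var errStale = errors.New("stale epoch: lock was lost")

// acquire simulates taking the lock and returns a new fencing token.
func (db *fencedDB) acquire() uint64 {
	db.epoch++
	return db.epoch
}

// write applies a change only if the caller still holds the newest
// lock epoch.
func (db *fencedDB) write(epoch uint64, val string) error {
	if epoch < db.epoch {
		return errStale
	}
	db.record = val
	return nil
}

func main() {
	db := &fencedDB{}

	old := db.acquire() // instance A takes the lock
	cur := db.acquire() // A's connection drops; instance B takes over

	// A finishes its in-flight work and tries to write: rejected.
	fmt.Println("A write:", db.write(old, "from-A"))
	// B writes with the current epoch: accepted.
	fmt.Println("B write:", db.write(cur, "from-B"), db.record)
}
```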

> I don't really like the fact that by default, there is no safety mechanism.

What I mean by this is that if you run lnd with default settings on postgres, there is no protection against multiple instances accessing the db in parallel. With bbolt, you didn't need to worry about that.