Since https://github.com/lightningnetwork/lnd/pull/7927 we do re-try on certain SQL failures. Perhaps we just need to catch SQLSTATE 40001 as a re-tryable error.

Setting the DB timeout probably isn't a bad idea, though I wouldn't set the value that high. Maybe try with 1 or 5 minutes first?

Things should also get better with https://github.com/lightningnetwork/lnd/pull/7992, which hopefully will make it in for v0.17.1-beta.
> Since #7927 we do re-try on certain SQL failures. Perhaps we just need to catch SQLSTATE 40001 as a re-tryable error.
Reading https://github.com/lightningnetwork/lnd/blob/master/sqldb/sqlerrors.go#L70-L74, where pgerr.SerializationFailure references https://github.com/jackc/pgerrcode/blob/master/errcode.go#L218 (which actually is SQLSTATE 40001), I assumed this was already the case.
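For context, here is a minimal, self-contained sketch of that mapping using the jackc pgconn and pgerrcode packages. It is my own illustration of how SQLSTATE 40001 can be classified as retryable, not lnd's actual sqlerrors.go code:

```go
package main

import (
	"errors"
	"fmt"

	"github.com/jackc/pgerrcode"
	"github.com/jackc/pgx/v5/pgconn"
)

// isSerializationError reports whether err carries SQLSTATE 40001
// (pgerrcode.SerializationFailure) anywhere in its chain, i.e. whether
// the failed transaction is a candidate for a retry.
func isSerializationError(err error) bool {
	var pgErr *pgconn.PgError
	if errors.As(err, &pgErr) {
		return pgErr.Code == pgerrcode.SerializationFailure // "40001"
	}
	return false
}

func main() {
	// Simulate the error Postgres returns when a concurrent VACUUM FULL
	// (or any conflicting transaction) breaks serializability.
	err := &pgconn.PgError{
		Code:    pgerrcode.SerializationFailure,
		Message: "could not serialize access due to read/write dependencies among transactions",
	}
	fmt.Println(isSerializationError(fmt.Errorf("tx failed: %w", err))) // true
}
```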
> Setting the DB timeout probably isn't a bad idea, though I wouldn't set the value that high. Maybe try with 1 or 5 minutes first?
Okay, I'll try 5 minutes and see how it goes. Usually compaction takes less than that.
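For anyone following along, that amounts to a single line in lnd.conf (5m here is just the value being tried, not a recommended default):

db.postgres.timeout=5m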
> Things should also get better with #7992, which hopefully will make it in for v0.17.1-beta.
Yeah, can't wait to try out #7992! Although a huge key-value table is still a restriction regarding (b)locking, so I'm also looking forward to 0.18 and the first SQL schemas.
So https://github.com/lightningnetwork/lnd/pull/7992 also fixes issues with the retry logic: before, certain errors weren't properly wrapped, so they weren't detected as serialization errors.

Interesting that you're running into it as is though, since we have an in-process mutex that should limit things to just a single writer.

EDIT: ah, reading it again, I see you're running a background vacuum; that could trigger the retry logic there.
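To make the wrapping point concrete, a small standalone sketch (my own example, not lnd's code) of why the retry logic only sees the serialization failure when the error chain is preserved with %w:

```go
package main

import (
	"errors"
	"fmt"

	"github.com/jackc/pgerrcode"
	"github.com/jackc/pgx/v5/pgconn"
)

func main() {
	// A serialization failure as the Postgres driver would report it.
	pgErr := &pgconn.PgError{Code: pgerrcode.SerializationFailure}

	// %w preserves the chain, so errors.As can still find *pgconn.PgError
	// and the failure can be classified as retryable.
	wrapped := fmt.Errorf("unable to commit tx: %w", pgErr)

	// %v flattens the error into a plain string; the type information is
	// lost, so the same failure would surface as an unhandled error.
	flattened := fmt.Errorf("unable to commit tx: %v", pgErr)

	var target *pgconn.PgError
	fmt.Println(errors.As(wrapped, &target))   // true  -> retried
	fmt.Println(errors.As(flattened, &target)) // false -> unhandled
}
```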
Okay, this explains why vacuuming the database with the same database user that lnd is set up with throws the error. I can reproduce the error with a background vacuum regardless of the timeout setting.
Then I'm not sure what caused the serialization error that led to the force close. Could it be heavy usage due to parallelization of rebalancing and forwarding?
I'll keep an eye on it.
Today I compacted the database manually with the user "postgres" instead. This correctly pauses all DB actions for lnd and resumes them after compaction has finished. No errors in lnd's log. Still not sure what caused the force close, but concurrent database access with the same lnd user is the closest explanation I can think of.
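For reference, the manual compaction described here is essentially a VACUUM FULL run as the postgres superuser, along the lines of the following (lnd_db is a placeholder for the actual lnd database name):

sudo -u postgres psql -d lnd_db -c "VACUUM FULL;"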
Tried another compaction run, which caused lnd to shut down again. With config:

db.postgres.timeout=2m

Compacting duration: 200.74 sec
Compaction starting time: 2023-10-08 16:41:50
First error sighting: 2023-10-08 16:42:10
2023-10-08 16:42:10.893 [CRT] CHDB: Caught unhandled error: ERROR: could not serialize access due to read/write dependencies among transactions (SQLSTATE 40001)
2023-10-08 16:42:10.894 [INF] CHDB: Sending request for shutdown
2023-10-08 16:42:10.894 [INF] LTND: Received shutdown request.
2023-10-08 16:42:10.894 [INF] LTND: Shutting down...
This issue will be addressed with https://github.com/lightningnetwork/lnd/pull/7992
Background
I've been observing an issue with the PostgreSQL DB since version 0.17 regarding serialization of reads/writes (probably because lnd now exposes SQL errors to the logs?). Before 0.17 I used to compact on the fly while lnd was running in the background. With 0.17 I ran across this serialization error when compacting (VACUUM FULL). Today this error also led to a force close of a channel (no compaction happened at that time):
followed by multiple
resulting in
Your environment

- lnd: 0.17-rc6
- uname -a (on *Nix): Linux 5.15.0-84-generic 93-Ubuntu SMP Tue Sep 5 17:16:10 UTC 2023 x86_64
- btcd, bitcoind, or other backend: bitcoind v25
- lnd.conf:
Steps to reproduce
Reproduction method unknown.
Expected behaviour
No SQL error.
Actual behaviour
Serialization error in database.
Logs: FC log: fc.log; SQLSTATE 40001 while compacting: compacting.log
tl;dr: I think this is the equivalent of #7869 for postgres. I'll try to set db.postgres.timeout to 10m. Although I'm not sure why lnd is not retrying transactions as specified in #7960 (running latest 0.17-rc6).