
[bug]: postgres error: could not serialize access due to read/write dependencies among transactions (SQLSTATE 40001) #8049

Closed blckbx closed 6 months ago

blckbx commented 1 year ago

Background

I've been observing an issue with the PostgreSQL DB since version 0.17 regarding serialization of reads/writes (probably because lnd now surfaces SQL errors in the logs?). Before 0.17 I used to compact on-the-fly while lnd was running in the background. With 0.17 I ran into this serialization error when compacting (VACUUM FULL). Today this error also led to a force close of a channel (no compaction was running at the time):

2023-10-01 13:21:03.430 [ERR] HSWC: ChannelLink(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): failing link: unable to accept revocation: ERROR: could not serialize access due to read/write dependencies among transactions (SQLSTATE 40001) with error: invalid revocation

followed by multiple

2023-10-01 13:21:38.721 [INF] CNCT: Attempting to update ContractSignals for ChannelPoint(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3)
2023-10-01 13:21:39.055 [INF] PEER: Peer(020d1617e27ac022395352f2b3774969593d3d6ddff6fb117d820a9dda8da45217): unable to read message from peer: read next header: EOF
2023-10-01 13:21:39.055 [WRN] HSWC: ChannelLink(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): error when syncing channel states: first message sent to sync should be ChannelReestablish, instead received: *lnwire.Error
2023-10-01 13:21:39.055 [INF] PEER: Peer(020d1617e27ac022395352f2b3774969593d3d6ddff6fb117d820a9dda8da45217): disconnecting 020d1617e27ac022395352f2b3774969593d3d6ddff6fb117d820a9dda8da45217@23.88.103.101:40009, reason: read handler closed
2023-10-01 13:21:39.055 [ERR] HSWC: ChannelLink(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): failing link: unable to synchronize channel states: first message sent to sync should be ChannelReestablish, instead received: *lnwire.Error with error: unable to resume channel, recovery required
2023-10-01 13:21:39.055 [INF] HSWC: ChannelLink(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): exited
2023-10-01 13:21:39.155 [INF] HSWC: ChannelLink(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): stopping

resulting in

2023-10-01 13:44:00.128 [DBG] NTFN: Found spend of outpoint=61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3, script=0 f42feb87b57ea3d00cf87591efeb58fd9dbc8ec913f35ed3f003780bf93d20b3: spend_tx=ef66de761c1253ae85d7d3deb4f3791c52d44484b9a51084c5da3d7262574891, block_height=810154
2023-10-01 13:44:00.385 [DBG] NTFN: Found spend of outpoint=61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3, script=0 f42feb87b57ea3d00cf87591efeb58fd9dbc8ec913f35ed3f003780bf93d20b3: spend_tx=ef66de761c1253ae85d7d3deb4f3791c52d44484b9a51084c5da3d7262574891, block_height=810154
2023-10-01 13:44:00.385 [DBG] NTFN: Spend request outpoint=61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3, script=0 f42feb87b57ea3d00cf87591efeb58fd9dbc8ec913f35ed3f003780bf93d20b3 spent at tip=810154
2023-10-01 13:44:10.989 [INF] NTFN: Dispatching confirmed spend notification for outpoint=61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3, script=0 f42feb87b57ea3d00cf87591efeb58fd9dbc8ec913f35ed3f003780bf93d20b3 at current height=810154: ef66de761c1253ae85d7d3deb4f3791c52d44484b9a51084c5da3d7262574891[0] spending 61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3 at height=810154
2023-10-01 13:44:10.989 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): attempting state step with trigger=chainTrigger from state=StateDefault
2023-10-01 13:44:10.990 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): new block (height=810154) examining active HTLC's
2023-10-01 13:44:10.990 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): checking commit chain actions at height=810154, in_htlc_count=0, out_htlc_count=0
2023-10-01 13:44:10.991 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): no actions for chain trigger, terminating
2023-10-01 13:44:10.993 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): terminating at state=StateDefault
2023-10-01 13:44:11.257 [INF] CNCT: Unilateral close of ChannelPoint(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3) detected
2023-10-01 13:44:11.258 [INF] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): remote party has closed channel out on-chain
2023-10-01 13:44:22.062 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): attempting state step with trigger=remoteCloseTrigger from state=StateDefault
2023-10-01 13:44:22.062 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): new block (height=810154) examining active HTLC's
2023-10-01 13:44:22.062 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): checking commit chain actions at height=810154, in_htlc_count=0, out_htlc_count=0
2023-10-01 13:44:22.428 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): attempting state step with trigger=remoteCloseTrigger from state=StateContractClosed
2023-10-01 13:44:22.463 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): checking commit chain actions at height=810154, in_htlc_count=0, out_htlc_count=0
2023-10-01 13:44:22.621 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): inserting 1 contract resolvers
2023-10-01 13:44:22.847 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): attempting to resolve *contractcourt.commitSweepResolver
2023-10-01 13:44:22.847 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): contract *contractcourt.commitSweepResolver not yet resolved
2023-10-01 13:44:22.978 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): attempting state step with trigger=remoteCloseTrigger from state=StateWaitingFullResolution
2023-10-01 13:44:22.978 [INF] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): still awaiting contract resolution
2023-10-01 13:44:22.997 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): still have unresolved contract: *contractcourt.commitSweepResolver
2023-10-01 13:44:22.997 [DBG] CNCT: ChannelArbitrator(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): terminating at state=StateWaitingFullResolution
2023-10-01 13:44:23.136 [DBG] CNCT: *contractcourt.commitSweepResolver(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): commit conf_height=810154, unlock_height=810154
2023-10-01 13:44:23.136 [DBG] CNCT: *contractcourt.commitSweepResolver(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): isDelayedOutput=false, isLocalCommitTx=false
2023-10-01 13:44:23.136 [INF] CNCT: *contractcourt.commitSweepResolver(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): Sweeping with witness type: CommitmentNoDelayTweakless
2023-10-01 13:44:23.136 [INF] CNCT: *contractcourt.commitSweepResolver(61430ab592efc91b2bc0241867c2d204d001f2f733b9f237ebf5f56d1996cbf6:3): sweeping commit output

Your environment

lnd.conf:

db.postgres.timeout=0
db.postgres.maxconnections=50
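
For context, the postgres backend in lnd is selected and tuned via the db.* options. A minimal sketch of a full section might look like the following (the DSN, credentials and database name are placeholders, not taken from this report):

db.backend=postgres
db.postgres.dsn=postgresql://lnd:lnd@localhost:5432/lnd?sslmode=disable
db.postgres.timeout=0
db.postgres.maxconnections=50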

Steps to reproduce

Reproduction method unknown.

Expected behaviour

No SQL error.

Actual behaviour

Serialization error in database.

Logs:
FC log: fc.log
SQLSTATE 40001 while compacting: compacting.log

tl;dr: I think this is the postgres equivalent of #7869. I'll try setting db.postgres.timeout to 10m. Although I'm not sure why lnd is not retrying transactions as specified in #7960 (running latest 0.17-rc6).

guggero commented 1 year ago

Since https://github.com/lightningnetwork/lnd/pull/7927 we do re-try on certain SQL failures. Perhaps we just need to catch SQLSTATE 40001 as a re-tryable error. Setting the DB timeout probably isn't a bad idea, though I wouldn't set the value that high. Maybe try with 1 or 5 minutes first? Things should also get better with https://github.com/lightningnetwork/lnd/pull/7992, which hopefully will make it in for v0.17.1-beta.

blckbx commented 1 year ago

Since #7927 we do re-try on certain SQL failures. Perhaps we just need to catch SQLSTATE 40001 as a re-tryable error.

Reading https://github.com/lightningnetwork/lnd/blob/master/sqldb/sqlerrors.go#L70-L74, with pgerr.SerializationFailure referencing https://github.com/jackc/pgerrcode/blob/master/errcode.go#L218 - which actually is SQLSTATE 40001 - I assumed this is already the case.
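
For illustration, below is a minimal sketch of the pattern being discussed: classify SQLSTATE 40001 (pgerrcode.SerializationFailure) as retryable via the jackc/pgconn error type and wrap the transaction in a small retry loop. This is not lnd's actual implementation; the helper names, attempt count and backoff are assumptions, and it presumes the pgx stdlib driver is in use.

package sqlretry

import (
    "context"
    "database/sql"
    "errors"
    "time"

    "github.com/jackc/pgerrcode"
    "github.com/jackc/pgx/v5/pgconn"
)

// isSerializationError reports whether err carries SQLSTATE 40001
// (pgerrcode.SerializationFailure), i.e. a retryable conflict.
func isSerializationError(err error) bool {
    var pgErr *pgconn.PgError
    return errors.As(err, &pgErr) &&
        pgErr.Code == pgerrcode.SerializationFailure
}

// withRetry runs fn inside a serializable transaction and retries a few
// times if any statement or the commit fails with SQLSTATE 40001.
func withRetry(ctx context.Context, db *sql.DB,
    fn func(*sql.Tx) error) error {

    const maxAttempts = 5
    var err error
    for attempt := 1; attempt <= maxAttempts; attempt++ {
        var tx *sql.Tx
        tx, err = db.BeginTx(ctx, &sql.TxOptions{
            Isolation: sql.LevelSerializable,
        })
        if err != nil {
            return err
        }

        if err = fn(tx); err == nil {
            err = tx.Commit()
        }
        if err == nil {
            return nil
        }
        _ = tx.Rollback()

        // Only serialization conflicts are worth retrying; any other
        // error is returned to the caller as-is.
        if !isSerializationError(err) {
            return err
        }

        // Back off briefly before retrying the conflicting transaction.
        time.Sleep(time.Duration(attempt) * 50 * time.Millisecond)
    }
    return err
}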

Setting the DB timeout probably isn't a bad idea, though I wouldn't set the value that high. Maybe try with 1 or 5 minutes first?

Okay, I'll try 5 minutes and see how it goes. Usually compaction takes less than that.

Things should also get better with #7992, which hopefully will make it in for v0.17.1-beta.

Yeah, I can't wait to try out #7992! Although a huge key-value table is still a restriction with regard to (b)locking techniques, so I'm also looking forward to 0.18 and the first SQL schemas.

Roasbeef commented 1 year ago

So https://github.com/lightningnetwork/lnd/pull/7992 also fixes issues with the retry logic: previously, certain errors weren't properly wrapped, so they weren't detected as serialization errors.

Interesting that you're running into it as is, though, since we have an in-process mutex that should limit access to a single writer.

EDIT: ah, reading it again I see you're running a background vacuum; that could trigger the retry logic there.

blckbx commented 1 year ago

Okay, this explains why vacuuming the database with the same database user that lnd is set up with throws the error. I can reproduce the error by vacuuming in the background, regardless of the timeout setting.

Then I'm not sure what caused the serialization error that led to the force close. Could it be heavy usage due to parallelization of rebalancing and forwarding?

I'll keep an eye on it.

blckbx commented 1 year ago

Today I compacted the database manually as user "postgres" instead. This correctly pauses all DB actions for lnd and resumes them after compaction has finished. No errors in lnd's log. I'm still not sure what caused the force close, but concurrent database access with the same lnd user is the closest explanation I can think of.
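
For example, running the compaction as the postgres superuser from a separate shell could look like this (the database name "lnd" here is an assumption):

sudo -u postgres psql -d lnd -c 'VACUUM FULL;'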

blckbx commented 1 year ago

Tried another compaction run, which caused LND to shut down again. With config:

db.postgres.timeout=2m

Compaction duration: 200.74 sec
Compaction start time: 2023-10-08 16:41:50
First error sighting: 2023-10-08 16:42:10

2023-10-08 16:42:10.893 [CRT] CHDB: Caught unhandled error: ERROR: could not serialize access due to read/write dependencies among transactions (SQLSTATE 40001)
2023-10-08 16:42:10.894 [INF] CHDB: Sending request for shutdown
2023-10-08 16:42:10.894 [INF] LTND: Received shutdown request.
2023-10-08 16:42:10.894 [INF] LTND: Shutting down...
saubyk commented 11 months ago

This issue will be addressed with https://github.com/lightningnetwork/lnd/pull/7992