lightningnetwork / lnd

Lightning Network Daemon ⚡️
MIT License

LN node lost track of all the channels. #6616

Closed · janvojt closed this issue 2 years ago

janvojt commented 2 years ago

Background

I have had an LN node set up with bitcoind and lnd for about 3 years and had no issues so far. Suddenly I noticed that all of my 19 channels had disappeared. lncli just says I have zero channels, while 1ML is still showing 19. I am not sure where to start. The only suspicious message in the logs is the following:

[ERR] PEER: resend failed: unable to fetch channel sync messages for peer x@h:p: unable to find closed channel summary

(where x is the pubkey of the remote peer I had a channel with)

Any lncli command I try simply acts as if I never had any channels, see below:

$ lncli getinfo
{
    "version": "0.14.3-beta commit=v0.14.3-beta",
    "commit_hash": "bd0c46b4fcb027af1915bd67a3da70e8ca5c6efe",
    "identity_pubkey": "xxx",
    "alias": "xxx",
    "color": "xxx",
    "num_pending_channels": 0,
    "num_active_channels": 0,
    "num_inactive_channels": 0,
    "num_peers": 19,
    "block_height": 739227,
    "block_hash": "0000000000000000000230d1d7d748e6d61adf0bf9fbca769549ca3e0cf8c921",
    "best_header_timestamp": "1654327446",
    "synced_to_chain": true,
    "synced_to_graph": true,
    "testnet": false,
    "chains": [
        {
            "chain": "bitcoin",
            "network": "mainnet"
        }
    ],
...
$ lncli listchannels
{
    "channels": [
    ]
}
$ lncli walletbalance
{
    "total_balance": "x",
    "confirmed_balance": "x",
    "unconfirmed_balance": "0",
    "locked_balance": "0",
    "account_balance": {
        "default": {
            "confirmed_balance": "x",
            "unconfirmed_balance": "0"
        }
    }
}

(x above is the same number)

I have a backup process set up that copies channel.backup on every change, so the history should be in there, but I am not sure how to decode it to have a look. Judging from the size of the file, the SCBs should still be there. I guess my best bet to investigate this is to decode channel.db and see what is in there, but I am not sure how to do that (is there a tool for it?).
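From what I have found so far, guggero's chantools looks like it can dump both files. Something like this might work (paths and flags are my reading of the chantools README, so treat this as a guess):

```shell
# Decode the encrypted static channel backup; this needs the node's
# seed to decrypt, so it has to run where the wallet seed is available.
chantools dumpbackup --multi_file ~/.lnd/data/chain/bitcoin/mainnet/channel.backup

# Dump the raw contents of channel.db; run it on a copy while lnd is stopped.
chantools dumpchannels --channeldb ~/.lnd/data/graph/mainnet/channel.db
```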

Your environment

Steps to reproduce

No idea. If I were to take a wild guess, it might have been caused by an electricity blackout I had a week ago. But it lasted only a few seconds and the node booted up OK (it was synced to chain afterwards). I have about a week of logs, but I am not sure when the problem happened, and I do not see any major problems in them.

Expected behaviour

I expect to see my channels as before. Also, 1ML still shows my channels as active.

If the channels had been closed on the remote side, I would expect them to be listed among the closed channels (lncli closedchannels). I doubt that happened, because at least 3 of my 19 channels were zombies, with the remote host having been offline for years.

Actual behaviour

LND acts like I never had any channels.

guggero commented 2 years ago

What's the size of your .lnd/data/graph/mainnet/channel.db file? Did you do anything to that file or did it get deleted by accident?

janvojt commented 2 years ago

No, I did not do anything to the file, and it did not get deleted. Maybe it got corrupted because of the power outage?

Anyway, the file is there and is roughly 500 MB in size:

# ls -la data/graph/mainnet/channel.db 
-rw------- 1 bitcoin bitcoin 516329472 Jun  5 14:26 data/graph/mainnet/channel.db

Roasbeef commented 2 years ago

Have you checked both lncli closedchannels and lncli pendingchannels?

janvojt commented 2 years ago

Yes, I checked both. They were and are still empty:

$ lncli pendingchannels
{
    "total_limbo_balance": "0",
    "pending_open_channels": [
    ],
    "pending_closing_channels": [
    ],
    "pending_force_closing_channels": [
    ],
    "waiting_close_channels": [
    ]
}
$ lncli closedchannels
{
    "channels": [
    ]
}
guggero commented 2 years ago

Can you check whether chantools compactdb runs successfully and changes anything?

Maybe it got corrupted because of power outage?

Yes, the underlying bbolt database is very delicate and a power outage can lead to all sorts of data corruption. Usually this is noticed during compaction (you can also turn on db.bolt.auto-compact=1 in lnd to run the compaction regularly).
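In lnd.conf that looks roughly like this (flat option names as in lnd's sample config; the min-age value here is just an example):

```ini
; Run bbolt compaction on startup if the database hasn't been
; compacted within the last auto-compact-min-age.
db.bolt.auto-compact=true
db.bolt.auto-compact-min-age=168h
```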

janvojt commented 2 years ago

$ chantools-linux-amd64-v0.10.4/chantools compactdb --sourcedb channel.db --destdb compacted.db | head -n 10
2022-06-07 17:48:42.608 [INF] CHAN: chantools version v0.10.4 commit 
2022-06-07 17:48:42.608 [ERR] CHAN: Bucket 03801dc1fc241ebc9f9fcfcc343469ecfb429555a8312ecd904d5f406cc8c583a80a818a00033d0001 was nil! Probable data corruption suspected.
2022-06-07 17:48:42.608 [ERR] CHAN: Bucket 03801dc1fc241ebc9f9fcfcc343469ecfb429555a8312ecd904d5f406cc8c583a80aca0b0009f40001 was nil! Probable data corruption suspected.
2022-06-07 17:48:42.608 [ERR] CHAN: Bucket 03802f08967cdd0dac6b008ce27881695c2decd7c91392c97fa4fc067fb9d024dd096cc00005b60000 was nil! Probable data corruption suspected.
2022-06-07 17:48:42.608 [ERR] CHAN: Bucket 03802f08967cdd0dac6b008ce27881695c2decd7c91392c97fa4fc067fb9d024dd0aa6650007260000 was nil! Probable data corruption suspected.
2022-06-07 17:48:42.608 [ERR] CHAN: Bucket 038032301cdc87a98fd30039028ad9d9e4008442b957ad346a2489835eb37abe280ae4960007e90000 was nil! Probable data corruption suspected.
2022-06-07 17:48:42.608 [ERR] CHAN: Bucket 038032301cdc87a98fd30039028ad9d9e4008442b957ad346a2489835eb37abe280ae49e0007e00000 was nil! Probable data corruption suspected.
2022-06-07 17:48:42.608 [ERR] CHAN: Bucket 038032301cdc87a98fd30039028ad9d9e4008442b957ad346a2489835eb37abe280af0680003ea0000 was nil! Probable data corruption suspected.
2022-06-07 17:48:42.608 [ERR] CHAN: Bucket 038032301cdc87a98fd30039028ad9d9e4008442b957ad346a2489835eb37abe280b13e70007b00001 was nil! Probable data corruption suspected.
2022-06-07 17:48:42.608 [ERR] CHAN: Bucket 038032301cdc87a98fd30039028ad9d9e4008442b957ad346a2489835eb37abe280b13ed0008880001 was nil! Probable data corruption suspected.

...and the same error message goes on and on with different bucket IDs, 21,325 lines altogether.

It did produce the compacted db, is it safe to use?

$ ls -la *.db
-rw------- 1 janvojt janvojt 516329472 Jun  7 17:35 channel.db
-rw------- 1 janvojt janvojt 109268992 Jun  7 17:59 compacted.db

  1. Can I recover somehow, or did I just lose all the channels, with my only way forward being to close them using the backed-up SCB file?
  2. Is there any way to prevent getting into this situation in the future (other than a UPS to ensure I never get a blackout, and RAID)?

Thank you guys for your help!

guggero commented 2 years ago

It did produce the compacted db, is it safe to use?

Given the number of error messages, I would strongly advise against continuing to use that DB for anything other than closing out the channels. You can try starting lnd with the compacted DB and see whether the channels come back (or at least some of them). Then try to cooperatively close them to save on some fees. But long term you need to close the channels and start with a fresh DB, unfortunately.
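The swap itself would look something like this (paths are examples for a default mainnet setup; keep the original file around no matter what):

```shell
# Stop lnd so nothing holds the database open.
lncli stop

cd ~/.lnd/data/graph/mainnet

# Keep the corrupted original -- never delete it.
mv channel.db channel.db.corrupt
cp /path/to/compacted.db channel.db

# Restart lnd, then check whether any channels came back.
lncli listchannels
```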

A UPS is a good start. You can also take a look at the externally replicated DB options such as etcd or Postgres, which are less delicate. Hopefully there will soon be an SQLite option as well.
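For Postgres, the backend is selected in lnd.conf, roughly like this (the DSN is a placeholder):

```ini
; Example only -- replace user, password, host and database name.
db.backend=postgres
db.postgres.dsn=postgres://lnd:lnd@localhost:5432/lnd?sslmode=disable
```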

janvojt commented 2 years ago

Honestly, this is quite disappointing. Using a "delicate" database that cannot recover from a power failure as the default in software that handles financial transactions sounds like a terrible architectural decision.

I will look into using Postgres and starting from scratch; I did not know I could use a different database for lnd. Thanks for your time.

guggero commented 2 years ago

Honestly, this is quite disappointing. Using a "delicate" database that cannot recover from a power failure as the default in software that handles financial transactions sounds like a terrible architectural decision.

Well, I agree. That decision was made a long time ago and has always been quite a big pain. That's why we're putting a lot of work into migrating to other storage solutions, but that takes time. The tradeoffs and safety considerations are clearly documented. And any other single-instance DB (e.g. SQLite or single-node Postgres) probably doesn't handle power outages much better. So the real long-term solution is to use a synchronized remote DB cluster.
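For completeness, pointing lnd at a replicated etcd cluster looks roughly like this in lnd.conf (host and credentials are placeholders):

```ini
; Example only -- point these at a real etcd cluster.
db.backend=etcd
db.etcd.host=localhost:2379
db.etcd.user=lnd
db.etcd.pass=lnd
```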

Closing the issue as this was caused by data corruption due to a power outage.