tyzbit commented 6 years ago

Background

I have 9 lnd's sharing a single Bitcoind backend on a single server. I didn't open a channel with any of them in order to see this issue. I funded the first address generated from each lnd's wallet with some satoshis(which confirmed), and destroyed each lnd. Then, I restored each using the seed phrase, with multiple re-synchronizing from Bitcoind at once. Some of them (it seems the first ones to get a connection to Bitcoind) restored fine, with funds in their wallet. Others (the later ones to connect) seemed to miss their funding transactions. The address look-ahead window should have caught the funding transactions, as the addresses I sent to were the first generated addresses for each seed.

Your environment

version of lnd = 0.4.1-beta commit=09c997983127bb1e575d97c69aa364a82cfb1df7
which operating system (uname -a on *Nix) = Linux provingapoint 4.4.0-109-generic #132-Ubuntu SMP Tue Jan 9 19:52:39 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
version of btcd, bitcoind, or other backend = Bitcoin Core Daemon version v0.16.0.0-g4b4d7eb255
any other relevant environment details = All running in Docker containers, which probably exacerbated the connectivity issues.

I've confirmed the wallet starts recovery without issue:

2018-05-03 05:18:07.589 [WRN] LTND: open /root/.lnd/lnd.conf: no such file or directory
2018-05-03 05:18:07.589 [INF] LTND: Version 0.4.1-beta commit=09c997983127bb1e575d97c69aa364a82cfb1df7
2018-05-03 05:18:07.589 [INF] LTND: Active chain: Bitcoin (network=mainnet)
2018-05-03 05:18:07.592 [INF] CHDB: Checking for schema update: latest_version=0, db_version=0
2018-05-03 05:18:07.593 [INF] RPCS: Generating TLS certificates...
2018-05-03 05:18:07.616 [INF] RPCS: Done generating TLS certificates
2018-05-03 05:18:07.623 [INF] RPCS: password RPC server listening on 127.0.0.1:10009
2018-05-03 05:18:07.624 [INF] RPCS: password gRPC proxy started at 127.0.0.1:8080
2018-05-03 05:18:07.624 [INF] LTND: Waiting for wallet encryption password. Use `lncli create` to create wallet, or `lncli unlock` to unlock already created wallet.
2018-05-03 05:18:50.518 [INF] LNWL: Opened wallet
2018-05-03 05:18:50.519 [INF] LTND: Wallet recovery mode enabled with address lookahead of 250 addresses
2018-05-03 05:18:50.614 [INF] LTND: Primary chain is set to: bitcoin
2018-05-03 05:18:50.614 [INF] LTND: Initializing bitcoind backed fee estimator
2018-05-03 05:18:51.939 [INF] LNWL: Opened wallet
2018-05-03 05:18:51.944 [INF] LNWL: Started listening for blocks via ZMQ on tcp://bitcoind:29000
2018-05-03 05:18:53.281 [INF] LNWL: The wallet has been unlocked without a time limit
2018-05-03 05:18:53.283 [INF] LNWL: Catching up block hashes to height 520983, this will take a while...
2018-05-03 05:18:53.285 [INF] LTND: LightningWallet opened
2018-05-03 05:18:53.286 [INF] LNWL: RECOVERY MODE ENABLED -- rescanning for used addresses with recovery_window=250
2018-05-03 05:18:53.292 [INF] LNWL: Caught up to height 0

Steps to reproduce

Connect multiple lnds to a single Bitcoind backend+ZMQ, and restore them from some seeds with confirmed funds to an address derived from the seed.

Alternatively, I think if you drop some crucial packets on ZMQ you'll see similar results.

Expected behaviour

LND is aware when it's missing crucial information from its backend, and it recovers from data connectivity issues gracefully.

Actual behaviour

Connectivity issues with the Bitcoind backend appear to "trick" LND into the wrong state with regards to funding and possibly seeing crucial transactions.

cfromknecht commented 6 years ago

Hi @tyzbit, do you have more log output from the recovery attempts?

tyzbit commented 6 years ago

Unfortunately no, but I do plan on doing more testing (this time not on mainnet) and I will capture the logs of the nodes as they sync. I watched the other nodes sync however, and unless I missed some logs (possible), the only output was the normal block catchup output like so:

2018-05-03 07:08:21.664 [INF] LNWL: Caught up to height 470000
2018-05-03 07:08:45.032 [INF] LNWL: Caught up to height 480000
2018-05-03 07:09:08.806 [INF] LNWL: Caught up to height 490000
2018-05-03 07:09:32.974 [INF] LNWL: Caught up to height 500000
2018-05-03 07:09:55.988 [INF] LNWL: Caught up to height 510000
2018-05-03 07:10:20.065 [INF] LNWL: Caught up to height 520000

tyzbit commented 6 years ago

I did a 6 node setup on testnet and saw similar results. I won't pretend Docker doesn't complicate things, but it is an easy way to bring up and tear down nodes quickly. Relevant logs are linked inline below.

Below, TestnetLNDSeed1 and TestnetLNDSeed2 refer to instances of lnd restored with a particular seed (TestnetLNDSeed1 always uses the same seed, TestnetLNDSeed2 always uses its own distinct seed).

Methodology

Setup

6 nodes, running in Docker. 1 bitcoind backend, also running in Docker.

Generated a seed with each node, and saved it. Generated an address for each seed, and funded it with 100k testnet satoshis, which have confirmed.

I happened to note TestnetLNDSeed1 saw the funding transaction, while TestnetLNDSeed6 did not. (I verified that all funding transactions confirmed). The logs from the time the transaction should have shown were just "Caught up to height xxx" logs. Here is a log from TestnetLNDSeed1 that successfully saw a funding transaction from the backend while recovering.

Synchronizing all at once

I tore down every node, and re-created them by restoring from the seed, and let them fully sync.

Here is a log from TestnetLNDSeed1 after tearing it down and restoring from Seed1. This was synchronizing at the same time as the other 5, and it failed to find the funding TX for its address and shows a wallet balance of zero.

Here is a log from TestnetLNDSeed2 after tearing it down and restoring from Seed2. It also synchronized at the same time as the other 5, and also ended up showing a wallet balance of zero.

Results

All nodes, from TestnetLNDSeed1 to TestnetLNDSeed6 fully synchronized and showed 0 wallet balance.

$ for i in {1..6}; do docker exec -it TestnetLNDSeed$i lncli walletbalance; done
{
    "total_balance": "0",
    "confirmed_balance": "0",
    "unconfirmed_balance": "0"
}
{
    "total_balance": "0",
    "confirmed_balance": "0",
    "unconfirmed_balance": "0"
}
{
    "total_balance": "0",
    "confirmed_balance": "0",
    "unconfirmed_balance": "0"
}
{
    "total_balance": "0",
    "confirmed_balance": "0",
    "unconfirmed_balance": "0"
}
{
    "total_balance": "0",
    "confirmed_balance": "0",
    "unconfirmed_balance": "0"
}
{
    "total_balance": "0",
    "confirmed_balance": "0",
    "unconfirmed_balance": "0"
}

Restoring a Single Node with the Rest Running

I left TestnetLNDSeed2-6 running (and thus not heavily using the backend), and tore down TestnetLNDSeed1.

I recovered TestnetLNDSeed1 from its seed, and allowed it to catch up.

Results

Here are the relevant logs from its synchronization, which did find the UTXO for its address. Likewise, checking its balance shows the correct balance:

$ lncli walletbalance
{
    "total_balance": "100000",
    "confirmed_balance": "100000",
    "unconfirmed_balance": "0"
}

Checking every wallet balance:

$ for i in {1..6}; do docker exec -it TestnetLNDSeed$i lncli walletbalance; done
{
    "total_balance": "100000",
    "confirmed_balance": "100000",
    "unconfirmed_balance": "0"
}
{
    "total_balance": "0",
    "confirmed_balance": "0",
    "unconfirmed_balance": "0"
}
{
    "total_balance": "0",
    "confirmed_balance": "0",
    "unconfirmed_balance": "0"
}
{
    "total_balance": "0",
    "confirmed_balance": "0",
    "unconfirmed_balance": "0"
}
{
    "total_balance": "0",
    "confirmed_balance": "0",
    "unconfirmed_balance": "0"
}
{
    "total_balance": "0",
    "confirmed_balance": "0",
    "unconfirmed_balance": "0"
}

Like I mentioned above, Docker could be a complicating factor but I think it's only helping to expose the issue, not cause it because as I've shown, if I restore a single instance while every other instance is "idling", lnd does successfully catch up and arrive at the correct wallet balance.

Roasbeef commented 6 years ago

I'd try to do a single isolated node. This way we can ignore any overhead or latency, or communication or w/e introduced by docker.

Roasbeef commented 6 years ago

I'm also not sure howbitcoind's zmq integration handles multiple clients.

tyzbit commented 6 years ago

Which type of non-Docker testing would you prefer from here:

A single node + backend destroyed and rebuilt many times with checking the wallet balance after every rebuild
A single node + backend destroyed and rebuilt with something like 1% packetloss introduced between the node and its backend

I think the question I'm trying to answer is: "Does LND handle missed data from bitcoind intelligently (replaying what it needs when it detects it is missing information)" and if not, can it be improved to more reliably get synced to the true state arising from the underlying chain.

tyzbit commented 6 years ago

Just following up on this, I funded a new wallet with 11 separate addresses for which the funding txs ended up in 7 different blocks. Results below.

Method

I did a single node teardown and restore 8 times on testnet on a new VM, without using Docker.

System Info

lnd version 0.4.1-beta commit=bdf3f4d7753ce950bd14ed56ad96c12097f1b4c9 Bitcoin Core Daemon version v0.16.0.0-g4b4d7eb255, connected via localhost

Results

lnd arrived at the correct, expected wallet balance every time.

Next steps

I've separated lnd and bitcoind onto different machines in the same subnet and will now tear down and restore the same seed as above many times to see if lnd ever misses a funding tx.

Other avenues for testing

Doing multiple channel openings and closings between two nodes (with their own backends) to see if lnd ever misses an update from bitcoind
Running lnd+bitcoind for a long time and ensuring all on-chain events are caught.

tyzbit commented 6 years ago

Separately, I noticed bitcoind's default for rpcthreads is 4, which would explain my missing data in the 6/9 node configuration. The threads likely get overloaded with such heavy usage by multiple instances of lnd. I'm looking to see how many concurrent connections to the RPC lnd makes, but if it's more than 1, that would exacerbate the issue. Plus, if the users have other applications using their bitcoind RPC (or ZMQ), they could feasibly knock lnd off temporarily causing it to miss data.

Roasbeef commented 6 years ago

Separately, I noticed bitcoind's default for rpcthreads is 4, which would explain my missing data in the 6/9 node configuration.

Hmm, yeah that could explain why we'd miss events or even they'd be severely delayed at times. In light of this fact, we may want to update the docs to recommend one runs a bounded number of lnd instances using the bitcoind backend. btcd will ensure that each client gets a dedicated goroutine, so this isn't an issue for it.

tyzbit commented 6 years ago

I dug as far down as btcjson but couldn't answer my question of how many concurrent RPC connections lnd makes to bitcoind. I've tore down and restored my testnet lnd+remote bitcoind (local network) 5 times and so far, still getting the expected wallet balance. As mentioned above, once I've done this a couple more times I'll start testing elsewhere.

Separately, I will open a PR to update the bitcoind docs as you suggested.

Roasbeef commented 6 years ago

@tyzbit atm lnd will make 3 concurrent connections to bitcoind: the ChainNotifier, the wallet backend itself, and the FilteredChainView.

tyzbit commented 6 years ago

@Roasbeef That coincides with what my latest testing started showing.

Bringing the RPC threads down to 1, and 2 both showed the wrong wallet balances, but I expect rpcthreads=3 to show correctly.

I think having a good connection is still essential (so, don't connect to bitcoinds on the other side of the Internet), but it's most important not to have multiple lnds using the same backend. #1204 has been updated to advise 1 lnd per bitcoind.

tyzbit commented 6 years ago

I've finished the testing I plan to do, I never saw any issues with getting the right state from the chain with 1 LND, but I was able to replicate sporadic issues with 2+ LNDs.

As an aside, I tried upping the rpcthread count to 12 on bitcoind but still saw issues, so I don't think that's an acceptable proposed fix to have multiple LNDs per one bitcoind. I'm naiive as to exactly how ZMQ behaves with multiple consumers, but it could also play into this.

This issue can be closed from my perspective, unless there are outstanding questions, or further testing anyone's interested in me doing.

Alex-CodeLab commented 6 years ago

could you check memory usage? ZMQ starts to drop when the sendbuffer (in memory) is full.

tyzbit commented 6 years ago

bitcoind is actually using 700MB, while each lnd is using between 90-120MB. The nodes are fully synced and idling.

Roasbeef commented 6 years ago

Have discussed this with @aakselrod and we have a plan moving forward to modify lnd to be able to get around the limitations of bitcoind's zmq implementation.

Alex-CodeLab commented 6 years ago

The receiving side (SUB) has a default message buffer limit of 1000 messages, after that it starts to drop.
This could be increased (if you have enough memory), or disabled (set to 0) using socket option ZMQ_RCVHWM

This does however not guarantee anything (there still could be network issues for example), so adding something to request missing messages would still be needed if you want 100% reliability.

jonasschnelli commented 6 years ago

I have no idea how ZMQ is used between Core and LND, but an alternative would be using the p2p protocol for block or transaction "push" messages. Just connect to your (local) trusted peer over 8333 and wait for INV's with MSG_BLOCK (or whatever message you are looking for). Eventually whitelist LND.

If that is not suitable, then there is the possibility of implementing RPC (http) long-poll functions (see https://github.com/bitcoin/bitcoin/pull/13262). This can be done stateless and should be pretty safe in terms of lost messages.

Long polling for blocks similar to 13262 is something I will PR soon.

Roasbeef commented 6 years ago

I have no idea how ZMQ is used between Core and LND

We use it to receive notifications for new blocks. We've had to work around the way it handles re-orgs a bit, as rather than gives us the series of blocks that have been disconnected (like btcd), it instead just gives us the new tip.

Alex-CodeLab commented 6 years ago

I think I have a fix for the occasional dropped messages. will create a PR soon (bitcoind). But I don't know if it will help LND .

@tyzbit did you have zmq send rawtx as well? (that might be too much on some systems)

Roasbeef commented 6 years ago

@FeedTheWeb a fix for bitcoind?

tyzbit commented 6 years ago

@FeedTheWeb Yes, I did. I was under the impression rawtx and rawblock were non-optional?

Alex-CodeLab commented 6 years ago

@tyzbit that could be the case (I dont know much about LND). rawtx it creates lots of data, especially when a block is found, causing the buffer to fill up for about a second .

@Roasbeef yes, in bitcoind. the default send message limit for ZMQ is too low, so that is an easy fix.

sangaman commented 6 years ago

Just an FYI with my observations running the latest from master against bitcoind 0.16.1 on a remote machine. I actually synced lnd initially running on the same machine as bitcoind - that went fine - and then copied the .lnd directory over to a remote machine. Since then, I've noticed that synced_to_chain will frequently flip to false (although it reports an up-to-date block height) and then flip back to true when new blocks are found. Otherwise, things seem pretty stable - I can open channels and other nodes can open channels with me (although not consistently because sometimes it appears that my node is out of sync with the chain).

I'm not sure this is exactly the same issue as described here, but I figure it's at probably at least related.

slush0 commented 6 years ago

I observed frequent switching of sync to false (although lnd reports correct block height). I wrote watchdog for my lnd, which (gracefully) restart lnd when the node gets out of sync.

However, such workaround doesn't work well as it seems to trigger many random error and inconsistent behavior, like "uncooperative peer" error message in Eclair while opening the channel, some stuck transactions (lnd seems to ignore that some of funding transactions already got confirmed and it is still waiting for confirmations).

I hesitate to report all these quirks as most of them seems to be related to this issue. I just wanted to report that this might be a blocking issue for many other people, too.

Roasbeef commented 6 years ago

Yep, most of the symptoms would be in line with the issue at hand. This PR should resolve the issue but we haven't yet got around to testing and reviewing it yet. It is indeed a high priority for us though. Until that's in you can try to run this PR for bitcoind: https://github.com/bitcoin/bitcoin/pull/13315

The underlying issue is that when the interfaces were written, it assumed reliable delivery of all notifications like btcd's RPC interface. However, the current zmq implementation for uses a style of zmq socket that'll instead start to drop notifications if it gets above a certain high water mark.

slush0 commented 6 years ago

Unfortunately, after short time of endless happiness that bitcoin/bitcoin#13315 fixed all my issues, lnd switched to synced_to_chain=false again and does not flip back to true. It's getting frustrating, so I'll wait to accepting pull request 511.

slush0 commented 6 years ago

I've resolved my previously described issue. It was misconfiguration of bitcoin.conf; instead of enabling zmqpubrawblock, I accidentally enabled zmqpubhashblock!

This messed up lnd, because it showed correct blockhash in logs, but didn't receive raw block from bitcoind.

Although it was completely my mistake, I suppose lnd should not mistakenly report that it received block, when it did not.

lightningnetwork / lnd

Possibly missing updates from Bitcoind backend #1174

Background

Your environment

Steps to reproduce

Expected behaviour

Actual behaviour

Methodology

Setup

Synchronizing all at once

Results

Restoring a Single Node with the Rest Running

Results

Method

System Info

Results

Next steps

Other avenues for testing