tyzbit closed this issue 6 years ago.
Hi @tyzbit, do you have more log output from the recovery attempts?
Unfortunately no, but I plan to do more testing (this time not on mainnet) and will capture the logs of the nodes as they sync. I did watch the other nodes sync, however, and unless I missed some logs (possible), the only output was the normal block catch-up output, like so:
2018-05-03 07:08:21.664 [INF] LNWL: Caught up to height 470000
2018-05-03 07:08:45.032 [INF] LNWL: Caught up to height 480000
2018-05-03 07:09:08.806 [INF] LNWL: Caught up to height 490000
2018-05-03 07:09:32.974 [INF] LNWL: Caught up to height 500000
2018-05-03 07:09:55.988 [INF] LNWL: Caught up to height 510000
2018-05-03 07:10:20.065 [INF] LNWL: Caught up to height 520000
I did a 6 node setup on testnet and saw similar results. I won't pretend Docker doesn't complicate things, but it is an easy way to bring up and tear down nodes quickly. Relevant logs are linked inline below.
Below, TestnetLNDSeed1 and TestnetLNDSeed2 refer to instances of lnd
restored with a particular seed (TestnetLNDSeed1 always uses the same seed, TestnetLNDSeed2 always uses its own distinct seed).
6 nodes, running in Docker. 1 bitcoind backend, also running in Docker.
Generated a seed with each node, and saved it. Generated an address for each seed, and funded it with 100k testnet satoshis, which have confirmed.
I happened to note that TestnetLNDSeed1 saw the funding transaction while TestnetLNDSeed6 did not (I verified that all funding transactions confirmed). The logs from the time the transaction should have appeared were just "Caught up to height xxx" lines. Here is a log from TestnetLNDSeed1 that successfully saw a funding transaction from the backend while recovering.
I tore down every node, and re-created them by restoring from the seed, and let them fully sync.
Here is a log from TestnetLNDSeed1 after tearing it down and restoring from Seed1. This was synchronizing at the same time as the other 5, and it failed to find the funding TX for its address and shows a wallet balance of zero.
Here is a log from TestnetLNDSeed2 after tearing it down and restoring from Seed2. It also synchronized at the same time as the other 5, and also ended up showing a wallet balance of zero.
All nodes, from TestnetLNDSeed1 to TestnetLNDSeed6, fully synchronized and showed a wallet balance of 0.
$ for i in {1..6}; do docker exec -it TestnetLNDSeed$i lncli walletbalance; done
{
"total_balance": "0",
"confirmed_balance": "0",
"unconfirmed_balance": "0"
}
{
"total_balance": "0",
"confirmed_balance": "0",
"unconfirmed_balance": "0"
}
{
"total_balance": "0",
"confirmed_balance": "0",
"unconfirmed_balance": "0"
}
{
"total_balance": "0",
"confirmed_balance": "0",
"unconfirmed_balance": "0"
}
{
"total_balance": "0",
"confirmed_balance": "0",
"unconfirmed_balance": "0"
}
{
"total_balance": "0",
"confirmed_balance": "0",
"unconfirmed_balance": "0"
}
I left TestnetLNDSeed2-6 running (and thus not heavily using the backend), and tore down TestnetLNDSeed1.
I recovered TestnetLNDSeed1 from its seed, and allowed it to catch up.
Here are the relevant logs from its synchronization, which did find the UTXO for its address. Likewise, checking its balance shows the correct balance:
$ lncli walletbalance
{
"total_balance": "100000",
"confirmed_balance": "100000",
"unconfirmed_balance": "0"
}
Checking every wallet balance:
$ for i in {1..6}; do docker exec -it TestnetLNDSeed$i lncli walletbalance; done
{
"total_balance": "100000",
"confirmed_balance": "100000",
"unconfirmed_balance": "0"
}
{
"total_balance": "0",
"confirmed_balance": "0",
"unconfirmed_balance": "0"
}
{
"total_balance": "0",
"confirmed_balance": "0",
"unconfirmed_balance": "0"
}
{
"total_balance": "0",
"confirmed_balance": "0",
"unconfirmed_balance": "0"
}
{
"total_balance": "0",
"confirmed_balance": "0",
"unconfirmed_balance": "0"
}
{
"total_balance": "0",
"confirmed_balance": "0",
"unconfirmed_balance": "0"
}
As I mentioned above, Docker could be a complicating factor, but I think it's only helping to expose the issue, not causing it: as I've shown, if I restore a single instance while every other instance is idling, lnd does successfully catch up and arrive at the correct wallet balance.
I'd try a single isolated node. That way we can ignore any overhead, latency, or communication issues introduced by Docker. I'm also not sure how bitcoind's zmq integration handles multiple clients.
Which type of non-Docker testing would you prefer from here?
I think the question I'm trying to answer is: "Does lnd handle missed data from bitcoind intelligently (replaying what it needs when it detects it is missing information)?" And if not, can it be improved to more reliably sync to the true state of the underlying chain?
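To make the question concrete, here is a minimal, hypothetical sketch (pure Python, not lnd code) of what "replaying what it needs" could look like: track the last height delivered by the push channel, and backfill any skipped heights via an RPC callback. The names (`process_notifications`, `fetch_block`) are placeholders, not lnd internals.

```python
# Illustrative sketch (not lnd code): detect gaps in pushed block
# notifications and backfill the blocks the push channel dropped.

def process_notifications(notifications, fetch_block, start_height):
    """notifications: iterable of block heights received via push (e.g. ZMQ).
    fetch_block: callback to pull a missed block by height (e.g. via RPC).
    Returns the list of heights actually processed, in order."""
    processed = []
    last = start_height
    for height in notifications:
        # Backfill anything the push channel skipped over.
        for missed in range(last + 1, height):
            fetch_block(missed)
            processed.append(missed)
        processed.append(height)
        last = height
    return processed

# Example: notifications for 101 and 104 arrive; 102-103 were dropped.
fetched = []
result = process_notifications([101, 104], fetched.append, start_height=100)
print(result)   # [101, 102, 103, 104]
print(fetched)  # [102, 103] -- recovered via the RPC callback
```

The key property is that the consumer never trusts the push channel to be complete; every gap in the height sequence triggers a pull.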
Just following up on this, I funded a new wallet with 11 separate addresses for which the funding txs ended up in 7 different blocks. Results below.
I did a single node teardown and restore 8 times on testnet on a new VM, without using Docker.
lnd version 0.4.1-beta commit=bdf3f4d7753ce950bd14ed56ad96c12097f1b4c9, Bitcoin Core Daemon version v0.16.0.0-g4b4d7eb255, connected via localhost.
lnd arrived at the correct, expected wallet balance every time.
I've separated lnd and bitcoind onto different machines on the same subnet and will now tear down and restore the same seed as above many times to see if lnd ever misses a funding tx. After that, I plan to test whether lnd ever misses an update from bitcoind, and to run lnd+bitcoind for a long time while ensuring all on-chain events are caught.
Separately, I noticed bitcoind's default for rpcthreads is 4, which would explain my missing data in the 6/9 node configuration. The threads likely get overloaded by such heavy usage from multiple instances of lnd. I'm looking into how many concurrent connections lnd makes to the RPC interface, but if it's more than 1, that would exacerbate the issue. Plus, if users have other applications using their bitcoind RPC (or ZMQ), those could feasibly knock lnd off temporarily, causing it to miss data.
Separately, I noticed bitcoind's default for rpcthreads is 4, which would explain my missing data in the 6/9 node configuration.
Hmm, yeah, that could explain why we'd miss events, or why they'd be severely delayed at times. In light of this, we may want to update the docs to recommend running only a bounded number of lnd instances against the bitcoind backend. btcd will ensure that each client gets a dedicated goroutine, so this isn't an issue for it.
I dug as far down as btcjson but couldn't answer my question of how many concurrent RPC connections lnd makes to bitcoind. I've torn down and restored my testnet lnd + remote bitcoind (local network) 5 times and so far I'm still getting the expected wallet balance. As mentioned above, once I've done this a couple more times I'll start testing elsewhere.
Separately, I will open a PR to update the bitcoind docs as you suggested.
@tyzbit atm lnd will make 3 concurrent connections to bitcoind: the ChainNotifier, the wallet backend itself, and the FilteredChainView.
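Given roughly 3 RPC connections per instance, sizing the backend for several clients might look like the sketch below. This is an illustrative assumption, not a configuration taken from this thread; the thread count and endpoints are made-up example values, though `rpcthreads`, `zmqpubrawblock`, and `zmqpubrawtx` are real bitcoind options.

```ini
# bitcoin.conf (illustrative sketch -- values are assumptions, not
# recommendations from this thread)
# Each lnd instance holds roughly 3 concurrent RPC connections, so the
# thread pool (default 4) should be sized for every client you run.
rpcthreads=16
# ZMQ publishers that lnd subscribes to:
zmqpubrawblock=tcp://127.0.0.1:28332
zmqpubrawtx=tcp://127.0.0.1:28333
```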
@Roasbeef That coincides with what my latest testing started showing.
Bringing the RPC thread count down to 1 and to 2 both showed the wrong wallet balances, but I expect rpcthreads=3 to show the correct balance.
I think having a good connection is still essential (so, don't connect to bitcoinds on the other side of the Internet), but it's most important not to have multiple lnds using the same backend. #1204 has been updated to advise 1 lnd per bitcoind.
I've finished the testing I plan to do. I never saw any issues getting the right state from the chain with 1 lnd, but I was able to replicate sporadic issues with 2+ lnds.
As an aside, I tried upping the rpcthreads count to 12 on bitcoind but still saw issues, so I don't think raising it is an acceptable fix for running multiple lnds against one bitcoind. I'm naive as to exactly how ZMQ behaves with multiple consumers, but it could also play into this.
This issue can be closed from my perspective, unless there are outstanding questions, or further testing anyone's interested in me doing.
Could you check memory usage? ZMQ starts to drop messages when the send buffer (in memory) is full.
bitcoind is actually using 700MB, while each lnd is using between 90 and 120MB. The nodes are fully synced and idling.
Have discussed this with @aakselrod and we have a plan moving forward to modify lnd to be able to get around the limitations of bitcoind's zmq implementation.
The receiving side (SUB) has a default message buffer limit of 1000 messages; after that it starts to drop. This could be increased (if you have enough memory) or disabled (set to 0) using the socket option ZMQ_RCVHWM.
This does not guarantee anything, however (there could still be network issues, for example), so adding something to request missing messages would still be needed if you want 100% reliability.
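The drop-on-full behavior, and why the consumer needs sequence tracking to even notice it, can be modeled in a few lines of Python. This is a toy bounded buffer standing in for a SUB socket at its ZMQ_RCVHWM limit, not ZMQ itself.

```python
from collections import deque

# Toy model of a subscriber buffer with a high-water mark: once the
# buffer holds `hwm` undelivered messages, new ones are silently dropped,
# roughly what a ZMQ SUB socket does at its ZMQ_RCVHWM limit.
class BoundedSubscriber:
    def __init__(self, hwm=1000):
        self.hwm = hwm
        self.buffer = deque()
        self.dropped = 0

    def publish(self, msg):
        if len(self.buffer) >= self.hwm:
            self.dropped += 1  # silently lost; the sender is not told
        else:
            self.buffer.append(msg)

    def drain(self):
        msgs = list(self.buffer)
        self.buffer.clear()
        return msgs

# A slow consumer: 1500 messages arrive before it drains once.
sub = BoundedSubscriber(hwm=1000)
for seq in range(1500):
    sub.publish(seq)
received = sub.drain()
print(sub.dropped)   # 500 messages silently dropped
print(received[-1])  # last delivered sequence number is 999

# Detecting the loss requires sequence numbers (or block heights) on the
# consumer side -- which is why a "request missing messages" path is needed.
missing = set(range(1500)) - set(received)
print(len(missing))  # 500
```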
I have no idea how ZMQ is used between Core and LND, but an alternative would be using the p2p protocol for block or transaction "push" messages. Just connect to your (local) trusted peer over port 8333 and wait for INVs with MSG_BLOCK (or whatever message you are looking for). Optionally, whitelist LND.
If that is not suitable, there is the possibility of implementing RPC (HTTP) long-poll functions (see https://github.com/bitcoin/bitcoin/pull/13262). This can be done statelessly and should be pretty safe in terms of lost messages.
Long polling for blocks similar to 13262 is something I will PR soon.
I have no idea how ZMQ is used between Core and LND
We use it to receive notifications for new blocks. We've had to work around the way it handles re-orgs a bit: rather than giving us the series of blocks that have been disconnected (like btcd does), it instead just gives us the new tip.
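For illustration, the disconnected-block computation a notifier would otherwise hand us can be sketched like this. The chain here is a plain dict mapping each block to its parent; none of these names are lnd internals.

```python
# Toy illustration: given only the new tip, recover the list of blocks
# that were disconnected, by walking both tips back to a common ancestor.

def disconnected_blocks(parent_of, old_tip, new_tip):
    """parent_of: mapping block -> parent block (None for genesis's parent).
    Returns the blocks to disconnect (old branch, tip first)."""
    # Collect all ancestors of the new tip.
    new_branch = set()
    b = new_tip
    while b is not None:
        new_branch.add(b)
        b = parent_of.get(b)
    # Walk the old tip back until we reach a block on the new branch.
    disconnected = []
    b = old_tip
    while b not in new_branch:
        disconnected.append(b)
        b = parent_of[b]
    return disconnected

# A <- B <- C (old tip), and A <- B' <- C' <- D' (new, heavier tip)
parent = {"A": None, "B": "A", "C": "B", "B'": "A", "C'": "B'", "D'": "C'"}
print(disconnected_blocks(parent, "C", "D'"))  # ['C', 'B']
```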
I think I have a fix for the occasional dropped messages; I will create a PR soon (against bitcoind). But I don't know if it will help LND.
@tyzbit did you have zmq send rawtx as well? (that might be too much on some systems)
@FeedTheWeb a fix for bitcoind?
@FeedTheWeb Yes, I did. I was under the impression rawtx and rawblock were non-optional?
@tyzbit that could be the case (I don't know much about LND). rawtx creates lots of data, especially when a block is found, causing the buffer to fill up for about a second.
@Roasbeef yes, in bitcoind. The default send message limit for ZMQ is too low, so that is an easy fix.
Just an FYI with my observations running the latest from master against bitcoind 0.16.1 on a remote machine. I actually synced lnd initially running on the same machine as bitcoind (that went fine) and then copied the .lnd directory over to a remote machine. Since then, I've noticed that synced_to_chain will frequently flip to false (although it reports an up-to-date block height) and then flip back to true when new blocks are found. Otherwise, things seem pretty stable: I can open channels and other nodes can open channels with me (although not consistently, because sometimes my node appears to be out of sync with the chain).
I'm not sure this is exactly the same issue as described here, but I figure it's probably at least related.
I observed frequent switching of synced_to_chain to false (although lnd reports the correct block height). I wrote a watchdog for my lnd, which (gracefully) restarts lnd when the node gets out of sync.
However, this workaround doesn't work well, as it seems to trigger many random errors and inconsistent behavior, like an "uncooperative peer" error message in Eclair while opening a channel, and some stuck transactions (lnd seems to ignore that some funding transactions already got confirmed and keeps waiting for confirmations).
I hesitate to report all these quirks, as most of them seem to be related to this issue. I just wanted to note that this might be a blocking issue for many other people, too.
Yep, most of the symptoms would be in line with the issue at hand. This PR should resolve the issue, but we haven't yet gotten around to testing and reviewing it. It is a high priority for us, though. Until that's in, you can try running this PR for bitcoind: https://github.com/bitcoin/bitcoin/pull/13315
The underlying issue is that when the interfaces were written, they assumed reliable delivery of all notifications, like btcd's RPC interface provides. However, the current zmq implementation uses a style of zmq socket that will instead start to drop notifications once it rises above a certain high-water mark.
Unfortunately, after a short time of endless happiness believing bitcoin/bitcoin#13315 fixed all my issues, lnd switched to synced_to_chain=false again and does not flip back to true. It's getting frustrating, so I'll wait for pull request 511 to be accepted.
I've resolved my previously described issue. It was a misconfiguration in bitcoin.conf: instead of enabling zmqpubrawblock, I accidentally enabled zmqpubhashblock!
This messed up lnd, because it showed the correct block hash in its logs but didn't receive the raw block from bitcoind.
Although it was completely my mistake, I suppose lnd should not mistakenly report that it received a block when it did not.
Background
I have 9 lnds sharing a single bitcoind backend on a single server. I didn't open a channel with any of them, in order to isolate this issue. I funded the first address generated from each lnd's wallet with some satoshis (which confirmed), and destroyed each lnd. Then I restored each using its seed phrase, with multiple instances re-synchronizing from bitcoind at once. Some of them (seemingly the first ones to get a connection to bitcoind) restored fine, with funds in their wallet. Others (the later ones to connect) seemed to miss their funding transactions. The address look-ahead window should have caught the funding transactions, since the addresses I sent to were the first generated addresses for each seed.
Your environment
lnd = 0.4.1-beta commit=09c997983127bb1e575d97c69aa364a82cfb1df7
uname -a (on *Nix) = Linux provingapoint 4.4.0-109-generic #132-Ubuntu SMP Tue Jan 9 19:52:39 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
btcd, bitcoind, or other backend = Bitcoin Core Daemon version v0.16.0.0-g4b4d7eb255
All running in Docker containers, which probably exacerbated the connectivity issues.
I've confirmed the wallet starts recovery without issue:
Steps to reproduce
Connect multiple lnds to a single bitcoind backend + ZMQ, and restore them from seeds with confirmed funds sent to an address derived from each seed.
Alternatively, I think if you drop some crucial packets on ZMQ you'll see similar results.
Expected behaviour
LND is aware when it's missing crucial information from its backend, and it recovers from data connectivity issues gracefully.
Actual behaviour
Connectivity issues with the bitcoind backend appear to "trick" LND into the wrong state with regard to funding, possibly missing crucial transactions.