What type of hardware are you running on? Also, what's the latency of the link? Your neutrino node disconnected since btcd didn't reply to its queries in time. This doesn't have anything to do with that TODO section you linked above; this is more generic stall detection to ensure we don't waste time waiting on a peer that'll never reply. My current hunch is that your node either rate limited the neutrino node, or it ended up banning it.
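If banning turns out to be the cause, one quick way to test that hypothesis is to relax or disable banning on the btcd side. A sketch using btcd's stock config options (adjust for your setup):

```
; btcd.conf: rule out banning as the cause of the disconnects.
; nobanning disables banning of misbehaving peers entirely.
nobanning=1
; Alternatively, raise the ban score a peer must accumulate before
; being banned (the default threshold is 100):
; banthreshold=500
```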
I am running on a mobile Samsung Galaxy S7 with a download speed of 20 Mbps. On my MacBook Pro with the same WiFi I couldn't reproduce it, so I think the hardware is a factor here; mobile I/O and CPU speeds are sometimes 10x slower than my desktop's. I am running against a btcd node that has no rate limit, at least I didn't configure anything like that. To be sure I also tested against the btcd-mainnet.lightning.computer btcd node and I was able to reproduce it. Only increasing the QueryBatchTimeout or decreasing the batch size fixed the problem.
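For reference, a minimal sketch of the workaround, assuming QueryBatchTimeout is still the exported, mutable package-level variable in neutrino's query.go (its 30s default matches the value mentioned below):

```go
package main

import (
	"time"

	"github.com/lightninglabs/neutrino"
)

func main() {
	// Assumption: QueryBatchTimeout is the exported package-level
	// default (30s) used for batched filter queries. Raise it before
	// the ChainService is created so slow hardware gets more time
	// before a batch is declared failed and retried from scratch.
	neutrino.QueryBatchTimeout = 100 * time.Second

	// ... construct the neutrino.ChainService and start syncing as usual.
}
```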
Is this on neutrino master? Some fixes in this area recently got merged.
You could also try using this patch to see if it helps: https://github.com/lightninglabs/neutrino/pull/162
@halseth yes, on master. I will try this patch.
@halseth I tried the patch and it didn't fix the issue.
I went over the patch and saw that the fix was mostly around queryChainServiceBatch. While that method is called for fetching cfheaders, the filters themselves are fetched by calling queryChainServicePeers, which explains why the patch didn't help.
I am now able to reproduce this issue relatively easily by making sure the wallet is synced 1000 blocks behind the tip and reducing the QueryBatchTimeout to a value that prevents the 1000-item batch from completing in time.
I see that in this case there is a timeout in neutrino, causing a one-time retry of the whole batch, which by itself triggers the timeout again, along with a log message showing that not all filters were retrieved:
Query failed with 719 out of 1000 filters received
Closing the query on the client still doesn't prevent btcd from attempting to finish sending the whole batch. This causes increasing load on both btcd and the client, as with each retry the messages are re-sent again and again until btcd complains about stall behavior. In addition, the increasing load causes other queries needed during the rescan (getdata, for example) to hit the timeout, causing the rescan to exit with an error.
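One illustrative direction (a sketch only, with hypothetical names, not neutrino's actual internals): have the per-query retry loop observe the query's quit channel, so a query closed by the client stops re-broadcasting instead of loading the peer until the stall detection trips:

```go
package query

import "time"

// sendQuery is a hypothetical stand-in for re-broadcasting a query
// message to the connected peer.
func sendQuery(msg interface{}) {
	_ = msg // placeholder: a real implementation would write to the peer
}

// retryQuery re-sends a query until it is answered, but returns
// immediately once the caller cancels via quit. That is the property
// missing above: a closed query must not keep re-sending messages.
func retryQuery(msg interface{}, quit, answered <-chan struct{}) {
	retryTicker := time.NewTicker(2 * time.Second)
	defer retryTicker.Stop()

	for {
		select {
		case <-quit:
			// The client gave up on the query: stop re-sending.
			return
		case <-answered:
			// The peer replied: nothing left to do.
			return
		case <-retryTicker.C:
			sendQuery(msg)
		}
	}
}
```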
In my view there are two steps that can be taken here:
LMK what you think; I can help here in shaping the solution, testing, etc.
I initially reproduced the problem when syncing a node that had been offline for 2 months and needed to fetch several thousand filters. The problem started to show up after the change to query compact filters in batches (1000 each).
At some point I get this error:
"Peer 94.23.154.171:8333 (outbound) appears to be stalled or misbehaving, block timeout -- disconnecting"
After disconnecting, neutrino makes no further attempt to connect and stops functioning. There is a suggestion on how to recover from this state here: https://github.com/lightninglabs/neutrino/issues/138
I have noticed in the logs that some queries (like getdata to fetch a block) weren't answered in time. I raised the QueryBatchTimeout from 30 seconds to 100 seconds and was able to sync, but that was just to understand the problem. I don't think there is a single right number for this timeout, as we don't really know the batch size.
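Since no fixed value fits every batch, one option would be to derive the batch timeout from the number of items requested. A rough sketch (batchTimeout is a hypothetical helper, not an existing neutrino API, and the constants are illustrative guesses):

```go
package query

import "time"

// batchTimeout scales the deadline with the batch size instead of
// using one fixed QueryBatchTimeout, bounded so a huge batch still
// can't wait forever.
func batchTimeout(numItems int) time.Duration {
	const (
		base    = 10 * time.Second       // fixed per-batch overhead
		perItem = 100 * time.Millisecond // budget per requested filter
		maxWait = 10 * time.Minute       // hard upper bound
	)

	timeout := base + time.Duration(numItems)*perItem
	if timeout > maxWait {
		timeout = maxWait
	}
	return timeout
}
```

For a 1000-filter batch this yields 110 seconds, roughly in line with the 100-second value that let me sync above.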
The godoc comment for this field is:
It could be that the last line of it is the solution to this problem.