lightninglabs / neutrino

Privacy-Preserving Bitcoin Light Client
MIT License

Failed to complete long rescans. #164

Closed: roeierez closed this issue 5 years ago

roeierez commented 5 years ago

I initially reproduced the problem when syncing a node that had been offline for 2 months and needed to fetch several thousand filters. The problem started to show up after the change to query compact filters in batches (1000 each).

At some point I get this error:

Peer 94.23.154.171:8333 (outbound) appears to be stalled or misbehaving, block timeout -- disconnecting

After this disconnect, neutrino doesn't attempt to reconnect and stops functioning. There is a suggestion on how to recover from this state here: https://github.com/lightninglabs/neutrino/issues/138.

I have noticed in the logs that some queries (like getdata to fetch a block) weren't answered in time. I raised the QueryBatchTimeout from 30 seconds to 100 seconds and was able to sync, but that was just to understand the problem. I don't think there is a single right value for this timeout, since we don't really know the batch size in advance.
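For reference, here is a minimal sketch of that workaround, assuming QueryBatchTimeout is the exported package-level knob the godoc quoted below refers to; if it lives elsewhere in your neutrino version, set the corresponding field instead:

```go
package main

import (
	"time"

	"github.com/lightninglabs/neutrino"
)

func main() {
	// Assumption: QueryBatchTimeout is the package-level variable described
	// by the godoc quoted below. Widening it from the 30s default to 100s
	// was enough to let the long rescan finish in my tests.
	neutrino.QueryBatchTimeout = 100 * time.Second

	// ... build neutrino.Config and start the ChainService as usual.
}
```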

The godoc comment for this field is:

```go
// QueryBatchTimout is the total time we'll wait for a batch fetch
// query to complete.
// TODO(halseth): instead use timeout since last received response?
```

The TODO in that last line might be the solution to this problem.
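To make that concrete, here is a minimal, generic sketch of what "timeout since last received response" could look like (illustrative Go only, not neutrino's actual internals): the deadline is pushed forward every time a response arrives, so a slow but progressing batch isn't cut off.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// stallTimeout is an illustrative inactivity limit, reset on every response.
const stallTimeout = 30 * time.Second

// errBatchStalled is returned when no response at all arrives within
// stallTimeout.
var errBatchStalled = errors.New("batch query stalled")

// collectBatch gathers up to `want` responses, timing out only if the peer
// goes silent for stallTimeout, rather than after one fixed batch-wide
// deadline.
func collectBatch(responses <-chan []byte, want int) ([][]byte, error) {
	var got [][]byte
	timer := time.NewTimer(stallTimeout)
	defer timer.Stop()

	for len(got) < want {
		select {
		case r, ok := <-responses:
			if !ok {
				// Treat a closed channel as the peer giving up.
				return got, errBatchStalled
			}
			got = append(got, r)

			// Progress was made, so push the deadline forward
			// instead of counting against a single batch budget.
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(stallTimeout)

		case <-timer.C:
			return got, errBatchStalled
		}
	}
	return got, nil
}

func main() {
	responses := make(chan []byte, 2)
	responses <- []byte("filter 1000")
	responses <- []byte("filter 1001")
	close(responses) // simulate a peer that stops answering early

	got, err := collectBatch(responses, 3)
	fmt.Printf("received %d of 3 responses, err=%v\n", len(got), err)
}
```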

Roasbeef commented 5 years ago

What type of hardware are you running on? Also, what's the latency of the link? Your neutrino node disconnected because btcd didn't reply to its queries in time. This doesn't have anything to do with the TODO you quoted above; this is more generic stall detection to ensure we don't waste time waiting on a peer that'll never reply. My current hunch is that your node either rate limited the neutrino node, or ended up banning it.

roeierez commented 5 years ago

I am running on a mobile Samsung Galaxy S7 with a download speed of 20 Mbps. On my MacBook Pro with the same wifi I couldn't reproduce it, so I think the hardware is a factor here; mobile I/O and CPU speeds are sometimes 10x slower than my desktop. I am running against a btcd that has no rate limit, at least I didn't configure anything like that. To be sure, I also tested against the btcd at btcd-mainnet.lightning.computer and was able to reproduce it. Only increasing the QueryBatchTimeout or decreasing the batch size fixed the problem.

halseth commented 5 years ago

Is this on neutrino master? Some fixes in this area recently got merged.

halseth commented 5 years ago

You could also try using this patch to see if it helps: https://github.com/lightninglabs/neutrino/pull/162

roeierez commented 5 years ago

@halseth yes on master. I will try this patch.

roeierez commented 5 years ago

@halseth I tried the patch and it didn't fix the issue. I went over the patch and saw that the fix is mostly around queryChainServiceBatch. While that method is called for fetching cfheaders, the filters themselves are fetched by calling queryChainServicePeers, which explains why the patch didn't help. I am now able to reproduce this issue relatively easily by making sure the wallet is synced 1000 blocks behind the tip and reducing the QueryBatchTimeout to a value that prevents the 1000-item batch from completing in that time. In this case there is a timeout in neutrino, causing a one-time retry of the whole batch, which itself triggers the timeout again, along with a log message showing that not all filters were retrieved:

Query failed with 719 out of 1000 filters received

Closing the query on the client side still doesn't prevent btcd from attempting to finish sending the whole batch. This puts increasing load on btcd and the client, as with each retry the messages are re-sent again and again until btcd complains about stall behavior. On top of that, the increased load causes other queries needed during the rescan (getdata, for example) to hit their timeouts, causing the rescan to exit with an error.

In my view there are two steps that can be taken here:

  1. I think the key is to avoid putting ever-increasing load on btcd. The timeout logic is currently a strict "X" seconds for the whole batch, and that will never be bulletproof for every client environment. Adapting the timeout as we go, e.g. resetting it with each batch message received, would prevent the issue.
  2. Retrying a batch can be made a lot more efficient by asking only for the range that is missing, not the whole batch (see the sketch below).
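
As a rough illustration of point 2, here is a small sketch (names and types are mine, not neutrino's actual internals) of tracking which filter heights actually arrived and computing only the gaps to re-request:

```go
package main

import "fmt"

// missingRanges returns the contiguous [start, end] height ranges within
// [from, to] for which no filter was received, so a retry can ask for just
// those sub-ranges instead of the whole original batch.
func missingRanges(from, to uint32, received map[uint32]bool) [][2]uint32 {
	var gaps [][2]uint32
	var inGap bool
	var gapStart uint32

	for h := from; h <= to; h++ {
		switch {
		case !received[h] && !inGap:
			inGap, gapStart = true, h
		case received[h] && inGap:
			gaps = append(gaps, [2]uint32{gapStart, h - 1})
			inGap = false
		}
	}
	if inGap {
		gaps = append(gaps, [2]uint32{gapStart, to})
	}
	return gaps
}

func main() {
	// Example: heights 1000-1004 were requested, but 1002 and 1003 never
	// arrived before the timeout.
	received := map[uint32]bool{1000: true, 1001: true, 1004: true}
	for _, g := range missingRanges(1000, 1004, received) {
		// A retry would issue a new filter query for just this
		// sub-range rather than re-sending the whole batch.
		fmt.Printf("re-request filters for heights %d-%d\n", g[0], g[1])
	}
}
```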

LMK what you think. I can help here in shaping the solution, testing, etc.