lightningnetwork / lnd

Lightning Network Daemon ⚡️

[bug]: Watchtower gives up to fetch block in case of timeout #8205

Closed jviki closed 3 days ago

jviki commented 11 months ago

Background


If the lnd watchtower fails to fetch a block, it appears never to try again, so it might not do its job. This would be a particular problem when using the neutrino backend. It seems to be related to this comment in the code: https://github.com/lightningnetwork/lnd/blob/d9b88fba67a91583227d1d986deaa05177585bb1/watchtower/lookout/lookout.go#L125

Failure due to a timeout (I don't yet know the exact cause):

2023-11-18 18:37:29.084 [DBG] WTWR: Fetching block for (height=817357, hash=000000000000000000017421e65bd66c9432edd46c5e8b70cd4c12b542610e93)
2023-11-18 18:37:35.087 [ERR] WTWR: Unable to fetch block for (height=c78cd, hash=000000000000000000017421e65bd66c9432edd46c5e8b70cd4c12b542610e93): did not get response before timeout

Success after lnd restart:

2023-11-18 18:53:17.040 [DBG] WTWR: Fetching block for (height=817357, hash=000000000000000000017421e65bd66c9432edd46c5e8b70cd4c12b542610e93)
2023-11-18 18:53:23.034 [DBG] WTWR: Scanning 4350 transaction in block (height=817357, hash=000000000000000000017421e65bd66c9432edd46c5e8b70cd4c12b542610e93) for breaches
2023-11-18 18:53:23.133 [DBG] WTWR: No breaches found in (height=817357, hash=000000000000000000017421e65bd66c9432edd46c5e8b70cd4c12b542610e93)

Your environment

Steps to reproduce

I have no idea how to easily reproduce this exact case at the moment. It would involve finding a situation where lnd with the watchtower starts, connects to the (probably neutrino) backend, and then the backend is disconnected.

Expected behaviour

The watchtower retries fetching the missing block, or loudly and explicitly reports that it is giving up.

Actual behaviour

It fails to obtain the wanted block due to a timeout. It is not clear whether this is a temporary or a permanent state. It is difficult to run any further diagnostics.

GeorgeTsagk commented 11 months ago

Hey @jviki, thanks for opening this issue! After failing to fetch the block for height=817357, did you observe any attempts for the next blocks, i.e. 817358?

jviki commented 11 months ago

This is a tricky question. I have been playing with the node a lot and did lots of restarts. Thus, yes, it fetched the block, of course. But because of restarts.

I have seen this situation multiple times. I'm not saying it is broken. What I am trying to say is that when watching the logs, I have no clue whether the watchtower is working or has somehow given up on its task. How long should I wait to confirm this when watching the logs? What log message am I looking for?

In the end, this is probably quite similar to PR https://github.com/lightningnetwork/lnd/issues/4447, i.e. a lack of information about the runtime state.

GeorgeTsagk commented 11 months ago

> In the end, this is probably quite similar to PR https://github.com/lightningnetwork/lnd/issues/4447, i.e. a lack of information about the runtime state.

yeah, could def use more info on WC state

> But because of restarts.

Yeap, following your link, it seems that we lack any retry logic on the WC. I believe this could be added quite easily.

anibilthare commented 9 months ago

@saubyk how do we want to support this? Does it make sense to make this configurable from the config file? We can have a default > 1, let’s say 3. If users think this is too many or too few, they can set it according to their needs.

ellemouton commented 9 months ago

@anibilthare - commented on your PR too. I think we can just make it retry indefinitely with an exponential backoff