spesmilo / electrum

Electrum Bitcoin Wallet
https://electrum.org
MIT License
7.22k stars 3.02k forks source link

Timeout error shadowed by aiorpcx cancellation bug (interface/synchronizer, wallets stuck) #8954

Closed MrNaif2018 closed 1 month ago

MrNaif2018 commented 3 months ago

There is a problem in interface's synchronizer: if we connect to a new server, we first subscribe for notifications using blockchain.scripthash.subscribe, but if server is too slow to respond (longer than default timeout of 30 seconds), then aiorpcx's timeout_after raises it's exception which inherits from CancelledError, which apparently causes async task to be not only not handled, but also cancelled and disappear, which makes synchronizer's self.requested_addrs have some requested addresses in it, but never emptied, causing wallet.is_up_to_date() return False, which will stick until interface is changed (by luck after a few hours or in case of exception which may never be raised)

Reproducer:

diff --git a/aiorpcx/session.py b/aiorpcx/session.py
index efff404..ca96560 100755
--- a/aiorpcx/session.py
+++ b/aiorpcx/session.py
@@ -509,6 +509,8 @@ class RPCSession(SessionBase):
             send_time = await self._send_message(message)
             try:
                 async with timeout_after(self.sent_request_timeout):
+                    if b"blockchain.scripthash.subscribe" in message:
+                        await asyncio.sleep(40)
                     return await future
             finally:
                 time_taken = max(0, time.time() - send_time)

Similar issue we had before: 12e628e2e6eb39709361b587a87e35f291200d16

SomberNight commented 3 months ago

This should be fixed in aiorpcx. There is more discussion in https://github.com/kyuupichan/aiorpcX/pull/48 I think https://github.com/kyuupichan/aiorpcX/commit/e1c7a227f8886972c6d18f920cd5c39462e055d3 is sufficient.

(note: I was playing around a bit in https://github.com/spesmilo/electrum/compare/master...SomberNight:electrum:202403_asyncio_waitfor, e.g. see https://github.com/sombernight/electrum/commit/b388ec3d8caddcc66d69bcd32f6a59ebacf21b16)

MrNaif2018 commented 3 months ago

By the way there are a few more timeout-related exceptions in aiorpcx which still inherit from CancelledError, they won't cause problems?

SomberNight commented 3 months ago

There is only TimeoutCancellationError, isn't there? I have tried and haven't managed to write a unit test that involved triggering something like this for that. Feel free to try.

MrNaif2018 commented 3 months ago

It looks like wallet is still not synchronized due to synchronizer.is_up_to_date() being False. It's False once again due to self.requested_addrs being non-empty. So I think what happens is: we now get a timeout, but those addresses which were already requested but failed with a timeout are still left in self.requested_addrs, and for some reason it never gets cleared. I think the Synchronizer task is working, but it's state wasn't cleared during interface change when taskgroup died

MrNaif2018 commented 3 months ago

So I found out this issue is not reproducible on python 3.12, so probably it's still some asyncio bug involved when taskgroup dies still triggering similar issues Will test on 3.11 now to see if it works in 3.11 too

MrNaif2018 commented 3 months ago

It works in 3.11 too. So indeed something similar is involved even after taskgroup has died

SomberNight commented 2 months ago

To be clear, aiorpcx 0.23.1 is sufficient to fix the reproducer from the OP, right? But you are saying your problem is not sufficiently covered by that reproducer?

MrNaif2018 commented 2 months ago

aiorpcx is sufficient to fix initial reproducer, yes I think Exception is logged correctly, but when taskgroup dies, some logic elsewhere makes it so that the wallet which was behind the died taskgroup does not recover. All other wallets who just receive stop signal recover just fine

SomberNight commented 1 month ago

Well, the example in the OP has been fixed, so I am inclined to close this. You are very welcome to open a new issue if you can somewhat detail the remaining bug or ideally create a new reproducer for it.