I think maybe I don't understand the problem this is solving. In an attempt to reproduce it, I did this in connection/base.py:
async def hello(self):
-    msg = self.receptor._say_hi().serialize()
-    await self.conn.send(BridgeQueue.one(msg))
+    #msg = self.receptor._say_hi().serialize()
+    #await self.conn.send(BridgeQueue.one(msg))
+    return
I am testing this by running
receptor --node-id foo node --listen ws://127.0.0.1:1234
and then
receptor ping foo --peer ws://127.0.0.1:1234
in separate windows.
With current devel, I get the following result:
(graham-dev) [graham@fedora-dev ~]$ receptor ping foo --peer ws://127.0.0.1:1234
Connection timed out. Exiting.
With the PR, I get a different result:
(graham-dev) [graham@fedora-dev ~]$ receptor ping foo --peer ws://127.0.0.1:1234
Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7f6a95010450>
ERROR 2020-05-04 12:54:45,506 ws ws.connect
Traceback (most recent call last):
  File "/home/graham/git/receptor/receptor/connection/ws.py", line 67, in connect
    await worker.client(t)
  File "/home/graham/git/receptor/receptor/connection/base.py", line 156, in client
    await self._wait_handshake()
  File "/home/graham/git/receptor/receptor/connection/base.py", line 146, in _wait_handshake
    response = await self.buf.get(timeout=5.0)
  File "/home/graham/git/receptor/receptor/messages/framed.py", line 291, in get
    return await asyncio.wait_for(self.q.get(), timeout)
  File "/usr/lib64/python3.7/asyncio/tasks.py", line 449, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError
Connection timed out. Exiting.
I have not been able to reproduce any case where the client hangs forever. What steps are you using to produce this result?
@ghjm I think what you are seeing is that the timeout is being handled by the entrypoints::run_oneshot_command() method. I believe I can reproduce the issue with your code changes above by connecting to the misbehaving server as a regular node:
python -m receptor --debug -d /tmp/node-foo2 --node-id foo2 -d /tmp/node-foo2 node --peer ws://127.0.0.1:1234
Yeah, I didn't really like hard-coding the timeout like that, but I don't see where the configuration gets passed into the connection module. I guess I could get access to the config object via the receptor object? Is that the right approach here?
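If that is a reasonable direction, I am picturing something like the following. This is only a sketch: the config being reachable from the connection as receptor.config, and a handshake_timeout setting, are assumptions on my part and do not exist today.

DEFAULT_HANDSHAKE_TIMEOUT = 20.0

def handshake_timeout(receptor):
    # Fall back to the hardcoded default if no config (or no such setting) exists.
    config = getattr(receptor, "config", None)
    return getattr(config, "handshake_timeout", DEFAULT_HANDSHAKE_TIMEOUT)

Then _wait_handshake() could call handshake_timeout(self.receptor) instead of using a literal value.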
I'm actually still struggling to reproduce the issue using receptor's own server logic, but I found an easier way to reproduce the client hanging issue: just use netcat as the server.
Start a misbehaving server:
nc -l 1234
Start a node:
python -m receptor --debug -d /tmp/node-foo2 --node-id foo2 -d /tmp/node-foo2 node --peer rnp://127.0.0.1:1234
The client will hang indefinitely. Killing the netcat process will not cause the client to recover.
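For anyone following along, here is a minimal standalone illustration (not receptor code) of why killing netcat does not help: the handshake wait is just an await on an asyncio queue that the read task never fills, so nothing short of a put, a timeout, or a cancellation ever wakes it up.

import asyncio

async def wait_for_handshake(buf):
    print("waiting for handshake frame...")
    frame = await buf.get()   # no timeout: blocks until something calls put()
    print("got handshake:", frame)

async def main():
    buf = asyncio.Queue()
    waiter = asyncio.create_task(wait_for_handshake(buf))
    await asyncio.sleep(2)    # the peer never responds, like nc -l 1234
    print("still waiting:", not waiter.done())
    waiter.cancel()           # only a put(), a timeout, or a cancel ends the wait
    try:
        await waiter
    except asyncio.CancelledError:
        print("gave up waiting")

asyncio.run(main())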
I think hardcoding a value of 20 seconds is good enough to be getting on with. We have a lot of hardcoded values that need to be configurable, and probably also need to make significant changes to config.py in the future.
recheck
This is a potential fix for Issue #199.
I looked at adding timeout=x and read_timeout=x to the ws_client() call. These parameters do cause a timeout exception to be raised, but it's not clear to me how to propagate the timeout exception from the read_task coroutine to the _wait_handshake() method (I believe the exception would have to be placed in the buffer?). Using those parameters also appears to cause the normal socket-reading logic to receive a timeout, since that coroutine just blocks indefinitely waiting on data to be read from the socket.
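To make the "exception in the buffer" idea concrete, this is roughly what I meant; it is only an illustration, and the real FramedBuffer in receptor does not necessarily work this way.

import asyncio

class ErrorAwareBuffer:
    def __init__(self):
        self.q = asyncio.Queue()

    async def put(self, item):
        # The read task would put its TimeoutError (or any exception) here.
        await self.q.put(item)

    async def get(self, timeout=None):
        item = await asyncio.wait_for(self.q.get(), timeout)
        if isinstance(item, Exception):
            raise item   # surfaces the read task's failure to the handshake waiter
        return item

The read task's except block would put the exception into the buffer, and _wait_handshake() would then see it raised out of get() instead of blocking forever.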
The change in this PR is the most surgical approach I could come up with. Let me know if there is a better one.