project-receptor / python-receptor

Project Receptor is a flexible multi-service relayer with remote execution and orchestration capabilities linking controllers with executors across a mesh of nodes.

Add a timeout for waiting on the HI response from the peer node. #217

Closed dehort closed 4 years ago

dehort commented 4 years ago

This is a potential fix for Issue #199.

I looked at adding timeout=x and read_timeout=x to the ws_client() call. These parameters do cause a timeout exception to be raised, but it's not clear to me how to propagate the Timeout exception from the read_task coroutine to the _wait_handshake() method (I believe the exception would have to be placed in the buffer?). Using those parameters also appears to interfere with the normal socket reading logic, as the coroutine just blocks indefinitely waiting on data to be read from the socket.
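For illustration, here is a minimal sketch of the "place the exception in the buffer" idea: the reader task pushes the exception object into the shared queue, and the waiting side re-raises it. The Buffer class and names here are hypothetical, not receptor's actual FramedBuffer API.

```python
import asyncio


class Buffer:
    """Toy message buffer: a reader task can push an exception into the
    queue, and get() re-raises it on the waiting side."""

    def __init__(self):
        self.q = asyncio.Queue()

    async def put(self, item):
        await self.q.put(item)

    async def get(self, timeout=None):
        item = await asyncio.wait_for(self.q.get(), timeout)
        if isinstance(item, Exception):
            raise item  # propagate the reader's failure to the waiter
        return item


async def main():
    buf = Buffer()

    async def read_task():
        # Simulate the socket read timing out on the reader side.
        await buf.put(asyncio.TimeoutError("no HI received"))

    asyncio.create_task(read_task())
    try:
        await buf.get(timeout=5.0)
    except asyncio.TimeoutError as e:
        print("handshake failed:", e)


asyncio.run(main())
```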

This is the most surgical approach I could come up with. Let me know if there is a better one.

ghjm commented 4 years ago

I think maybe I don't understand the problem this is solving. In an attempt to reproduce it, I did this in connection/base.py:

     async def hello(self):
-        msg = self.receptor._say_hi().serialize()
-        await self.conn.send(BridgeQueue.one(msg))
+        #msg = self.receptor._say_hi().serialize()
+        #await self.conn.send(BridgeQueue.one(msg))
+        return

I am testing this by running

receptor --node-id foo node --listen ws://127.0.0.1:1234

and then

receptor ping foo --peer ws://127.0.0.1:1234

in separate windows.

With current devel, I get the following result:

(graham-dev) [graham@fedora-dev ~]$ receptor ping foo --peer ws://127.0.0.1:1234
Connection timed out. Exiting.

With the PR, I get a different result:

(graham-dev) [graham@fedora-dev ~]$ receptor ping foo --peer ws://127.0.0.1:1234
Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7f6a95010450>
ERROR 2020-05-04 12:54:45,506  ws ws.connect
Traceback (most recent call last):
  File "/home/graham/git/receptor/receptor/connection/ws.py", line 67, in connect
    await worker.client(t)
  File "/home/graham/git/receptor/receptor/connection/base.py", line 156, in client
    await self._wait_handshake()
  File "/home/graham/git/receptor/receptor/connection/base.py", line 146, in _wait_handshake
    response = await self.buf.get(timeout=5.0)
  File "/home/graham/git/receptor/receptor/messages/framed.py", line 291, in get
    return await asyncio.wait_for(self.q.get(), timeout)
  File "/usr/lib64/python3.7/asyncio/tasks.py", line 449, in wait_for
    raise futures.TimeoutError()
concurrent.futures._base.TimeoutError
Connection timed out. Exiting.
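The TimeoutError in that trace comes from wrapping the buffer read in asyncio.wait_for. The mechanism can be shown in isolation with a queue that never receives anything, like a peer that never says HI:

```python
import asyncio


async def main():
    q = asyncio.Queue()  # nothing is ever put here
    try:
        # wait_for cancels the pending get() and raises after the timeout
        await asyncio.wait_for(q.get(), timeout=0.5)
    except asyncio.TimeoutError:
        print("Connection timed out. Exiting.")


asyncio.run(main())
```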

I have not been able to reproduce any case where the client hangs forever. What steps are you using to produce this result?

dehort commented 4 years ago

@ghjm I think what you are seeing is that the timeout is getting handled by the entrypoints::run_oneshot_command() method. I believe I can reproduce the issue using your code changes above by connecting to the misbehaving server as a regular node.

python -m receptor --debug -d /tmp/node-foo2 --node-id foo2 -d /tmp/node-foo2 node  --peer ws://127.0.0.1:1234

dehort commented 4 years ago

Yeah, I didn't really like hardcoding the timeout like that, but I don't see where the configuration gets passed into the connection module. I guess I could get access to the config object via the receptor object? Is that the right approach here?

I'm actually still struggling to reproduce the issue using receptor's own server logic, but I found an easier way to reproduce the client hang: just use netcat as the server.

Start a misbehaving server:

nc -l 1234

Start a node:

python -m receptor --debug -d /tmp/node-foo2 --node-id foo2 -d /tmp/node-foo2 node  --peer rnp://127.0.0.1:1234

The client will hang indefinitely. Killing the netcat process will not cause the client to recover.
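The hang can be reproduced outside receptor with plain asyncio: a server that accepts the connection but never writes (like netcat) leaves an unguarded read blocked forever, while asyncio.wait_for bounds it. This is a standalone sketch of the failure mode, not receptor's code:

```python
import asyncio


async def silent_server(reader, writer):
    # Like `nc -l 1234`: accept the connection but never send anything.
    await asyncio.sleep(3600)


async def main():
    # Bind to an ephemeral port so the example doesn't depend on 1234 being free.
    server = await asyncio.start_server(silent_server, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    try:
        # Without the wait_for wrapper this read would block indefinitely;
        # with it, the client recovers after 0.5s.
        await asyncio.wait_for(reader.read(1), timeout=0.5)
    except asyncio.TimeoutError:
        print("peer never sent HI; giving up")

    writer.close()
    server.close()
    await server.wait_closed()


asyncio.run(main())
```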

ghjm commented 4 years ago

I think hardcoding a value of 20 seconds is good enough to be getting on with. We have a lot of hardcoded values that need to be configurable, and probably also need to make significant changes to config.py in the future.

ghjm commented 4 years ago

recheck