skupperproject / skupper-router

An application-layer router for Skupper networks
https://skupper.io
Apache License 2.0
14 stars 18 forks source link

TCPEchoServer: abort the transport on connection lost #1582

Closed gabordozsa closed 1 month ago

gabordozsa commented 1 month ago

This is to avoid potential ReseourceWarnings and write() exceptions during tearDown() of SSL tests. I hit this with system_tests_tcp_conns_terminate.py but I think it can happen in any other test which uses the echo client with SSL. Example warning :


2024-07-26 15:48:08.192808  TerminateTcpConnectionsTest ECHO_SERVER_address_default_ssl: Connection to 127.0.0.1:49886 lost, exception=None
/usr/lib/python3.10/asyncio/sslproto.py:320: ResourceWarning: unclosed transport <asyncio.sslproto._SSLProtocolTransport object at 0xffffb8eef460>
  _warn(f"unclosed transport {self!r}", ResourceWarning, source=self)
ResourceWarning: Enable tracemalloc to get the object allocation traceback
gabordozsa commented 1 month ago

I have run system_tests_tcp_conns_terminate.py more times and I managed to hit the write() exception again in the tearDown() when the wait() is called for the SSL enabled TCPEchoServers. So this PR helps to avoid the ResourceWarning messages but the write() exception still happens sporadically.

2024-07-26 22:14:57.731920  TerminateTcpConnectionsTest ECHO_SERVER_address_default_ssl: Connection to 127.0.0.1:43546 lost, exception=None                                                                                                          
2024-07-26 22:14:57.732032  TerminateTcpConnectionsTest ECHO_SERVER_address_default_ssl: Connection to 127.0.0.1:43568 lost, exception=None                                                                                                          
2024-07-26 22:14:57.732097  TerminateTcpConnectionsTest ECHO_SERVER_address_terminate Server shutdown completed
2024-07-26 22:14:57.732143  TerminateTcpConnectionsTest ECHO_SERVER_address_default_ssl Server is shutting down
2024-07-26 22:14:57.732535  TerminateTcpConnectionsTest ECHO_SERVER_address_default_ssl Server shutdown completed
2024-07-26 22:14:57.732596  TerminateTcpConnectionsTest ECHO_SERVER_address_termnate_ssl Server is shutting down
2024-07-26 22:14:57.732928  TerminateTcpConnectionsTest ECHO_SERVER_address_termnate_ssl Server shutdown completed

----------------------------------------------------------------------
Ran 4 tests in 33.230s

OK
ERROR:asyncio:Fatal error on SSL transport
protocol: <asyncio.sslproto.SSLProtocol object at 0xffffa76704f0>
transport: <_SelectorSocketTransport closing fd=49>
Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/selector_events.py", line 924, in write
    n = self._sock.send(data)
OSError: [Errno 9] Bad file descriptor

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/sslproto.py", line 690, in _process_write_backlog
    self._transport.write(chunk)
  File "/usr/lib/python3.10/asyncio/selector_events.py", line 930, in write
    self._fatal_error(exc, 'Fatal write error on socket transport')
  File "/usr/lib/python3.10/asyncio/selector_events.py", line 725, in _fatal_error
    self._force_close(exc)
  File "/usr/lib/python3.10/asyncio/selector_events.py", line 737, in _force_close
    self._loop.call_soon(self._call_connection_lost, exc)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 753, in call_soon
    self._check_closed()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 515, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
kgiusti commented 1 month ago

TL;DR - we should land this fix since there's really no better alternative.

See https://github.com/python/cpython/issues/113538

The problem is that there doesn't appear to be a way to cleanly shutdown the asyncio.Server. And by "cleanly" I specifically mean closing currently open client sockets and reclaiming resources. See the above cpython issue, I'm pretty sure that's the root of that problem.

The other thing to consider is that maybe the way we terminate the server is wrong:

https://github.com/skupperproject/skupper-router/blob/9211296e96ccee6131cec28d2313c3864a03a639/tests/TCP_echo_server.py#L223

What we're trying to implement is a way to stop the Server from another thread via calling the wait() method. I'm not sure the _cancel_server() code is the right way of doing this. I seem to recall hacking at this for awhile since I could not find a clear explaination of how this should be done on the Python docs. If there's anyone out there that actually understands the nuances of running an asyncio.Server asynchronously from the test I'm open to suggestions!

ganeshmurthy commented 1 month ago

@gabordozsa Do you have commit rights to the skupper-router github repo ? If not, I will ask @ted-ross or Andy Smith to give you commit rights.

gabordozsa commented 1 month ago

@gabordozsa Do you have commit rights to the skupper-router github repo ? If not, I will ask @ted-ross or Andy Smith to give you commit rights.

@ganeshmurthy No, I do not have write access yet.

ganeshmurthy commented 1 month ago

@gabordozsa Do you have commit rights to the skupper-router github repo ? If not, I will ask @ted-ross or Andy Smith to give you commit rights.

@ganeshmurthy No, I do not have write access yet.

I have emailed @ted-ross and Andy Smith to give you commit rights for skupper-router github repo. I will keep you posted.

ganeshmurthy commented 1 month ago

@gabordozsa I will commit this PR to main.