theOehrly / Fast-F1

FastF1 is a python package for accessing and analyzing Formula 1 results, schedules, timing data and telemetry
https://docs.fastf1.dev
MIT License

``SignalRClient.start()`` may get stuck on keepalive timeout and closed TCP connection #628

Open theOehrly opened 3 months ago

theOehrly commented 3 months ago

Discussed in https://github.com/theOehrly/Fast-F1/discussions/627

Originally posted by **supremeremixx** August 19, 2024

Hello! I'm trying to run an endless loop with `SignalRClient`:

```python
import time
import logging

from fastf1.livetiming.client import SignalRClient

log = logging.getLogger()
log.setLevel(logging.DEBUG)


def signalr():
    client = SignalRClient(filename="output.txt", timeout=1800, debug=True)
    client.start()


while True:
    signalr()
    time.sleep(60)
```

After a random amount of time (3, 5, 11 hours, ...) I get a keepalive timeout error. The logs show that the TCP connection is closed, but `client.start()` is still blocked and never returns. Because the loop never advances, the program never creates a new client instance after the error.

```
python3[282709]: 2024-08-09 21:36:44,532 - DEBUG: ! timed out waiting for keepalive pong
python3[282709]: 2024-08-09 21:36:44,533 - DEBUG: ! failing connection with code 1011
python3[282709]: 2024-08-09 21:36:44,535 - DEBUG: = connection is CLOSING
python3[282709]: 2024-08-09 21:36:44,535 - DEBUG: > CLOSE 1011 (internal error) keepalive ping timeout [24 bytes]
python3[282709]: 2024-08-09 21:36:54,536 - DEBUG: ! timed out waiting for TCP close
python3[282709]: 2024-08-09 21:36:54,536 - DEBUG: x closing TCP connection
python3[282709]: 2024-08-09 21:37:04,539 - DEBUG: ! timed out waiting for TCP close
python3[282709]: 2024-08-09 21:37:04,539 - DEBUG: x aborting TCP connection
python3[282709]: 2024-08-09 21:37:04,540 - DEBUG: = connection is CLOSED
python3[282709]: 2024-08-09 21:59:44,619 - WARNING: Timeout - received no data for more than 1800 seconds!
```

I've tried running it with and without a timeout (in `SignalRClient`), but the function remains locked. Is it possible to somehow terminate it after receiving an error?
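
Until the bug is fixed, one possible workaround is to isolate the blocking call in a child process that can be forcibly terminated. This is only a sketch, not FastF1 API: `run_client` is a stand-in for `SignalRClient(...).start()`, the runtime limit is for illustration, and POSIX signal semantics are assumed.

```python
import multiprocessing
import time


def run_client():
    # Stand-in for SignalRClient(...).start(); in the real bug this call can
    # stay blocked forever after the websocket has already been closed.
    while True:
        time.sleep(1)


def supervise(max_runtime=1.0):
    """Run the client in a child process; kill it if it outlives max_runtime."""
    proc = multiprocessing.Process(target=run_client, daemon=True)
    proc.start()
    proc.join(timeout=max_runtime)
    if proc.is_alive():
        proc.terminate()  # forcibly end the stuck client (SIGTERM on POSIX)
        proc.join()
    return proc.exitcode  # negative exit code means killed by a signal


if __name__ == "__main__":
    print("child exit code:", supervise(max_runtime=0.5))
```

The outer `while True` loop from the report could call `supervise()` instead of `signalr()`, so a stuck client gets killed and a fresh one is started on the next iteration.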
supremeremixx commented 2 months ago

Hello @theOehrly! I did a little research, so maybe this information will help you fix this bug.

When I ran `SignalRClient` with a timeout, after the error occurred and the timeout expired, `client.start()` got stuck in the `while self._connection.started` loop in `_supervise` (`livetiming/client.py`):

    async def _supervise(self):
        self._t_last_message = time.time()
        while True:
            if (self.timeout != 0
                    and time.time() - self._t_last_message > self.timeout):
                self.logger.warning(f"Timeout - received no data for more "
                                    f"than {self.timeout} seconds!")

                self._connection.close()
                while self._connection.started:
                    await asyncio.sleep(0.1)
                return

            await asyncio.sleep(1)

Apparently, since the connection has already been closed by the error that occurred, calling `self._connection.close()` never sets the `self._connection.started` flag to `False`. For testing, I manually added `self._connection.started = False` after the `close()` call, and the `SignalRClient` shut down as expected!
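
The stuck loop is easy to reproduce in isolation. Below is a minimal sketch (not FastF1 code): `DummyConnection` mimics the relevant behavior of the signalr_aio connection, whose `close()` fails to clear `started` when the websocket has already died, and the forced-flag workaround described above is applied after a bounded wait.

```python
import asyncio
import time


class DummyConnection:
    """Mimics the failure mode: close() clears `started` on the normal path,
    but does nothing if the websocket loop already died from an error."""

    def __init__(self, already_failed):
        self.started = True
        self.already_failed = already_failed

    def close(self):
        if not self.already_failed:
            self.started = False  # normally the producer handler clears this


async def supervise(conn, max_wait=0.2):
    conn.close()
    deadline = time.time() + max_wait
    while conn.started:  # FastF1's loop has no deadline, so it can spin forever
        if time.time() > deadline:
            conn.started = False  # the workaround: force the flag
            return "forced"
        await asyncio.sleep(0.01)
    return "clean"


print(asyncio.run(supervise(DummyConnection(already_failed=False))))  # clean
print(asyncio.run(supervise(DummyConnection(already_failed=True))))   # forced
```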

As another example, you can add these lines to the received-message handler. With only `self._connection.close()` there, the logs show the connection being closed, but the function remains blocked (this effectively simulates the error case). If you also add `self._connection.started = False` after it, the function terminates.

Under normal conditions, after `self._connection.close()` is called it calls `self.__transport.close()` (`signalr_aio/transports/_transport.py`), which sends a `CloseEvent()` to the `ws_loop`:

    def close(self):
        asyncio.Task(self.invoke_queue.put(CloseEvent()), loop=self.ws_loop)

When `_producer_handler` receives this event, it tries to close the websocket connection and sets `self._connection.started = False` (essentially the same behavior as in `_supervise`):

    async def _producer_handler(self, ws):
        while True:
            try:
                event = await self.invoke_queue.get()
                if event is not None:
                    if event.type == 'INVOKE':
                        await ws.send(dumps(event.message))
                    elif event.type == 'CLOSE':
                        await ws.close()
                        while ws.open is True:
                            await asyncio.sleep(0.1)
                        else:
                            self._connection.started = False
                            break
                else:
                    break
                self.invoke_queue.task_done()
            except Exception as e:
                raise e
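
This also explains why `close()` is a no-op in the failure case: it only *queues* a `CloseEvent`, so if the task consuming the queue died together with the websocket, nobody ever processes the event or clears the flag. A minimal sketch of that queue mechanic (not the actual signalr_aio code):

```python
import asyncio


async def demo(consumer_alive):
    queue = asyncio.Queue()
    state = {"started": True}

    async def producer_handler():
        # Simplified stand-in for signalr_aio's _producer_handler
        event = await queue.get()
        if event == "CLOSE":
            state["started"] = False

    task = asyncio.create_task(producer_handler()) if consumer_alive else None
    await queue.put("CLOSE")   # roughly what Connection.close() does
    await asyncio.sleep(0.05)  # give the handler a chance to run
    if task:
        await task
    return state["started"]


print(asyncio.run(demo(consumer_alive=True)))   # False: the flag was cleared
print(asyncio.run(demo(consumer_alive=False)))  # True: event never consumed
```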

So as I understand it, this bug could be fixed by setting `self._connection.started = False` in some error handler of the websocket connection. I tried to find the right place myself but couldn't, which is why I decided to write up my observations for you.

Hope it helps!

theOehrly commented 2 months ago

@supremeremixx thank you for investigating this more. This is helpful information, for sure. Maybe I'll be able to fix it soon.

Also, it would probably be best to raise an exception here instead of simply returning from the function. After all, this is not an intended or expected way to close the connection, but rather it is some sort of connection error.
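
A sketch of what that could look like, assuming a hypothetical `SignalRTimeoutError` exception (the exception name and the bounded `shutdown_wait` parameter are made up for illustration, not FastF1 API):

```python
import asyncio
import time


class SignalRTimeoutError(RuntimeError):
    """Hypothetical exception raised when the connection times out."""


async def supervise(connection, timeout, shutdown_wait=0.5):
    # Close the connection, wait a bounded time for the producer handler to
    # clear `started`, force the flag if that never happens, then raise
    # instead of silently returning.
    connection.close()
    deadline = time.time() + shutdown_wait
    while connection.started and time.time() < deadline:
        await asyncio.sleep(0.05)
    connection.started = False  # ensure start() can unblock either way
    raise SignalRTimeoutError(
        f"Timeout - received no data for more than {timeout} seconds!")
```

A caller could then catch `SignalRTimeoutError`, dispose of the client, and create a fresh one, which is exactly what the endless loop in the original report needs.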