hoggyhoggy / givenergy-modbus-async

A python library to access GivEnergy inverters via Modbus TCP on a local network, with no dependency on the GivEnergy Cloud.
Other
1 stars 2 forks source link

lack of robustness in client socket i/o #24

Open divenal opened 2 months ago

divenal commented 2 months ago

We may be doing the async i/o slightly wrong, because when there's a socket error, it seems to just hang, rather than resume. Background: my rpi has both wifi and ethernet plugged int a powerline adapter. The latter has higher priority when it's on, but we usually have it turned off.

When I started my long-running script, the ethernet happened to be on, so it used that interface. But later it got turned off. The script appears to have frozen - probably turned it off around 10pm, but at 8am next morning, it was still jammed. When I hit ^c to interupt it, it revealed the pending exception. The full traceback is below (I think I had to press ^c a few times to actually get it to exit), but the salient but it seems to be:

  File "/usr/local/lib/python3.11/asyncio/selector_events.py", line 999, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TimeoutError: [Errno 110] Connection timed out
ERROR:asyncio:Task exception was never retrieved

so it seems that when exceptions ocurr, the coroutine just seems to sit there? I guess the network writer task was blocked in drain(), trying to send the data, and my "foreground" task, doing a refresh_plant() didn't timeout. (It can timeout waiting for response, but perhaps it waits forever for the send ? I'll need to check.)

I'll do some investigation.

   0   0.0   1.2    1560 1560.0 1568.0    1438  1739 1739.0   11 True -42
   0   0.0   1.2    1560 1560.0 1567.6    1438  1739 1739.0   11 True -43
   0   0.0   1.1    1560 1560.0 1567.2    1438  1739 1739.0   11 True -44
^CTraceback (most recent call last):
  File "/usr/local/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/tmp/./monitor2.py", line 61, in monitor
    await client.refresh_plant(False, max_batteries = 0)
  File "/extra/drd/venv/testge2/lib/python3.11/site-packages/givenergy_modbus/client/client.py", line 145, in refresh_plant
^C  File "/extra/drd/venv/testge2/lib/python3.11/site-packages/givenergy_modbus/client/client.py", line 309, in send_request_and_await_response
    await self.tx_queue.put((raw_frame, frame_sent))
  File "/usr/local/lib/python3.11/asyncio/queues.py", line 120, in put
    await putter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/./monitor2.py", line 154, in <module>
    asyncio.run(monitor())
  File "/usr/local/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/runners.py", line 123, in run
    raise KeyboardInterrupt()
KeyboardInterrupt
ERROR:asyncio:Task exception was never retrieved
future: <Task finished name='network_consumer' coro=<Client._task_network_consumer() done, defined at /extra/drd/venv/testge2/lib/python3.11/site-packages/givenergy_modbus/client/client.py:177> exception=TimeoutError(110, 'Connection timed out')>
Traceback (most recent call last):
  File "/extra/drd/venv/testge2/lib/python3.11/site-packages/givenergy_modbus/client/client.py", line 236, in _task_network_producer
    await self.writer.drain()
  File "/usr/local/lib/python3.11/asyncio/streams.py", line 380, in drain
    raise exc
  File "/extra/drd/venv/testge2/lib/python3.11/site-packages/givenergy_modbus/client/client.py", line 189, in _task_network_consumer
    frame = await self.reader.read(300)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/streams.py", line 711, in read
    await self._wait_for_data('read')
  File "/usr/local/lib/python3.11/asyncio/streams.py", line 543, in _wait_for_data  
    await self._waiter
  File "/usr/local/lib/python3.11/asyncio/selector_events.py", line 999, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TimeoutError: [Errno 110] Connection timed out
ERROR:asyncio:Task exception was never retrieved
future: <Task finished name='network_producer' coro=<Client._task_network_producer() done, defined at /extra/drd/venv/testge2/lib/python3.11/site-packages/givenergy_modbus/client/client.py:231> exception=TimeoutError(110, 'Connection timed out')>
Traceback (most recent call last):
  File "/extra/drd/venv/testge2/lib/python3.11/site-packages/givenergy_modbus/client/client.py", line 236, in _task_network_producer
    await self.writer.drain()
  File "/usr/local/lib/python3.11/asyncio/streams.py", line 380, in drain
    raise exc
  File "/extra/drd/venv/testge2/lib/python3.11/site-packages/givenergy_modbus/client/client.py", line 189, in _task_network_consumer
    frame = await self.reader.read(300)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/streams.py", line 711, in read
    await self._wait_for_data('read')
  File "/usr/local/lib/python3.11/asyncio/streams.py", line 543, in _wait_for_data  
    await self._waiter
  File "/usr/local/lib/python3.11/asyncio/selector_events.py", line 999, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TimeoutError: [Errno 110] Connection timed out
^C
hoggyhoggy commented 2 months ago

Right ok, so that's news. Definitely need to sort that one out. Let me see if I can recreate similar here.

divenal commented 2 months ago

I've been pondering more generally... should the i/o "threads" try to reconnect if there's a disconnect, or should that just get reported back to the application code. One argument is that since the application does an explicit connect(), it should be informed about socket problems, and it's the application's problem to re-establish the connection. But you could also say that the connect() call is just telling the client that you're ready to go into a connected state, and would like the i/o threads to maintain that for you.

There's always an issue that if there's an underlying network problem (eg router has got turned off), you need some mechanism to throttle how often you try to connect. The application really ought to be aware if it's disconnected from the inverter for any significant period of time. (Which might support that the argument that the application should be made responsible for reconnecting as required.)

divenal commented 2 months ago

Doh! Not being the brightest, I managed to do exactly the same thing again. But because I'm now logging to a file, I have a little more info...

2024-04-23 21:38:57:     0    0.0   33.0   1131 1131.1 1140.3     859  1269 1269.6  11 1 461
2024-04-23 21:43:58:     0    0.0   31.4   1131 1131.0 1139.8     858  1268 1268.4  11 1 462
2024-04-23 21:48:58:     0    0.0   29.8   1126 1127.3 1139.2     990  1266 1266.6  11 1 463
2024-04-23 21:53:58:     0    0.0   28.3   1122 1123.3 1138.4     968  1265 1265.4  11 1 464
2024-04-23 21:58:58:     0    0.0   26.9   1124 1123.8 1137.7     981  1263 1263.6  11 1 465
2024-04-23 22:03:59:     0    0.0   25.6   1124 1124.0 1137.0     981  1263 1263.1  11 1 466
2024-04-23 22:09:00:     0    0.0   24.3   1124 1124.0 1136.3     981  1263 1263.0  11 1 467
...
2024-04-24 00:02:35:     0    0.0    7.9   1124 1124.0 1128.0     981  1263 1263.0  11 1 489
2024-04-24 00:07:56:     0    0.0    7.5   1124 1124.0 1127.8     981  1263 1263.0  11 1 490

The numbers aren't too important - 3 columns of solar, 3 columns of AC generation, then export and battery. I was doing a timed discharge at around 1200W until 22:30, so AC generation was around 1130 and export was around 900W. I probably turned the ethernet off around 10pm, and the lines after 22:03:59 have stopped changing - the updates had presumably stopped arriving. But it continued to log the stale numbers until 00:07:56. I'm guessing this is the point that the send task jammed up, and so the application thread stopped because it blocks waiting for the requests to actually get sent out.

Then ^C on the script this morning, I get similar backtraces:

The application task was waiting for the refresh request to get sent out:

  File "/tmp/./monitor3.py", line 58, in monitor
    await client.refresh_plant(False, max_batteries = 0)
  File "/extra/drd/venv/testge2/lib/python3.11/site-packages/givenergy_modbus/client/client.py", line 145, in refresh_plant
^C  File "/extra/drd/venv/testge2/lib/python3.11/site-packages/givenergy_modbus/client/client.py", line 309, in send_request_and_await_response
    await self.tx_queue.put((raw_frame, frame_sent))
  File "/usr/local/lib/python3.11/asyncio/queues.py", line 120, in put
    await putter
asyncio.exceptions.CancelledError

I'm guessing the network producer was stuck:

future: <Task finished name='network_producer' coro=<Client._task_network_producer() done, defined at /extra/drd/venv/testge2/lib/python3.11/site-packages/givenergy_modbus/client/client.py:231> exception=TimeoutError(110, 'Connection timed out')>
Traceback (most recent call last):
  File "/extra/drd/venv/testge2/lib/python3.11/site-packages/givenergy_modbus/client/client.py", line 236, in _task_network_producer
    await self.writer.drain()

And the consumer timed out as before:

Task exception was never retrieved
future: <Task finished name='network_consumer' coro=<Client._task_network_consumer() done, defined at /extra/drd/venv/testge2/lib/python3.11/site-packages/givenergy_modbus/client/client.py:177> exception=TimeoutError(110, 'Connection timed out')>
Traceback (most recent call last):
  File "/extra/drd/venv/testge2/lib/python3.11/site-packages/givenergy_modbus/client/client.py", line 236, in _task_network_producer
    await self.writer.drain()
  File "/usr/local/lib/python3.11/asyncio/streams.py", line 380, in drain
    raise exc
  File "/extra/drd/venv/testge2/lib/python3.11/site-packages/givenergy_modbus/client/client.py", line 189, in _task_network_consumer
    frame = await self.reader.read(300)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/streams.py", line 711, in read
    await self._wait_for_data('read')
  File "/usr/local/lib/python3.11/asyncio/streams.py", line 543, in _wait_for_data
    await self._waiter
  File "/usr/local/lib/python3.11/asyncio/selector_events.py", line 999, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TimeoutError: [Errno 110] Connection timed out
Task exception was never retrieved

I'm actually wondering if producer task is really needed. The requests get added to the write queue, but then the task is blocked until the producer actually writes it out. So arguably the application task may as well send it itself. The producer adds two effects - enforces a delay between writes (which wouldn't be hard to implement a different way), and it sends out the heartbeat reply (which is requested by the reader thread without blocking waiting for it to get sent). Slightly harder to avoid that, since we want to avoid blocking the reader thread. But it could just write it to the socket without doing an explicit drain() - then it can just sit in the i/o buffers.

divenal commented 1 month ago

A related issue... having just discovered the simple http server on the dongle, I see I can reset the dongle from there. And similar sort of issue - the active givenergy_modbus connection didn't immediately appear to notice. I got a couple of read timeouts. so I interrupted, and got

  File "/usr/local/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/tmp/./monitor4.py", line 187, in monitor
    await asyncio.sleep(delay)
  File "/usr/local/lib/python3.11/asyncio/tasks.py", line 649, in sleep
    return await future
           ^^^^^^^^^^^^
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/./monitor4.py", line 204, in <module>
    asyncio.run(monitor())
  File "/usr/local/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/asyncio/runners.py", line 123, in run
    raise KeyboardInterrupt()
KeyboardInterrupt
Task exception was never retrieved
future: <Task finished name='network_consumer' coro=<Client._task_network_consumer() done, defined at /extra/drd/venv/testge2/lib/python3.11/site-packages/givenergy_modbus/client/client.py:177> exception=ConnectionResetError(104, 'Connection reset by peer')>
Traceback (most recent call last):
  File "/extra/drd/venv/testge2/lib/python3.11/site-packages/givenergy_modbus/client/client.py", line 244, in _task_network_producer
    await self.close()
  File "/extra/drd/venv/testge2/lib/python3.11/site-packages/givenergy_modbus/client/client.py", line 113, in close
    await self.writer.wait_closed()
  File "/usr/local/lib/python3.11/asyncio/streams.py", line 364, in wait_closed
    await self._protocol._get_close_waiter(self)
  File "/usr/local/lib/python3.11/asyncio/selector_events.py", line 999, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ConnectionResetError: [Errno 104] Connection reset by peer
Task exception was never retrieved
future: <Task finished name='network_producer' coro=<Client._task_network_producer() done, defined at /extra/drd/venv/testge2/lib/python3.11/site-packages/givenergy_modbus/client/client.py:231> exception=ConnectionResetError(104, 'Connection reset by peer')>
Traceback (most recent call last):
  File "/extra/drd/venv/testge2/lib/python3.11/site-packages/givenergy_modbus/client/client.py", line 244, in _task_network_producer
    await self.close()
  File "/extra/drd/venv/testge2/lib/python3.11/site-packages/givenergy_modbus/client/client.py", line 113, in close
    await self.writer.wait_closed()
  File "/usr/local/lib/python3.11/asyncio/streams.py", line 364, in wait_closed
    await self._protocol._get_close_waiter(self)
  File "/usr/local/lib/python3.11/asyncio/selector_events.py", line 999, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ConnectionResetError: [Errno 104] Connection reset by peer

(But other than these issues, I've been quite impressed by the stability of the connection - script just runs and runs most of the time.)