kellerza / sunsynk

Deye/Sunsynk Inverter Python library and Home Assistant OS Addon
https://kellerza.github.io/sunsynk/
MIT License

KeyError: 0 when using system_mode profile #34

Closed. Ivan-L closed this issue 2 years ago.

Ivan-L commented 2 years ago

Firstly, thanks again for a great add-on! The profiles feature is extremely useful!

Describe the issue/bug

On startup, the add-on (I'm currently using the dev version) refreshes the system_mode profile that I have opted into. The error below occurs, causing the add-on to crash and restart. This repeats several times, perhaps 7 or 8, until a system_mode update finally succeeds, after which all is well until the next restart. The net effect is that startup is delayed by several minutes, but the add-on does eventually come up.

I am using a USR-W630 via TCP with the umodbus driver, so that may contribute to the issue.

I will perform more experiments and post additional info as I receive it.

Expected behavior

The add-on should start up the first time without the exception being thrown.

Logs

2022-04-20 05:49:43,042 INFO    Profile System Mode action UPDATE
2022-04-20 05:49:43,042 DEBUG   PUBLISH 0 SUNSYNK/status/xxxxxxxx/system_mode, UPDATE
2022-04-20 05:49:43,129 DEBUG   Request registers: [250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261] glen=12. Response [100, 630, 830, 1530, 1800, 2048, 0, 0, 2952, 5000, 5000, 5000] len=12. regs={250: 100, 251: 630, 252: 830, 253: 1530, 254: 1800, 255: 2048, 256: 0, 257: 0, 258: 2952, 259: 5000, 260: 5000, 261: 5000}
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/sunsynk/sunsynk.py", line 52, in read_sensors
    r_r = await self.read_holding_registers(grp[0], glen)
  File "/usr/local/lib/python3.9/site-packages/sunsynk/usunsynk.py", line 43, in read_holding_registers
    return await self.client.read_holding_registers(self.server_id, start, length)
  File "/usr/local/lib/python3.9/site-packages/async_modbus/core.py", line 181, in read_holding_registers
    return await self._send_message(request)
  File "/usr/local/lib/python3.9/site-packages/async_modbus/core.py", line 142, in _send_message
    return await self.protocol._async_send_message(
  File "/usr/local/lib/python3.9/site-packages/async_modbus/core.py", line 43, in send_message_tcp
    tcp.raise_for_exception_adu(response_error_adu)
  File "/usr/local/lib/python3.9/site-packages/umodbus/client/tcp.py", line 247, in raise_for_exception_adu
    pdu_to_function_code_or_raise_error(resp_pdu)
  File "/usr/local/lib/python3.9/site-packages/umodbus/functions.py", line 118, in pdu_to_function_code_or_raise_error
    raise error_code_to_exception_map[error_code]
KeyError: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/app/./run.py", line 274, in <module>
    LOOP.run_until_complete(main(LOOP))
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/usr/src/app/./run.py", line 268, in main
    await profile_poll(SUNSYNK)
  File "/usr/src/app/profiles.py", line 179, in profile_poll
    await ss.read_sensors(profile.sensors)
  File "/usr/local/lib/python3.9/site-packages/sunsynk/sunsynk.py", line 54, in read_sensors
    raise Exception(  # pylint: disable=raise-missing-from
Exception: (1,268,12) 0
[cmd] ./run.py exited 1
kellerza commented 2 years ago

Hi @Ivan-L, this still seems related to the USR-W630 (KeyError: 0 from umodbus):

https://github.com/kellerza/sunsynk/issues/18#issuecomment-1047524344

Ivan-L commented 2 years ago

Hi @kellerza

I've been trying to debug this one and it has been difficult. You are correct, this may well be related to the USR-W630, but I do not think it is exactly the same issue as #18. I suspect the cause is either the number of registers requested in a single read, or the response payload being chunked or truncated somewhere along the way.

I can reproduce the issue using the modbus cli:

modbus -s 1 -v 10.0.1.66:8899 250  251  252  253  254  255  256  257  258  259  260  261

A successful request/response will yield:

→ < 56 47 00 00 00 06 01 03 00 fa 00 0c >
← < 56 47 00 00 00 1b 01 03 18 00 64 02 76 03 3e 05 fa 07 08 08 98 13 88 13 88 13 88 13 88 13 88 13 88 > 33 bytes

An unsuccessful request/response (which happens more often than the successful one) yields:

→ < fc de 00 00 00 06 01 03 00 fa 00 0c >
← < fc de 00 00 00 0e 01 03 18 00 64 02 76 03 3e 05 fa 07 08 08 > 20 bytes

Notice that in the unsuccessful case, the response is shorter.

Let's decode and analyse the successful request/response first:

Request:
MBAP (Transaction Id, Protocol, Length, Unit Id): (22087, 0, 6, 1)
PDU  (Function code, Start address, Quantity): (3, 250, 12)

Response:
MBAP (Transaction Id, Protocol, Length, Unit Id): (22087, 0, 27, 1)
PDU  (Function code, Byte count): (3, 24)
           Values: (100, 630, 830, 1530, 1800, 2200, 5000, 5000, 5000, 5000, 5000, 5000)

Now for the unsuccessful request/response:

Request:
MBAP (Transaction Id, Protocol, Length, Unit Id): (64734, 0, 6, 1)
PDU  (Function code, Start address, Quantity): (3, 250, 12)

Response:
MBAP (Transaction Id, Protocol, Length, Unit Id): (64734, 0, 14, 1)
PDU  (Function code, Byte count): (3, 24)
           Values: (100, 630, 830, 1530, 1800, ...truncated!)

The response PDU is truncated, but the response MBAP header correctly reflects the truncated PDU's size! In other words, the "envelope" is consistent, but the "body" inside it is cut short (the PDU's own byte-count field still claims 24 bytes). Presumably this is also where KeyError: 0 comes from: the truncated/misaligned frame ends up being parsed as an exception ADU, the byte at the error-code position is 0x00, and 0 has no entry in umodbus's error_code_to_exception_map.
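
To double-check the decoding, here is a minimal sketch (plain Python, standard library only; the helper name decode_response is mine, and the framing is standard Modbus TCP) that parses the dumps above and flags the byte-count mismatch:

import struct

def decode_response(hexdump: str) -> None:
    """Decode a Modbus TCP read-holding-registers response and flag truncation."""
    raw = bytes.fromhex(hexdump)
    # MBAP header: transaction id, protocol id, length, unit id (big-endian)
    tid, proto, length, uid = struct.unpack(">HHHB", raw[:7])
    func, byte_count = raw[7], raw[8]
    data = raw[9:]
    # Decode as many complete 16-bit register values as are actually present
    values = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data) // 2 * 2, 2)]
    print(f"MBAP: tid={tid} proto={proto} length={length} unit={uid}")
    print(f"PDU:  func={func} byte_count={byte_count} bytes_received={len(data)}")
    if byte_count != len(data):
        print(f"Truncated: byte count claims {byte_count} data bytes, got {len(data)}")
    print("Values:", values)

# The truncated response from above
decode_response("fc de 00 00 00 0e 01 03 18 00 64 02 76 03 3e 05 fa 07 08 08")

Running it reports byte_count=24 against only 11 received data bytes and recovers just the first five values.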

If I reduce the number of registers being read to 8 or fewer, instead of the 12 above, I do not get this error at all and every read succeeds. But as soon as I read 9 registers or more, the error occurs randomly, and when it does, the response PDU appears to be truncated at exactly the same place. The System Mode profile of the add-on does require all 12 registers to be read, but I think they can be read in batches. A rough reproduction sketch follows.
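
For completeness, here is the reproduction sketch using async_modbus directly. The endpoint and server id come from the CLI example above; modbus_for_url and the three-argument read_holding_registers are the async_modbus calls visible in the traceback; the attempt count is arbitrary:

import asyncio
from async_modbus import modbus_for_url

async def main() -> None:
    client = modbus_for_url("tcp://10.0.1.66:8899")
    ok = failed = 0
    for _ in range(50):  # arbitrary number of attempts
        try:
            # 12 holding registers starting at 250 on server id 1, as in the failing read
            await client.read_holding_registers(1, 250, 12)
            ok += 1
        except Exception as err:  # the KeyError: 0 surfaces here via umodbus
            failed += 1
            print("read failed:", repr(err))
    print(f"{ok} ok, {failed} failed")

asyncio.run(main())

On my setup a good fraction of the 12-register attempts fail; with 8 or fewer registers they all pass.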

I have modified the code locally to place a limit on how many registers can be read at a time.

The group_sensors function would gain a max_group_size parameter:

from typing import Generator, List, Sequence

# Sensor is the existing sunsynk sensor class

def group_sensors(
    sensors: Sequence[Sensor],
    allow_gap: int = 3,
    max_group_size: int = 60,
) -> Generator[List[int], None, None]:
    """Group sensor registers into blocks of at most max_group_size registers."""
    if not sensors:
        return
    regs = {r for s in sensors for r in s.reg_address}
    group: List[int] = []
    adr0 = 0
    for adr1 in sorted(regs):
        # Start a new block on a large address gap, or when the block is full
        if group and (adr1 - adr0 > allow_gap or len(group) >= max_group_size):
            yield group
            group = []
        adr0 = adr1
        group.append(adr1)
    if group:
        yield group

It can then be invoked with that parameter from read_sensors:

for grp in group_sensors(sensors, allow_gap=1, max_group_size=8):
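
With the 12 system_mode registers (250 to 261), allow_gap=1 and max_group_size=8 split the read into two batches, [250..257] and [258..261], both within the 8-register limit that reads reliably on my setup.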

A possible test could be:

# assuming Sensor and group_sensors are importable from the sunsynk package
def test_group_limit() -> None:
    sen = [
        Sensor(10, "10"),
        Sensor(11, "11"),
        Sensor(12, "12"),
        Sensor(13, "13"),
        Sensor(14, "14"),
        Sensor(15, "15"),
        Sensor(16, "16"),
    ]
    g = list(group_sensors(sen, max_group_size=6))
    assert g == [[10, 11, 12, 13, 14, 15], [16]]

Unfortunately, apart from the above pytest succeeding locally, I cannot seem to build the add-on with my locally modified version of the sunsynk package. The hass-addon-sunsynk-dev Dockerfile has a hardcoded dependency on version 0.1.4 of the sunsynk package. Any idea how I can test my changes locally before opening a PR?

Actually, before I even attempt a PR, would you accept such a change? Or do you think the max group size should be configurable via an option in config.yaml?

Ivan-L commented 2 years ago

I figured out how to test locally by manually applying what copy2local.cmd does.

I can confirm that limiting the add-on to reading a maximum of 8 registers at a time solves the problem: the issue has not recurred even once, and I have restarted the add-on many times.

So my question above still stands: do you think this should be a setting in config.yaml with a high default value?

kellerza commented 2 years ago

@Ivan-L thanks for looking into this. Will surely accept a PR.

An option with a high default (50?) would make sense. It should be fairly easy to add once the library supports it.

Ivan-L commented 2 years ago

Sorry for the delay @kellerza, I got sidetracked by a lot of other stuff. The PR is up.