home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
71.99k stars 30.17k forks source link

Modbus integration sometimes reports excessive values #119649

Closed Gnarfoz closed 2 months ago

Gnarfoz commented 3 months ago

The problem

A couple of times a day I get a wildly excessive value for a (mostly?) random Modbus sensor. Excessive meaning values being in the millions or billions for things that are supposed to be in the thousands or tens of thousands. Things like power usage in Gigawatts. Negative and positive values are affected.

At first I thought it was a one-off thing, then I blamed badly shielded network cables. It keeps happening now, and I've changed cables as well. I've also come across reports from other people, so it's unlikely to be caused by something specific to my setup. Firmware on the involved devices is the same as before this behavior started appearing.

The debug log I've gathered over 24 hours includes it happening twice to the same sensor and once to another sensor. For context purposes, I've included 2 transactions before and after each occurrence. For comparison purposes, I've included a "good" and a "bad" transaction for each occurrence. It is obvious that for the bad transactions, the response is quite a bit longer than for the good transactions.

image

What version of Home Assistant Core has the issue?

core-2024.6.2

What was the last working version of Home Assistant Core?

March or April 2024

What type of installation are you running?

Home Assistant Container

Integration causing the issue

Modbus

Link to integration documentation on our website

https://www.home-assistant.io/integrations/modbus/

Diagnostics information

debug.log

Example YAML snippet

modbus:
  - name: SungrowSHx
    type: tcp
    host: !secret sungrow_modbus_host_ip
    port: !secret sungrow_modbus_port
    sensors:
      - name: Meter active power raw
        unique_id: sg_meter_active_power_raw
        device_address: !secret sungrow_modbus_slave
        address: 5600 # reg 5601
        input_type: input
        data_type: int32
        swap: word
        precision: 0
        unit_of_measurement: W
        device_class: power
        state_class: measurement
        scale: 1
        scan_interval: 10

      - name: Reactive power
        unique_id: sg_reactive_power
        device_address: !secret sungrow_modbus_slave
        address: 5032 # reg 5033
        input_type: input
        data_type: int32
        swap: word
        precision: 0
        unit_of_measurement: W
        device_class: power
        state_class: measurement
        scale: 1
        scan_interval: 10

Anything in the logs that might be useful for us?

2024-06-12 00:57:59.361 DEBUG (MainThread) [pymodbus.logging] send: 0x41 0xbc 0x0 0x0 0x0 0x6 0x1 0x4 0x15 0xe0 0x0 0x2
2024-06-12 00:57:59.363 DEBUG (MainThread) [pymodbus.logging] recv: 0x41 0xbc 0x0 0x0 0x0 0x7 0x1 0x4 0x4 0x0 0xe 0x41 0xbc 0x0 0x0 0x0 0x7 0x1 0x4 0x4 0x0 0xe 0x0 0x0 old_data:  addr=None
2024-06-12 00:57:59.363 DEBUG (MainThread) [pymodbus.logging] Processing: 0x41 0xbc 0x0 0x0 0x0 0x7 0x1 0x4 0x4 0x0 0xe 0x41 0xbc 0x0 0x0 0x0 0x7 0x1 0x4 0x4 0x0 0xe 0x0 0x0
2024-06-12 00:57:59.363 DEBUG (MainThread) [pymodbus.logging] Factory Response[ReadInputRegistersResponse': 4]
2024-06-12 00:57:59.363 DEBUG (MainThread) [pymodbus.logging] Getting transaction 16828
2024-06-12 00:57:59.363 DEBUG (MainThread) [pymodbus.logging] Frame check failed, missing part of message len 11, MBAP len 260 !!

Additional information

I've resorted to sprinkling a generous amount of max_value / min_value properties across my configuration, to at least limit the impact on long term statistics and make history graphs usable again. (One of these excessive values in a history graph dwarfs everything). In some cases, that's not practical (ever-increasing "total" sensors that will at some point reach those limits). Also, max_value and min_value force the excessive value to the defined limit, thus affecting statistics. IMHO frames that failed such checks should be dropped from further processing, no?

home-assistant[bot] commented 3 months ago

Hey there @janiversen, mind taking a look at this issue as it has been labeled with an integration (modbus) you are listed as a code owner for? Thanks!

Code owner commands Code owners of `modbus` can trigger bot actions by commenting: - `@home-assistant close` Closes the issue. - `@home-assistant rename Awesome new title` Renames the issue. - `@home-assistant reopen` Reopen the issue. - `@home-assistant unassign modbus` Removes the current integration label and assignees on the issue, add the integration domain after the command. - `@home-assistant add-label needs-more-information` Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue. - `@home-assistant remove-label needs-more-information` Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


modbus documentation modbus source (message by IssueLinks)

janiversen commented 3 months ago

Frame check failed means the packet is dropped, no further processing, so your problem is something else, I will investigate

janiversen commented 3 months ago

In need a more complete debug log, the couple of lines you have provided does not show a problem.

janiversen commented 3 months ago

your device duplicates the response: 0x41 0xbc 0x0 0x0 0x0 0x7 0x1 0x4 0x4 0x0 0xe 0x41 0xbc (message should stop here, because you request 4 bytes) 0x0 0x0 0x0 0x7 0x1 0x4 0x4 0x0 0xe 0x0 0x0 (this mdap header is not correct, because the length field suggest 260 bytes).

Apart from that the value is 934332, which seems very high.

So to sum up, analyzing your log, it looks as if your device sends 2 responses, and the second one overlaps the first.

janiversen commented 3 months ago

ooh and btw with tcp/ip the cables are not the cause....unless of course HA connect to a rs485 converter, in which case the converter could have a problem.

Gnarfoz commented 3 months ago

Here's the full log: debug-log.gz

TCP/IP should indeed prevent transmission errors. There's no converter involved, just plain old ModbusTCP + TCP/IP over Ethernet.

Is it possible this is a case of "after switching to async calls, more stuff is caught that would have previously never been visible"?

janiversen commented 3 months ago

No this have nothing to do with the change to async.

The library below the modbus integration prints what it receives, maybe your device sends double packets....you need to check that with Wireshark or a similar tool that sniffs on the ethernet.

I cannot tell you why/where the duplicate packet originates, but I see modbus handling it correctly, value of first packet is set in the entity (although with a high value, due to the duplication), and the second packet is discarded.

Gnarfoz commented 3 months ago

Odd, nothing but HA versions changed in the mean time, so of course, the suspicion falls there. ;-) I'll capture traffic for a while and see what that brings to light.

janiversen commented 2 months ago

Any new findings ? if not this will be closed as solved.

Gnarfoz commented 2 months ago

Hey, sorry for the delay. I had successfully captured an instance of this happening, but wasn't sure how to further interpret the data. modbus-capture.zip ha.log

The issue occurs at 2024-06-27 23:19:36.725 (timestamps match between packet capture and HA log). The affected transaction is transaction 14842 (0x39 0xfa). However, the packet capture shows that it is the transaction before that (14841 / 0x39 0xf9) that has a TCP retransmission occur.

I'm still not sure where that leaves us. Is this something that's unfixable? It started happening without any change to the setup except for keeping HA up to date (i.e., notably, no firmware updates). Maybe my understanding is incorrect, but shouldn't TCP retransmissions be invisible to the application?

janiversen commented 2 months ago

Nice analysis, I will examine it in detail and I expect I at least should be able to see where the problem originates.

I really wish others would do a similar thorough analysis, instead of just yelling "there is a problem".

janiversen commented 2 months ago

It is clearly your device having problems:

Transaction 14842 seen from pymodbus:

send: 0x39 0xfa 0x0 0x0 0x0 0x6 0x1 0x4 0x13 0x98 0x0 0x2
recv: 0x39 0xfa 0x0 0x0 0x0 0x7 0x1 0x4 0x4 0x0 0x0 0x39 0xfa 0x0 0x0 0x0 0x7 0x1 0x4 0x4 0x0 0x0 0x0 0x0 old_data:  addr=None

Looking at the pcap file we see the same data:

Captura de pantalla 2024-07-08 a las 14 34 55

Because the data seen in the pcap file are the same as pymodbus reports, we can safely exclude all home assistant components including the hardware. The extra data originates from your device (If it was the network, it would be in an extra packet).

The good thing is that the next transaction is ok.

Now looking at the values:

Transaction 14840 (request 1 register): 0x00 0x00 14841 (request 1 register): 0x00 0x00 14842 (request 2 registers): 0x0 0x0 0x39 0xfa (the 2 last bytes are start of a new frame) 14843: (request 2 registers): 0x1 0x48 0x0 0x0 14844: (request 2 registers): 0xff 0xfe 0xff 0xff

Since the data part of 14842 is overwritten, this is not something pymodbus can detect !!! This is a problem in the device and must be solved there.

A reason, why you have not seen it earlier, might be because the new modbus version is a lot faster, and thus sends the requests a lot faster. Using different scan_intervals will help prevent this type of problem.

So sorry, but I hope I have shown this is not a home assistant problem.

janiversen commented 2 months ago

Closing as this is clearly not a bug in any home assistant component, but in the device.

Thanks again for a very useful analysis.

Gnarfoz commented 2 months ago

Thank you for narrowing it down, @janiversen! 👍

I'll have to live with setting realistic min_value and max_value parameters and cleaning up the SQLite database periodically. Not sure if it's possible to clean up the statistics automatically as well.

Would you be open to extending the logic of __process_raw_value to optionally return None if the value is above max_value / below min_value, instead of returning the max_value/min_value? Would that be a technically correct way to achieve "ignore values outside of the defined range"?

*) Its unlikely I'll be able to get this problem noticed by the right people via the device manufacturer's customer support. 🤷‍♂️

janiversen commented 2 months ago

Setting max/min sounds wrong to me....a more logical approach (but more complicated) is to make a template sensor, that compare the received value against a variable and only permit x% deviation...once compared the variable is updated with the new value.

This as well as your suggestion is something general, that should not be implemented in every integration but centrally.

Gnarfoz commented 2 months ago

Using max_value/min_value feels natural to me, because the values are all numbers that have a pretty narrow range they can physically be. 5 Gigawatts of solar power is never going to be a valid value for my PV inverter to report - or maybe exactly once and then never again. ;-)

Templating every sensor I'm reading through Modbus indeed sounds more complicated... There's already https://www.home-assistant.io/integrations/filter/#outlier, but I'm not sure I understand the parameters I would use to achieve the desired effect. It replaces outliers with the median, instead of discarding outliers completely. https://www.home-assistant.io/integrations/filter/#range has the same behavior as min_value/max_value have, it clamps anything outside the range to the min or max.

Maybe I should try creating a "range_discard" filter that does almost what the range filter does but returns the unavailable or the previous value instead of clamping...

Either way, new avenues to explore and not something this particular integration can resolve. :-)