Closed Gnarfoz closed 2 months ago
Hey there @janiversen, mind taking a look at this issue as it has been labeled with an integration (modbus
) you are listed as a code owner for? Thanks!
(message by CodeOwnersMention)
modbus documentation modbus source (message by IssueLinks)
Frame check failed means the packet is dropped, no further processing, so your problem is something else, I will investigate
In need a more complete debug log, the couple of lines you have provided does not show a problem.
your device duplicates the response: 0x41 0xbc 0x0 0x0 0x0 0x7 0x1 0x4 0x4 0x0 0xe 0x41 0xbc (message should stop here, because you request 4 bytes) 0x0 0x0 0x0 0x7 0x1 0x4 0x4 0x0 0xe 0x0 0x0 (this mdap header is not correct, because the length field suggest 260 bytes).
Apart from that the value is 934332, which seems very high.
So to sum up, analyzing your log, it looks as if your device sends 2 responses, and the second one overlaps the first.
ooh and btw with tcp/ip the cables are not the cause....unless of course HA connect to a rs485 converter, in which case the converter could have a problem.
Here's the full log: debug-log.gz
TCP/IP should indeed prevent transmission errors. There's no converter involved, just plain old ModbusTCP + TCP/IP over Ethernet.
Is it possible this is a case of "after switching to async calls, more stuff is caught that would have previously never been visible"?
No this have nothing to do with the change to async.
The library below the modbus integration prints what it receives, maybe your device sends double packets....you need to check that with Wireshark or a similar tool that sniffs on the ethernet.
I cannot tell you why/where the duplicate packet originates, but I see modbus handling it correctly, value of first packet is set in the entity (although with a high value, due to the duplication), and the second packet is discarded.
Odd, nothing but HA versions changed in the mean time, so of course, the suspicion falls there. ;-) I'll capture traffic for a while and see what that brings to light.
Any new findings ? if not this will be closed as solved.
Hey, sorry for the delay. I had successfully captured an instance of this happening, but wasn't sure how to further interpret the data. modbus-capture.zip ha.log
The issue occurs at 2024-06-27 23:19:36.725 (timestamps match between packet capture and HA log). The affected transaction is transaction 14842 (0x39 0xfa). However, the packet capture shows that it is the transaction before that (14841 / 0x39 0xf9) that has a TCP retransmission occur.
I'm still not sure where that leaves us. Is this something that's unfixable? It started happening without any change to the setup except for keeping HA up to date (i.e., notably, no firmware updates). Maybe my understanding is incorrect, but shouldn't TCP retransmissions be invisible to the application?
Nice analysis, I will examine it in detail and I expect I at least should be able to see where the problem originates.
I really wish others would do a similar thorough analysis, instead of just yelling "there is a problem".
It is clearly your device having problems:
Transaction 14842 seen from pymodbus:
send: 0x39 0xfa 0x0 0x0 0x0 0x6 0x1 0x4 0x13 0x98 0x0 0x2
recv: 0x39 0xfa 0x0 0x0 0x0 0x7 0x1 0x4 0x4 0x0 0x0 0x39 0xfa 0x0 0x0 0x0 0x7 0x1 0x4 0x4 0x0 0x0 0x0 0x0 old_data: addr=None
Looking at the pcap file we see the same data:
Because the data seen in the pcap file are the same as pymodbus reports, we can safely exclude all home assistant components including the hardware. The extra data originates from your device (If it was the network, it would be in an extra packet).
The good thing is that the next transaction is ok.
Now looking at the values:
Transaction 14840 (request 1 register): 0x00 0x00 14841 (request 1 register): 0x00 0x00 14842 (request 2 registers): 0x0 0x0 0x39 0xfa (the 2 last bytes are start of a new frame) 14843: (request 2 registers): 0x1 0x48 0x0 0x0 14844: (request 2 registers): 0xff 0xfe 0xff 0xff
Since the data part of 14842 is overwritten, this is not something pymodbus can detect !!! This is a problem in the device and must be solved there.
A reason, why you have not seen it earlier, might be because the new modbus version is a lot faster, and thus sends the requests a lot faster. Using different scan_intervals will help prevent this type of problem.
So sorry, but I hope I have shown this is not a home assistant problem.
Closing as this is clearly not a bug in any home assistant component, but in the device.
Thanks again for a very useful analysis.
Thank you for narrowing it down, @janiversen! 👍
I'll have to live with setting realistic min_value
and max_value
parameters and cleaning up the SQLite database periodically.
Not sure if it's possible to clean up the statistics automatically as well.
Would you be open to extending the logic of __process_raw_value to optionally return None
if the value is above max_value
/ below min_value
, instead of returning the max_value
/min_value
?
Would that be a technically correct way to achieve "ignore values outside of the defined range"?
*) Its unlikely I'll be able to get this problem noticed by the right people via the device manufacturer's customer support. 🤷♂️
Setting max/min sounds wrong to me....a more logical approach (but more complicated) is to make a template sensor, that compare the received value against a variable and only permit x% deviation...once compared the variable is updated with the new value.
This as well as your suggestion is something general, that should not be implemented in every integration but centrally.
Using max_value
/min_value
feels natural to me, because the values are all numbers that have a pretty narrow range they can physically be. 5 Gigawatts of solar power is never going to be a valid value for my PV inverter to report - or maybe exactly once and then never again. ;-)
Templating every sensor I'm reading through Modbus indeed sounds more complicated...
There's already https://www.home-assistant.io/integrations/filter/#outlier, but I'm not sure I understand the parameters I would use to achieve the desired effect. It replaces outliers with the median, instead of discarding outliers completely.
https://www.home-assistant.io/integrations/filter/#range has the same behavior as min_value
/max_value
have, it clamps anything outside the range to the min or max.
Maybe I should try creating a "range_discard" filter that does almost what the range filter does but returns the unavailable
or the previous value instead of clamping...
Either way, new avenues to explore and not something this particular integration can resolve. :-)
The problem
A couple of times a day I get a wildly excessive value for a (mostly?) random Modbus sensor. Excessive meaning values being in the millions or billions for things that are supposed to be in the thousands or tens of thousands. Things like power usage in Gigawatts. Negative and positive values are affected.
At first I thought it was a one-off thing, then I blamed badly shielded network cables. It keeps happening now, and I've changed cables as well. I've also come across reports from other people, so it's unlikely to be caused by something specific to my setup. Firmware on the involved devices is the same as before this behavior started appearing.
The debug log I've gathered over 24 hours includes it happening twice to the same sensor and once to another sensor. For context purposes, I've included 2 transactions before and after each occurrence. For comparison purposes, I've included a "good" and a "bad" transaction for each occurrence. It is obvious that for the bad transactions, the response is quite a bit longer than for the good transactions.
What version of Home Assistant Core has the issue?
core-2024.6.2
What was the last working version of Home Assistant Core?
March or April 2024
What type of installation are you running?
Home Assistant Container
Integration causing the issue
Modbus
Link to integration documentation on our website
https://www.home-assistant.io/integrations/modbus/
Diagnostics information
debug.log
Example YAML snippet
Anything in the logs that might be useful for us?
Additional information
I've resorted to sprinkling a generous amount of
max_value
/min_value
properties across my configuration, to at least limit the impact on long term statistics and make history graphs usable again. (One of these excessive values in a history graph dwarfs everything). In some cases, that's not practical (ever-increasing "total" sensors that will at some point reach those limits). Also,max_value
andmin_value
force the excessive value to the defined limit, thus affecting statistics. IMHO frames that failed such checks should be dropped from further processing, no?