karloygard / hassio-addons

My Hass.io addons
6 stars 2 forks source link

Stops sending messages #12

Closed austadt closed 3 years ago

austadt commented 3 years ago

Hi!

First; Amazing job, finally someone with the skills needed are making a good solution for us!

I have a somehow large installation, around 150 datapoints, and this weekend i moved 20 of them over to HA for testing. It works very well!

But after som hours it stops sending messages. It still read updates fine, if i uses a wallswith to dim some lights, the update just fine, but refuses to actually send anyhing itself. I have tried to directly publish to the mqtt topic, no go(Made sure that it works before addon "crash" Simple restart of addon fixes it. Happenend some hours after each restart

Worth to mention is that in early testing i added 3 actuators, uploaded dp-list, all good. 5 more actuators, all good. After that the addon refused to check/download the dp list, so i just uploaded the dpl file to homeassistant config folder

Im on a Intel NUC, Virtualbox with newest homeassistant version. Addon is at 0.28.1

karloygard commented 3 years ago

Hmm, can you attach the logs from such an incident? Turn on verbose in order to get some extra logging. Also, to figure out if it's the mqtt code or the xcomfort code that's failing, you can operate a light switch and see if it shows up in the logs. If it shows up in the logs but doesn't respond to mqtt events, the issue is obviously on the mqtt side.

Also, make sure you are not setting the use_hidapi option, it has similar issues.

Uploading the dp list to the usb stick should work just fine. If you see it again, please grab the logs.

austadt commented 3 years ago

Ive just activated verbose now. It has happened again somewhere between my last post and this one. It still show correct values on everything, and recieves updates from xcomfort devices, but nothing happens when i try to set a dimmer or actuate a relay. It doesent show up in the logs either. I will come back later today with logs.

If you are correct the error is on the mqtt side of it

hidapi is off

austadt commented 3 years ago

When I actuate something manually i get theese in the log 2021/02/15 20:32:21 Device 6562350, type CSAU 01/01-16 sent extended status message: status ON, temp 37C (battery mains-powered, signal good) 2021/02/15 20:31:50 Device 6758146, type CSAU 01/01-16 sent extended status message: status OFF, temp 32C (battery mains-powered, signal good) When trying to control them from within Home Assistant, or with direct mqtt message from NodeRed(xcomfort/3/set/dimmer, yes this is not the same destination, but it works before the error occurs, not after with the given entity) nothing happens in log or with the actuator of course. It works for some hours, then cant control any one of the datapoints

If i monitor the xcomfort/13/set/switch in NodeRed with a debug i can se that the commands gets out from HA. I attach a screenshot from node red. The first two events is me trying to operate a switch inside HA. Event no 3 and 4 is me turning the actuator on and off directly in MRF. Something goes wrong between MQTT broker and xComfortd-go, I belive

Screenshot NR xcomf integrasjon

All i can think of that i have done "wrong" is that i have included two datpoints you dont support, KwH report from heating actuators. The momentary (watt) works fine. Ill try do delete them to see if things improve

I have not altered anything else than dpl-placement. Config:

mqtt_client_id: xcomfort datapoints_file: Dataspunkt.dpl eprom: false ha_discovery: true ha_discovery_prefix: homeassistant ha_discovery_remove: false verbose: true use_hidapi: false eci_hosts: []

I have teamviewer if you want to check it out yourself, the error is active right now

karloygard commented 3 years ago

Nothing interesting in the logs?

Just attach the DPL file, and I can double check that everything looks fine. It's pretty straightforward to add support for new device types, I just don't have these devices myself, so it's a little hit and miss to get things right.

austadt commented 3 years ago

txt version, github wont allow dpl files DP 7 and 17 are the ones

I cant see anything, but i cant see very far back in the log. Is it stored somewhere in a file? If so i can backtrace a little?

Dataspunkt.txt

karloygard commented 3 years ago

Should be trivial to support the so called next gen heating actuators. But, I doubt that they are responsible for your issues.

Tbh, I haven't bothered to figure how to extract full logs from hassio. You can remove the verbose flag again, and errors messages shouldn't drown completely. In any case, we'll figure this out, but I need a hint as to where in the code things are failing. Are you sure that the MQTT server is not the culprit?

austadt commented 3 years ago

Im not sure of anything. The installation is all fresh, the only thing i use mqtt for here are this. I have not changed anything, just followed the installationguides for your integration, and the mqtt one, and oneclicked it to ba a broker. everything is running on a VM with bridged adapter, and i use the NUC Virtualbox official image

Just came to my mind to check mqtt logs. 1613393141: Socket error on client xcomfort, disconnecting.

But i have restarted xcomfortd-go a couple of times today, not sure if this is that. i will take a full reboot of the VM now to see if this is unprovoked or a result of rebooting only the addon

karloygard commented 3 years ago

I pushed 0.29 now, with some support for the new heating actuators. The MQTT library is also bumped, but I doubt that this will resolve your issues. However, who knows. Give it a spin and let me know.

austadt commented 3 years ago

I updated as soon as it was pushed - no go The issue still arises some hours after restart, im guessing 3 hours. Last evening i removed one device completly - the heating actuator. CHAU-01/01-10E (in HA it shows up with the correct devicetype, heating actuator, and it looked like it was controllable, on and off like a normal switching act. After i removed it the system has been stable, mostly. Earlier when the error has emerged the hole outgoing part of it has stopped working, incoming messages had worked. After I removed the heating actuator everything is working, exept after roughly the same amount of time one device stops sending. I dont know if its random, but its a old version switching actuator, the thicker one. This is a different kind of error, integration recives messages if i actuate it from MRF, just like before, but when i try to control it from HA nothing happens, but i can see the command in the log - but nothing on the RF side

Skjermbilde 2021-02-17 210554

The two first regarding "Panelovn kjellerstue" is me turning it on/off directly in MRF, the last DP13 is me trying to control them in HA

karloygard commented 3 years ago

Okay, I am going to need more logging. Can you turn verbose back on and grab some logs? One from normal operation, hopefully one from around when things are going wrong, and finally a log from when things have failed.

I also need to see which firmware version the USB stick is running, that's logged first thing.

austadt commented 3 years ago

Hi again Due to bad placement of my intel nuc i have paused the testing with the addon. I have installed the non-hassio version on a headless rpi and written a small systemctl script that autoconnects it to my mosquito broker running on the nuc. It works very well, and is a part of the road to have at least to usb ci carfully placed around the house. I will shortly check if the issue with non-responding actuator, but todays testing has been positive - i have not done anything else than moved my installation.

But further testing reveals something else. When i add a light scene in HA and add several(3) actuators, the messages to all of them is sendt very fast. in 10 attempts, 9 of them resulting in only one dimmer acting. the last attempt actuated 2. the third never responded. This is the the log:

Feb 19 22:13:22 rpioppe xcomfortd-go[531]: 2021/02/19 22:13:22 MQTT message; topic: 'xcomfort/1/set/dimmer', message: '8' Feb 19 22:13:22 rpioppe xcomfortd-go[531]: 2021/02/19 22:13:22 MQTT message; topic: 'xcomfort/3/set/switch', message: 'false' Feb 19 22:13:22 rpioppe xcomfortd-go[531]: 2021/02/19 22:13:22 handle_events: error: libusb: interrupted [code -10]

Ill turn on verbose to check for more logging

I know sensio runs standard 800ms wait between messages, and i have successfully run my installation at 500ms pause without any msg loss

karloygard commented 3 years ago

Can you grab the start of the logs, it says which firmware the USB stick is running. I have a sneaking suspicion that your firmware needs to be updated.

austadt commented 3 years ago

This log is huge.. Ive activated 2 different scenes back and fourth several times. Suddenly one of the actuators "dies", you can se them at the very end of the log

I tried to update fw yesterday, but the one i found online looked like the same that already was installed. maybe im mistaken Im setting up a scene with more actuators to see what happens

Logfile.txt

austadt commented 3 years ago

After some runs the CI stops sending to a couple more actuators. The last 4 are hanging in there

New log file, continiuing from last point: [Uploading Logfile.txt…]()

austadt commented 3 years ago

Sorry, trying again Logfile.txt

karloygard commented 3 years ago

Thanks, logs are good. Your firmware is the latest that I am aware of, so that's not the issue.

Can you point out timestamps in the log where you think things are going wrong, so we are looking at the same messages? Are you building xcomfortd-go yourself (might make things a little easier to figure out)?

austadt commented 3 years ago

I have just rebooted my pie to ensure everything is online - i can now control all devices. i check the logs after every on/off command, i get normal mqtt in, response from dimmer are ok. Then i activate one scene that contains 7 devices. Already at the first run i get 7 mqtt messages, and 6 responses, the 7th devices now seems to be banned from getting messages from theCI stick. I can still control it with pushbutton and Sensio, but not via CI stick plugged in the rpi.

Yes as from this evening im using the non-ha repo, build with go version go1.16 linux/arm, started with this command: /home/pi/xcomfortd-go/xcomfortd-go -eprom -verbose -hadiscovery --server tcp://rpi:password@homeassistant.lan:1883

Log from this spesific event: Logfile2.txt

At 23:07:36 there is an libusb error. I belive this is what causes the error. i have no clue whats behind the code here, but it feels like the driver is dropping all communication to the spesific object after it has occured. To be clear, i still get status messages when controlling the actuator from a another source

karloygard commented 3 years ago

The libusb error can be ignored, it's nothing to worry about. I'll dig into the logs a bit, it kinda looks like there are messages that are silently being lost, which would cause total breakdown. However, it's getting late, I'll inspect more tomorrow.

karloygard commented 3 years ago

I've added some extra logging to get a better idea of what is failing. The main issue seems to be that you are updating many actuators in parallel, leading to flooding of some sort. I've spotted at least one message that gets dropped on the ground, directly leading to the behaviour you describe. I'd like to better understand if this is a frequent issue, in order to come up with a fix.

Try the new code and send me logs after things have failed. If you can get several switches to stop responding, that would be even better.

austadt commented 3 years ago

im sorry for late reply.. Updated and rebuild the repo saturday morning

Since last i havent used home assistant very much, due to a lack of spare time, but almost every time I have tested it (lets say each 6th hour ish) most of the devices has been "offline", fixed by reboot

I have just done som test-cycles, and after one of the reboots i had a timeaccount issue, immediately after reboot(i reboot using SSH, not powercycle, always, as its fast and easy) The first lines in this log is from that boot. The same boot also failed to load EPROM, so had to reboot once more(done a few minutes later)

Then i switched a few things on, so all my lights in the scene was on, then started to activate my "normal" and "dimmed" scene. In between i have adjusted some small scene values for easier spotting. Around 23:40 two actuators fails. The datapoints failing in this log is 19 and 22

timeaccount.txt

karloygard commented 3 years ago

Devices on the 868 MHz band may only transmit 1% of the time, the time account message means that the stick has spent that 1% and needs to wait a bit to send more messages. I'm a bit impressed, how messages are you sending per hour?

Thanks for the logs, will inspect when I have a bit more time.

austadt commented 3 years ago

In total the system might be in the upper scale for a small house, but its not THAT big. I have 2 Sensiocontrollers running - this is the path in migrating all over to HA, hopefully

I have never run into timeaccount-issues with Sensio - but care has been taken to avoid it. All temperature-sensors send at medium accuracy, and with 55minutes cyclic sending(the heating actuators are controlled with timer, if no new "on"message has been recieved after 2 hours it turns itself off(timer is locally set up on the actuator) and sensio sends out a new on-message for each temperaturemessage it gets. That means it can miss one message and still run uninterupted, but shuts off in the controller dies etc. The CI on the raspberry just recieves the same messages, and the only messages it puts out "by itself" if status ack, and two heating zones i have programmed in the same way as sensio via node red - it resends "on" every 15 minutes i think. The timeaccount MAY have been provoced by rapidly changing from "normal" to "dimmed" scene several times, but i cant recall if that was done here. I suspect that the timeaccount is programmed in another timescale, maybe 5 minutes or 15, not an full hour, Also, i am not sure of the CI "picks up" the messages not ment for it and counting them as well - anyway that should not be a problem, as Sensio never has reported timeaccount error, exept maybe a couple of times when the kids are playing in our bed and activates night/morning scenes every time.

One way around the timeaccount is to program the scenes directly i MRF. I know many xcomfort-guys like it that way, but i love/want the flexibility by having it in the top-system.

Do you think adjusting the wait-time between TX-messages might help with this issue im experiencing? Or are the error somewhere else? Can my stick be faulty?

austadt commented 3 years ago

Skjermbilde 2021-02-22 002752

karloygard commented 3 years ago

I have a suspicion what the issue might be, I've added some extra logging, please give the code a spin and send me logs after things have failed. This should let me see if this indeed is the issue.

The timeaccount issue may simply be due to your testing. If this arises in real life use, we might have to come up with a solution, but I'm not too concerned at the moment.

austadt commented 3 years ago

Todays logfile

One actuator failes immidiatly, i dont think it reacted to the first command, but dident quite notice. The second fail at roughly 19:17:22. Then a very slow actuating around 19:19:26, but it might just wait for its turn to send. Next actuator fails at 19:20:28 aprox

Logfile.txt

karloygard commented 3 years ago

Regrettably, my suspicion was wrong, so it's back to the drawing board.

I'll try to reproduce the issue here, but I've added even more logging, so you can test that in the meanwhile, if you have time.

austadt commented 3 years ago

Ill take a look tonight. Thank you for all your effort!

Im sorry for all the extra work this is producing for you

I have another consern about the program also. It seems that if the MQTT broker(using the official addon in HA) goes offline – if i reboot the HA VM - It will never reconnect. I havent checked, but the program even might get terminated. I autorun it via systemdctl-script, and have a wait-for-network-connected string in there somewhere

Fra: Karl Anders Øygard notifications@github.com Sendt: onsdag 24. februar 2021 14:22 Til: karloygard/hassio-addons hassio-addons@noreply.github.com Kopi: Tore Austad Tore.Austad@nyinst.no; Author author@noreply.github.com Emne: Re: [karloygard/hassio-addons] Stops sending messages (#12)

Regrettably, my suspicion was wrong, so it's back to the drawing board.

I'll try to reproduce the issue here, but I've added even more logging, so you can test that in the meanwhile, if you have time.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/karloygard/hassio-addons/issues/12#issuecomment-785070519, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AJF5SOCIZ5D7E7ZWO4V65BLTAT4QHANCNFSM4XUCRHGA.

austadt commented 3 years ago

Probably randam, but it took a little longer to get a fail today. It happens right before 18:36. Two actuators, I think one of them failed one cycle or something before the other, but I dident catch it

Logfile.txt

karloygard commented 3 years ago

I was able to reproduce the issue locally, it looks like some messages are dropped silently. I would not be surprised if this is a bug in the USB stick, I have seen that in the past, although not at this level. :(

In any case, give the latest iteration a shot. FYI, I have not made any changes to the MQTT yet.

dreeas commented 3 years ago

Hi, I made a comment on this other tread as I am having the same issue on a new install I made for testing. Let me know if there is anything I can do to help troubleshooting.

br Andreas

karloygard commented 3 years ago

@dreeas Can you clarify a little the issues that you are seeing? Not sure that this is the same issue that @austadt is seeing.

dreeas commented 3 years ago

Sure.. It is just what @austadt described in the first post. After some time it stops sending messages but receiving is fine. Then I have to restart the xcomfort addon to be able to send commands. Also restarting Home Assistant triggers the same result, receiving is fine but not sending.

dreeas commented 3 years ago

If you believe this is another issue I can of course make a new tread?

See attached logs from restart of HA and restart of only addon. After restarting HA I have to restart the addon once more to get commands to start working. Could it be that when restarting HA the datapoints is not loaded correctly? But still able to receive data? I will try to time how long it takes for the commands to stop working.

xcomfort addon after addon restart.txt xcomfort addon after HA restart.txt

karloygard commented 3 years ago

From experience, when HA is restarted, it takes a while for it to catch up with itself. Wait a little bit and see if that resolves anything.

I'm investigating an issue where the add-on loses connection to the MQTT broker and doesn't recover, but that doesn't look like the issue here.

karloygard commented 3 years ago

I have pushed 0.31 with two different fixes, one for messages that are silently lost, and one which attempts to straighten things out when MQTT disconnects. Test it and let me know if it resolves anything.

dreeas commented 3 years ago

After fixes of MQTT disconnect and retention of messages the addon has been working rock steady for the last days, even after restarts.

austadt commented 3 years ago

Your effort is greatly appretiated! I cannot see any «buy me a coffee button»?

Ive moved my entire installation over to this last night. Roughly 60 dps on a raspberry, and the same directly in a NUC running HA gathering messages from both. While configuring i had an issue, but probably not related, a reboot fixed everything, and it appears to be a lot more robust now. I will let the system «rest» some days to look for bugs, but so far it looks great.

I have a couple of request tho. The temperature inputs are missing battery status – that would have been awesome I have a analog input(CAEE-02/01) – It appears as analog input, but contains nothing – support for this would have been great The EMS actuators wont include kWh entity Heating actuator cannot be controlled – goes in emergency mode, not responding to commands

If its possible to include EMS energy and analog input in a MQTT event (like i use to catch switches, Node red listens on xcomfort/DP/event and evalutes it) thats good enough for me

Med vennlig hilsen Tore Austad Elektriker

Nyvold Installasjon AS -et selskap i Vintervollgruppen Inge Krokanns vei 14 , 7340 Oppdal Telefon: 72400600 Mobil: 92491059 E-post: Tore.Austad@nyinst.nomailto:Tore.Austad@nyinst.no

www.nyinst.nohttp://www.nyinst.no [cid:image003.jpg@01D7100E.366BCCE0] P Tenk på miljøet – vurder behovet før du tar utskrift av e-posten

Fra: dreeas notifications@github.com Sendt: onsdag 3. mars 2021 08:51 Til: karloygard/hassio-addons hassio-addons@noreply.github.com Kopi: Tore Austad Tore.Austad@nyinst.no; Mention mention@noreply.github.com Emne: Re: [karloygard/hassio-addons] Stops sending messages (#12)

After fixes of MQTT disconnect and retention of messages the addon has been working rock steady for the last days, even after restarts.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/karloygard/hassio-addons/issues/12#issuecomment-789514440, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AJF5SOAP5LNDMBRW2EAMRPDTBXS45ANCNFSM4XUCRHGA.

karloygard commented 3 years ago

I have added a «buy me a coffee button» for your benefit now. :)

Glad to hear that it's working well for you guys now. Please keep an eye on it and let me know if you see stability issues!

Regarding the requested changes, create separate issues for them, and I can take a look. Shouldn't be too hard to add support for those as well.

austadt commented 3 years ago

Was playing with my front end just now, and switching two actuators on /off with a minute or two interval. Suddenly none of them responed. One of them running on HA addon, one on te RPI. Sadly, i do not have logs for the RPI, just the addon. Absolutly nothing can be controlled after this incident. I have not updated, running 0.32.

Also, unable to update HA addon, get error. Spins at taking snapshot for a couple of minutes, then this. id i deactivate snapshot I get the same error immidiatly.

Skjermbilde 2021-03-04 204524

Logfile.txt

karloygard commented 3 years ago

Something is wrong with the image. Rebuilding.

karloygard commented 3 years ago

It looks like messages are dropped still, even after I backed off to send only a single message at a time. That is surprising, to be honest. Looks like I have to come up with a different fix.

karloygard commented 3 years ago

The 'NoneType' object has no attribute 'is_file' looks like a bug in the supervisor. Upgrade to supervisor-2021.03.4 and things are well again.

karloygard commented 3 years ago

This issue was never completely resolved, apologies.

The CI appears to sometimes (rarely) completely ignore issued commands. I've issued a fix in 0.35 which resends commands after ten seconds in such cases. Please give it a shot and see if it fixes your issue.