[BUG]: Plugwise USB very unstable on HA 2024.2.4

Tigerstrength commented 6 months ago

Describe the bug.

Hi to all!

Last week my stretch got corrupted due to a sudden power-loss at the exact moment of a configuration store. There seems no option te revive the Stretch (factory reset doesn't solve the corrupt files and I am unable to find a Stretch firmware image anywhere to overwrite the corrupted xml file with a stock one). So I pulled the stick from the stretch and switched to Plugwise USB beta integration. It is running for 5 days now, but things are no where to being stable.

I run this environment on my Synology with VMM and these packages:

Core 2024.2.4 Supervisor 2024.02.0 Operating System 12.0 Frontend 20240207.1

What I encounter is that 4 circles where discovered within 1 hour. Then over the next days slowly 1 circle at the time is being added. I have ~15 active circles. The network was rock solid when in use with the stretch and no circles where moved during the migration. Also the USB stick is within 20cm of the location where it was inside the stretch.

After 5 days the integration found 12 circles, but the last took 2 days to be found. The main problem is that the circles randomly show Not available and seem to drop from the network. Even the circle within 30cm of the stick (RSSI -45dbm) and the circle within 30 cm of the Circle+ show this behaviour. Beside that is see huge memory spikes every couple of hours with also huge CPU load. Sometimes my HA installation freezes due to running out of memory. When enabling debug log I see thousands of lines of communication of the stick to the circles and seems completely overwhelming the network and sometimes HA as well.

Is there anything I can do to fix this? Can i provide logging to help debug this issue? Please note that my stick has connected to an other 20 circles in the past, but they are offline for years and function as a backup for the future when plugs are dying due to there age. The stretch did see those circles as offline, but never seemed to be impacted by the fact they where there but being offline.
The usb stick has a stable connection with HA and this is confirmed by the logging marking the stick always as 'ready'.

I am a kind of lost and losing the WAF due this issues :) Please advise on how to continue.

What version of Plugwise Beta are you using?

0.40.6

What was the last working version of Plugwise Beta (or Core if not using beta before)?

No response

What version of Home Assistant Core are you using?

2024.2.4

What type of installation are you running?

Home Assistant OS

How did you install plugwise-beta?

HACS

What kind of Plugwise device is having issues?

USB: Stick

What firmware version is your Plugwise product at?

2011-06-27

Logging

No response

Diagnostics information

No response

Additional information

afbeelding

Tigerstrength commented 6 months ago

log.txt

dirixmjm commented 6 months ago

Can you try bumping the backend to 0.31.3b0 inside the custom_component manifest file? ~/.homeassistant/custom_components/plugwise_usb/manifest.json "requirements": ["plugwise-usb==0.31.3b0"]

restart afterwards.

Tigerstrength commented 6 months ago

Thanks for your reply! I did the requested change (bump to 0.31.3.b0 and reboot). Will monitor the situation and report back soon.

Tigerstrength commented 6 months ago

Still the same behaviour. I disabled Plugwise intergration to double check it is the root cause of this HA freezing due to running out of CPU/Memory. Results will be shared later today. Also it showed only 5 of the 12 found circles online (and missed the remaining 3 still).

Tigerstrength commented 6 months ago

HA is solid from the moment I switch the plugwise integration off (around 4pm), see this graph: afbeelding

I looks very similar to this problem: https://github.com/plugwise/plugwise_usb-beta/issues/10

dirixmjm commented 6 months ago

I must admit my production is still on 2023.12.4, but have PlugWise stick running stable. It takes quite some memory (1 GB), but disabling the integration does not recover that memory usage. I'll try and update of HASS to 2024 shortly and see what it does.

Like in #10 are you using a Raspberry PI?

I have further found that sometimes it helps to reset the network and rebuild (I have done so using the original Source). Cannot promise it will though, and keep in mind that it resets the history counters.

dirixmjm commented 6 months ago

I upgraded to 2024.2.5 (and Python 3.12). I cannot say my network is showing the same issue. Thus the problem should be somewhere else.

Do you have the possibility to rebuild the network using Source? Otherwise it might help to generate a log, however I'd prefer info-level log only, the debug logs are just too big.

Tigerstrength commented 6 months ago

So, I ran HA without plugwise enabled for a full day and can confirm a rock solid performance for the rest of the HA system. That confirms Plugwise-USB is indeed the root cause. I don't use a RPI, but run it on virtual machine manager on a powerful Synology with 8Gb RAM.

Thanks for testing this update on your system @dirixmjm ! May I ask how many plugs you are using in your setup and on what hardware you are running it? I discovered when going into my VM with SSH and running the top command that my memory was critically low the moment Plugwise started. I only had 50-100mb left from the allocated 2gb of RAM. I now doubled the amount of RAM and CPU cores for the VM and will check the results again.

Also I contacted Plugwise to ask for a license key for Plugwise Source. I don't mind losing plug usage history, but I can not reach all of my plugs to reset them aswel (are build in all kind of places). Do the plugs need a reset as wel when using source to reset the network, or will they automatic rejoin afterwards?

dirixmjm commented 6 months ago

I have one network with 12 circles and one network with 4. This used to be some 25 in total including a couple of stealths, but with the state of PW in HAS and failing WAF, I replaced some vital ones with zwave devices.

Initially you could try to just initiate Source, in which case the network will be kept. At some point I did a reset of the C+ after which you need to reset all circles and rebuild. I am not sure whether that or any of this actually helped to get things stable, I also made many changes to the backend code.

I can also provide you with the key if you don't get an answer, just write me an email marc at dirix dot nu.

dirixmjm commented 6 months ago

Forgot to add, I am running on an apu2e4 board with 4GB RAM: https://pcengines.ch/apu2e4.htm

Tigerstrength commented 6 months ago

So your setup is not that different from mine in regards to the amount of circles and computing power (and 4GB mem). That give some hope.

Since doubling the amount off mem from 2 to 4GB the plugin runs longer and discovered all of my currently active circles . I did also remove most of the unused (offline) circles using the remove service from the plugin, which from te logs seems to make the network perform less pingrequests. However I still see memory issues. After a couple of hours the integration seems to eat all the memory at once and crashes HA. After the latest chrash 2 circles are showing offline again.

I have the feeling that there is a memory leak somewhere which is triggered due to huge amounts of data read from the network when using a large setup with a lot of offline plugs aswel. It would help if we had a setting to manage the requested info the stick is querying the circles for. E.g. the interval of which new energy data is pulled and whether a user wants the stick to collect all the kwh usage data. Personally I just need the current power and the switch functionality. HA is more then capable enough to calculate the consumption from that data using the intergration sensor.

Anyway, I will look in the next few days some more and see if the logging provides any clue about the sudden memory leak happening. And I got a license from Plugwise support (credits to them!), so I can look more detailed into my Plugwise network.

dirixmjm commented 6 months ago

You could try disabling the energy consumption logs and see if that helps. In you comment lines 221 until 245 in lib/python3.12/site-packages/plugwise_usb/nodes/circle.py

Tigerstrength commented 6 months ago

Thanks, I will try that later on.

I also have done some new debugging. Following your advise I did a network setup with PW Source. I deleted all but two of my spare circles from the network and ran it for a couple of hours. From Source every circle is found within 5 mins and it stays rock-solid. Then I transferred the stick back to HA and it ran fine for the first two hours. Then again the memory leak and crash, but this time I saw it happening and was able to safe the debug logging just before the crash. I think the cause is visible in this logging. It start sending node requests to one node almost non-stop! This node ( 000D6F0005B10EDA ) is then rediscovered over and over and over again. Meanwhile memory and CPU shoot up and HA crashed within a minute from starting that loop.

I looked up the node type of 000D6F0005B10EDA and it the 2 button wallswitch module which I have and use to locally switch a Cirle. Maybe that is the difference between others installations and tricks the main service loop? It seems like the discovery of this node type is not going as it should. Also this battery type node should not be requested as it will only responds once a day or during a buttonpress.

The runaway loop starts at 2024-03-02 15:03:58.732. Debug log.txt

EDIT: I can confirm the behavior. When pressing a button on one of my remotes the noderequests to that nodeid start nonstop and HA hangs. Strange thing is that I did not use these buttons/switches the last day, but they do send an update on their own every now and then. This might be the cause of all this. I remove all battery's and run an other test.

dirixmjm commented 6 months ago

That makes sense, I don't have any buttons in my network right now. I do have one unit here but no battery. I'll check the log whether I can see an easy fix along next week.

Tigerstrength commented 6 months ago

Almost 24 hours without any hickup, I am pretty hope-full this was the cause. Would be awesome if you can find a fix! Thanks in advance for your effort.

bouwew commented 5 months ago

@Tigerstrength, the lastest version you could test is https://github.com/plugwise/plugwise_usb-beta/releases/tag/v0.40.7a2

Tigerstrength commented 5 months ago

Thanks @bouwew! I switched my test HA system to the 0.40.7a2 beta. Then I reset the swich module and transferred it to my test Plugwise network. I seems not to crash the system anymore, but instead gives these log entry's:

homeassistant | 2024-03-10 22:08:32.038 WARNING (MainThread) [plugwise_usb.connection.receiver] No subscriber to handle NodeSwitchGroupResponse, seq_id=b'FFFF' from 000D6F0000728F6F after 10 retries homeassistant | 2024-03-10 22:08:33.003 WARNING (MainThread) [plugwise_usb.connection.receiver] No subscriber to handle NodeSwitchGroupResponse, seq_id=b'FFFF' from 000D6F0000728F6F after 10 retries

Can you comment some more on the entity changes? I have just finished building my Plugwise USB entity's on my production system and changed all automation to run with that instead of the stretch entity's. I am very careful not to lose those entity's again as it would cost a full day to recover from that.

bouwew commented 5 months ago

@brefra please see above message.

ArnoutD commented 5 months ago

@Tigerstrength With the 0.40.7 you switched from the threaded to the async implementation, have a look at 150 for some background on the testing of @brefra's work on making the library async.

The 'No subscriber to handle' is currently a known (issue) which can be related to duplicate messages send by the Stick as your circle set is 50% bigger and might be more physically spread out then my set (but I have it as well).

For me the implementation is stable but my circle set is "limited" and quite near each other.

brefra commented 5 months ago

The No subscriber to handle xxx warning messages is something I'm investigating. There seems to be some kind of bug (race condition of not switching/releasing tasks in asyncio loop) in current code which I have not be able to resolve yet. Some of those messages (like the ... NodeSwitchGroupResponse ... ones are kind of expected during startup, but shouldn't be reported anymore after a few minutes (when bug is fixed)

When migrating from version the current (threaded) version V0.40.6 to the experimental (async) V0.40.7a1 / V0.40.7a2 version does most of the internal identifiers within HA are renamed. The Entity ID are kept the same. Switching from the experimental version back to the current v0.40.6 version will current break (new entities) I'll see if I can release an update tonight to allow easy transition back.

Tigerstrength commented 5 months ago

@Tigerstrength With the 0.40.7 you switched from the threaded to the async implementation, have a look at 150 for some background on the testing of @brefra's work on making the library async.

The 'No subscriber to handle' is currently a known (issue) which can be related to duplicate messages send by the Stick as your circle set is 50% bigger and might be more physically spread out then my set (but I have it as well).

For me the implementation is stable but my circle set is "limited" and quite near each other.

Please note that is tested this 0.40.7a2 on my test system which only has one circle+, one circle and a switch. So the size of the network can’t be the cause of this ‘no subscriber’ error.

I still have a few questions:

My production system runs on 0.40.6. Is it safe to update this to the newly released 0.40.7 without the risk of losing entities when updating/reverting?
Is every 0.40.7 version async or only the 0.40.7a1/2 versions?
Is the fix for the switch included in 0.40.7 or only in 0.40.7a2?

I am struggling a bit with the versions, as you notice, and don’t want take risks on the production system, which runs solid beside the switch bug mentioned in this topic.

brefra commented 5 months ago

Some explanation:

Safe

0.40.6 => original threaded version 0.40.7 => Same as original threaded version 0.40.6 but has additional support to automatically "downgrades", possible changed internal references from previous installed experimental async versions, back to original ones. This only happens when required

Experimental

0.40.7a1 => First experimental async version. Will automatically upgrade internal references for entities from original version. 0.40.7a2 => 2nd improved experimental async version. Will automatically upgrade internal references too

So basically it's "safe" to upgrade from any version to the experimental ones. Downgrading from experimental back must always by installing 0.40.7 first. Once this version has been started the references are reverted back, and you can downgrade to prior versions (which I do not recommend)

The reason to rename the internal refences of entities is to keep the actual "Entity ID's" (which are used i.e. automations) to be te same..

Tigerstrength commented 5 months ago

Very clear and helpful answer, thanks! I did the update to the safe 0.40.7 option on my production system (test system is still running 0.40.7a2).

So @bouwew, is it possible to implement the switch fix in the normal (not async) version too? Or is the experimental version already deemed stable enough to run on my production system?

brefra commented 5 months ago

I've released a new experimental (async) version 0.50.0a1 earlier today. Most cases of the "No subscriber to handle xxx" warning messages are fixed now.

Tigerstrength commented 5 months ago

I've released a new experimental (async) version 0.50.0a1 earlier today. Most cases of the "No subscriber to handle xxx" warning messages are fixed now.

After the update I get these messages in my log: Logger asyncio.txt

The circles might still be empty in regards of energy logs (just been resetted), so that might be the cause.

The missing subscriber error when pressing the SWITCH unit is gone, so that is indeed fixed.

brefra commented 5 months ago

The circles might still be empty in regards of energy logs (just been resetted), so that might be the cause.

That's normal behavior. Resetting a circle will clear the memory containing the energy information. So it takes up at least an hour (or maybe until midnight) to get energy data.

The missing subscriber error when pressing the SWITCH unit is gone, so that is indeed fixed.

I don't have an switch device myself, but was able to add support for it in the backend library based on protocol information. The HA integration should currently only setup the diagnostics network entities. So I'm curious to see if device is at least discovered. I have not put any effort into adding an "event" entity yet, which would give you the ability to see button presses within HA.

Tigerstrength commented 5 months ago

@brefra, so far it has not discovered any entities for the Switch unit. I assume it should present a new device? Beside the Switch this testnetwork does have two circles and one stealth. They are found, but no sign on the Switch diagnostics network entities. Also the entitylist for Plugwise is not showing these Switch entries (728F6F).

afbeelding

Tigerstrength commented 5 months ago

Is it possible to implement a fix for the Switch causing the plugwise plugin to crash? I still can't fully use my production system as intended due to the Switch modules needing to be offline or else HA crashes. We use the Switch modules at different rooms to control the Plugwise circles directly. Besides HA crashing the local control works fine.

Also I am willing to test the Async versions on my test system, and when deemed stable can upgrade my production environment.

Tigerstrength commented 4 months ago

Is it possible to implement a fix for the Switch causing the plugwise plugin to crash? I still can't fully use my production system as intended due to the Switch modules needing to be offline or else HA crashes. We use the Switch modules at different rooms to control the Plugwise circles directly. Besides HA crashing the local control works fine.

Also I am willing to test the Async versions on my test system, and when deemed stable can upgrade my production environment.

@bouwew , @brefra, please see my question above. Plugwise is running rock solid on latest stable beside the Switch causing it to crash. Is it possible to easily fix this in the non async stable release? We can’t use our Switch modules and I don’t feel comfortable to run the experimental async version on my production environment. 😊

plugwise / plugwise_usb-beta