Unable to control devices after 30-50 minutes

AfromD commented 6 years ago

I have reported this previously in the forum. My Ember stick (nortel) will no longer control any devices after running for less than an hour. It will still register changes on the network. The following log shows me commanding brightness on a lamp to zero at 11:13:33. In Paper UI, the brightness updates to 0, however, the lamp itself does not react. Meanwhile, you can see a motion sensor updating, which is acted upon correctly by openhab.

Properties of the zigbee thing I tried to control:

zigbee_logicaltype	ROUTER
zigbee_powerlevel	FULL
modelId	LCT010
zigbee_networkaddress	27819
zigbee_powersource	MAINS
zigbee_stkversion	1
zigbee_datecode	20160906
zigbee_zclversion	1
zigbee_routes	[]
zigbee_lastupdate
vendor	Philips
zigbee_appversion	2
zigbee_permitjoining	false
zigbee_powermode	RECEIVER_ON_IDLE
zigbee_powersources	[MAINS, RECHARGABLE_BATTERY, DISPOSABLE_BATTERY]
hardwareVersion	1
zigbee_neighbors	[]
firmwareVersion	01000B0E
zigbee_devices	[]

New Text Document.txt

AfromD commented 6 years ago

This is how a functioning command turns up in the log. All the TX commands seem to missing from above posted log. Neues Textdokument.txt

cdjackson commented 6 years ago

I prefer to close this here. Please open an issue on the library.

cdjackson commented 6 years ago

Sorry - I'm going to re-open this here as I'm not convinced this is an ember issue at the moment.

Please can you try and get a log that shows the binding working fine, then TX stopping - ie I'd like to try and catch the point at which it stops sending.

5iver commented 6 years ago

I have been running 2.3.0.201801092213 with an HUSBZB-1 without issues, but I upgraded to OH snapshot 1181 today and grabbed the most recent Zigbee binding 2.3.0.201801131507. It looked like everything was normal, but it was daytime, so not many lights on. I noticed bulbs not coming on and checked OH to find no bulbs responding and some big drooly exceptions in Karaf. I turned on DEBUG, but nothing was coming up. Here is a log of a full startup. No bulbs were responding, and only about half came online. I rolled back to the previous build I was using and everything is working great again.

011318 Karaf exceptions.log 011318 zigbee nonresponsive.zip

I don't know if this is related to the issue reported, but was the most similar open issue.

cdjackson commented 6 years ago

The ember driver hasn't changed for a while so I'd like to say it's not directly related, but... There is probably a small memory leak in the ember driver and it's always been there. The log looks similar and also similar to some of the logs I've seen from you a while ago, so hopefully the fix on Telegesis will also be the same here.

5iver commented 6 years ago

Well, something definitely happened between these two versions to make the binding completely unusable! After several restarts of the binding and OH, I could not get any devices to come up with 2.3.0.201801131507. Maybe it was the OH upgrade? But 2.3.0.201801092213 is working great.

cdjackson commented 6 years ago

Very strange then since the only changes between the two versions are in the Telegesis driver (from a look on Github history)...

cdjackson commented 6 years ago

The latest version has the concurrency fix in it that appears to have solved the issues in the Telegesis binding. Let me know if it's any better with Ember (and please provide an updated log if not).

5iver commented 6 years ago

OH: 2.3.0 snapshot 1181 Zigbee: 2.3.0.201801141204

I stopped Zigbee 2.3.0.201801092213 and AshHandler logs kept coming (I'll get logs if I see this again), so I stopped OH. Updated to the latest jar and turned on debug. After restarting OH and manually starting Zigbee, in about 6 minutes the binding hung without errors that I could see. At no time were devices controllable, but some did come online. You'll see at the end of the log after it looked hung, I tried to turn on and off a bulb (through Habmin) but nothing happened. I had tried a couple time before it completely hung too.

zigbee.zip

cdjackson commented 6 years ago

Looks the same as we’ve seen previously with the TX Queue increasing over time - this is what I’d hoped was fixed with the concurrent queue change :(

5iver commented 6 years ago

In those reports, weren't the devices controllable before they became unresponsive?

cdjackson commented 6 years ago

Yes, probably, but it’s the same here I think… I know you’ll probably say it’s not ;), but in the logs I see the TX queue increasing slowly while commands are going back and forward, so it looks the same - possibly it’s happening faster though.

(nearly!) The only actual commands (i.e. an item state change command from you) is sent after the tx queue had already died. There is one command to node 7CE524000013F7EF that doesn’t have a handler - possibly because it’s not initialised when the command comes in by the looks of it.

5iver commented 6 years ago

OK, after another OH restart, I can control bulbs! And it's been about 13 minutes. It takes about 10s for them to respond though. Hopefully will improve. But at least it isn't consistently non responsive, which I thought was the case.

[EDIT} they are responding much faster now. And all bulbs came online!

cdjackson commented 6 years ago

@openhab-5iver The last failure log is very strange. It shows 'proper' frames are being sent at ASH level even after the higher layers have (apparently!) stopped sending -:

The following is a list filtered on TX CMD: which shows they stop, and yet the TX ASH Frame continues (I've not shown the earlier TX ASH Frame entries for clarity - they do exist.

Not quite sure what this means - maybe the TX queue is filling up, the higher layers stop sending, and then the low layers continue. It's not so easy to correlate all the requests to see if that's the case. Even if it is, I'm not really sure why.

I'll put some more debug in here.

2018-01-14 09:15:26.788 [DEBUG] [com.zsmartsystems.zigbee.ZigBeeNetworkManager     ] - TX CMD: ReadAttributesCommand [Basic: 0/0 -> 46573/1, cluster=0000, TID=D3, identifiers=[1]]
2018-01-14 09:15:29.779 [DEBUG] [com.zsmartsystems.zigbee.ZigBeeNetworkManager     ] - TX CMD: ReadAttributesCommand [Basic: 0/0 -> 57849/1, cluster=0000, TID=D4, identifiers=[6]]
2018-01-14 09:15:29.784 [DEBUG] [com.zsmartsystems.zigbee.ZigBeeNetworkManager     ] - TX CMD: ManagementRoutingRequest [0/0 -> 19104/0, cluster=0032, TID=D6, startIndex=0]
2018-01-14 09:15:29.783 [DEBUG] [com.zsmartsystems.zigbee.ZigBeeNetworkManager     ] - TX CMD: ReadAttributesCommand [OTA Upgrade: 0/0 -> 4547/1, cluster=0019, TID=D5, identifiers=[2]]
2018-01-14 09:15:29.787 [DEBUG] [com.zsmartsystems.zigbee.ZigBeeNetworkManager     ] - TX CMD: ReadAttributesCommand [Basic: 0/0 -> 24779/1, cluster=0000, TID=D7, identifiers=[6]]
2018-01-14 09:15:29.790 [DEBUG] [com.zsmartsystems.zigbee.ZigBeeNetworkManager     ] - TX CMD: ReadAttributesCommand [Basic: 0/0 -> 40893/1, cluster=0000, TID=D8, identifiers=[4]]
...
2018-01-14 09:15:53.418 [DEBUG] [artsystems.zigbee.dongle.ember.ash.AshFrameHandler] - --> TX ASH frame: AshFrameAck [ackNum=6]
2018-01-14 09:15:53.419 [DEBUG] [artsystems.zigbee.dongle.ember.ash.AshFrameHandler] - <-- RX ASH frame: AshFrameData [frmNum=6, ackNum=1, reTx=false, data=D3 90 45 00 04 01 00 00 01 01 40 00 00 00 0F FE B7 ED B5 FF FF 08 08 D3 01 01 00 00 20 04 02]
2018-01-14 09:15:53.423 [DEBUG] [artsystems.zigbee.dongle.ember.ash.AshFrameHandler] - --> TX ASH frame: AshFrameAck [ackNum=7]
2018-01-14 09:15:53.538 [DEBUG] [artsystems.zigbee.dongle.ember.ash.AshFrameHandler] - --> TX ASH frame: AshFrameData [frmNum=1, ackNum=7, reTx=true, data=D1 00 34 00 4B 2F 00 00 01 00 00 00 40 11 00 00 D1 D1 05 00 4B 2F 01 00]
2018-01-14 09:15:53.660 [DEBUG] [artsystems.zigbee.dongle.ember.ash.AshFrameHandler] - TX EZSP: EzspSendUnicastRequest [type=EMBER_OUTGOING_DIRECT, indexOrDestination=57849, apsFrame=EmberApsFrame [profileId=260, clusterId=0, sourceEndpoint=1, destinationEndpoint=1, options=[EMBER_APS_OPTION_ENABLE_ADDRESS_DISCOVERY, EMBER_APS_OPTION_RETRY, EMBER_APS_OPTION_ENABLE_ROUTE_DISCOVERY], groupId=0, sequence=212], messageTag=212, messageContents=00 D4 00 06 00]
2018-01-14 09:15:53.661 [DEBUG] [artsystems.zigbee.dongle.ember.ash.AshFrameHandler] - --> TX ASH frame: AshFrameData [frmNum=2, ackNum=7, reTx=false, data=D4 00 34 00 F9 E1 04 01 00 00 01 01 40 11 00 00 D4 D4 05 00 D4 00 06 00]
2018-01-14 09:15:53.661 [DEBUG] [artsystems.zigbee.dongle.ember.ash.AshFrameHandler] - <-- RX ASH frame: AshFrameData [frmNum=7, ackNum=2, reTx=false, data=D1 80 34 00 1E]
2018-01-14 09:15:53.662 [DEBUG] [artsystems.zigbee.dongle.ember.ash.AshFrameHandler] - --> TX ASH frame: AshFrameAck [ackNum=0]
2018-01-14 09:15:53.783 [DEBUG] [artsystems.zigbee.dongle.ember.ash.AshFrameHandler] - --> TX ASH frame: AshFrameData [frmNum=2, ackNum=0, reTx=true, data=D4 00 34 00 F9 E1 04 01 00 00 01 01 40 11 00 00 D4 D4 05 00 D4 00 06 00]
2018-01-14 09:15:53.941 [DEBUG] [artsystems.zigbee.dongle.ember.ash.AshFrameHandler] - TX EZSP: EzspSendUnicastRequest [type=EMBER_OUTGOING_DIRECT, indexOrDestination=19104, apsFrame=EmberApsFrame [profileId=0, clusterId=50, sourceEndpoint=0, destinationEndpoint=0, options=[EMBER_APS_OPTION_ENABLE_ADDRESS_DISCOVERY, EMBER_APS_OPTION_RETRY, EMBER_APS_OPTION_ENABLE_ROUTE_DISCOVERY], groupId=0, sequence=213], messageTag=214, messageContents=00 00]
2018-01-14 09:15:53.941 [DEBUG] [artsystems.zigbee.dongle.ember.ash.AshFrameHandler] - --> TX ASH frame: AshFrameData [frmNum=3, ackNum=0, reTx=false, data=D6 00 34 00 A0 4A 00 00 32 00 00 00 40 11 00 00 D5 D6 02 00 00]
2018-01-14 09:15:53.942 [DEBUG] [artsystems.zigbee.dongle.ember.ash.AshFrameHandler] - <-- RX ASH frame: AshFrameData [frmNum=0, ackNum=3, reTx=false, data=D4 80 34 00 1F]
2018-01-14 09:15:53.943 [DEBUG] [artsystems.zigbee.dongle.ember.ash.AshFrameHandler] - --> TX ASH frame: AshFrameAck [ackNum=1]
2018-01-14 09:15:53.944 [DEBUG] [artsystems.zigbee.dongle.ember.ash.AshFrameHandler] - <-- RX ASH frame: AshFrameData [frmNum=1, ackNum=3, reTx=false, data=D4 90 3F 00 F9 E1 04 01 00 00 01 01 40 11 00 00 1F D4 00 00]
2018-01-14 09:15:53.945 [DEBUG] [artsystems.zigbee.dongle.ember.ash.AshFrameHandler] - --> TX ASH frame: AshFrameAck [ackNum=2]
2018-01-14 09:15:53.947 [DEBUG] [artsystems.zigbee.dongle.ember.ash.AshFrameHandler] - <-- RX ASH frame: AshFrameData [frmNum=2, ackNum=3, reTx=false, data=D4 90 45 00 04 01 00 00 01 01 40 00 00 00 56 FF B9 F9 E1 FF FF 10 08 D4 01 06 00 00 42 08 32 30 31 34 30 38 31 32]
2018-01-14 09:15:53.959 [DEBUG] [artsystems.zigbee.dongle.ember.ash.AshFrameHandler] - --> TX ASH frame: AshFrameAck [ackNum=3]
2018-01-14 09:15:54.062 [DEBUG] [artsystems.zigbee.dongle.ember.ash.AshFrameHandler] - --> TX ASH frame: AshFrameData [frmNum=3, ackNum=3, reTx=true, data=D6 00 34 00 A0 4A 00 00 32 00 00 00 40 11 00 00 D5 D6 02 00 00]
2018-01-14 09:15:54.222 [DEBUG] [artsystems.zigbee.dongle.ember.ash.AshFrameHandler] - <-- RX ASH frame: AshFrameData [frmNum=3, ackNum=3, reTx=false, data=D4 90 45 00 00 00 01 80 00 00 40 00 00 00 B5 FF B8 4B 2F FF FF 0D 00 00 D6 5A 14 00 00 24 E5 7C 4B 2F 00]

AfromD commented 6 years ago

I am still on the binding version of the 9th of January. Windows updating always goes wrong, I need to do clean installs, so I am a bit slow updating versions. I made a debug log, with a rule switching an outlet on the top of each minute. You will see successful switching at 19:59:00 and 20:00:00, and failing ones at 20:01:00 and 20:02:00. That should narrow it down quite a bit. Neues Textdokument (2).txt

5iver commented 6 years ago

Made it through about 5 hours, but everything stopped again. I updated to 2.3.0.201801141430, so we'll see what it does. But never had these issues on 2.3.0.201801092213...

011418 zigbee unresponsive.zip

cdjackson commented 6 years ago

On 14 Jan 2018, at 21:42, Scott Rushworth notifications@github.com wrote:

But never had these issues on 2.3.0.201801092213..

I guess you got lucky ;)

cdjackson commented 6 years ago

@openhab-5iver can you try the following version -: org.openhab.binding.zigbee_2.2.0.201801142304.jar. It's in my normal download folder where you download other test versions.

cdjackson commented 6 years ago

@AfromD and @openhab-5iver how many devices do you have in your networks?

5iver commented 6 years ago

I have 25 GE Link bulbs (4 are not powered on and sitting on a shelf), 1 Centralite outlet, and one Centralite keypad which doesn't have the batteries in it. Installing test binding...

5iver commented 6 years ago

About 4 hours later on the test binding and devices are still responding. Only issue I've seen is that one of the bulbs did not come online.

AfromD commented 6 years ago

I have a Hue lightstrip, Osram outlet, Hue white&color bulb and Hue motion sensor that are always online. On switched outlets (while unreliable) I have two more Hue white&color and two Trust zigbee color lamps. So these come and go on the network Still fighting with three ST motion sensors, that might be another bug report later on.

cdjackson commented 6 years ago

I suspect that this issue is not related to Ember - I think it's probably related to large numbers of devices, and maybe if there are devices that are powered down, this will likely make things worse as the system will be trying to contact them when they are not online.

I'm going to change the title of this to remove Ember - I'll keep an open mind just in case it is still Ember specific, but I don't think we can draw that conclusion from what I've seen in the logs.

For information/interest, yesterday I wrote a kind of stress test which sent a message from 4 threads every 60 to 150ms (different times in each thread to stress any concurrency issues that might be lurking) and this ran for ages without any issue with the Ember driver.

The change that I made to the version @openhab-5iver tested was to queuing in the higher layers - I reduced the number of threads that are active for the mesh update which will have the effect of reducing the number of queued frames. I might change the way this works, or reduce it further as a temporary measure since it seems to have adverse effects in larger networks.

5iver commented 6 years ago

Now about 14 hours on the test version and devices are still responding. Only thing I've noticed is that response time is a few seconds, like I'm using a cloud service rather than local. And one bulb is still not online. I don't know if you had done anything to logging, but there's about 500MB every 5 hours!

AfromD commented 6 years ago

With the 1801211741 binding, now running for 14 hours and still working.

AfromD commented 6 years ago

24 hours ok now. Need to restart for another reason though, and will be playing around with OH next few days, so will take a few days before I will get longer runs.

5iver commented 6 years ago

OH 2.3.0 snapshot 1197 Zigbee 2.3.0.201801252322

Been running fine for a couple days, but just noticed devices not responding. Was not in debug when it started, but here are some logs while trying to control some bulbs after I noticed it.

012818 zigbee nonresponsive.log

cdjackson commented 6 years ago

Not too much in the log to show what happened. The low level ember communications is still working fine, but there are no commands being transmitted from the higher layers. I've seen this in some other logs - not sure why this is happening at the moment.

AfromD commented 6 years ago

It has happened again for me as well. I'll try to get a log of when it happens.

AfromD commented 6 years ago

I have a fancy new logviewer (logexpert). I noticed that at some point the TX CMDs simply stop. I think this is where the devices can no longer be controlled. Here it happens around 2:11:03 zigbee log.txt BTW Version 2.3.0.201802052244

cdjackson commented 6 years ago

I noticed that at some point the TX CMDs simply stop.

Yes, this is as we've seen before in the logs but there's no indication why. Lower layers are still working ok...

AfromD commented 6 years ago

Thanks for the feedback on the forum. Lets continue discussion about this particular problem here. One peculiarity is that a while ago, when the issue was some queue filling up, I was able to get things working again by restarting the binding. This is no longer so. After TX stops, and I try to restart the binding, everything will just go offline. Only a restart of OH will fix the problem. I just tried restarting the zigbee binding while functioning, and that doesn't cause any problem. That makes me think the problem is somewhere higher up in OH. It is like some of the mechanisms the binding uses stop working, and a reset of the binding doesn't fix that. (It seems like it even prevents the binding from restarting/disconnecting it from OH) I have experimented with resetting nrjavaserial, but that doesn't help. My next idea would be to go through resetting the bundles listed under bundle:requirements until I hit one that makes things work again.

cdjackson commented 6 years ago

Are you now using the latest test binding I posted?

AfromD commented 6 years ago

I am now on the 2.2 binding you posted in the forum. So far working for 9 hours straight. Doesn't see my Hue motion sensor.

AfromD commented 6 years ago

This problem seemed fixed with the 2.2.0.201802111711 version. On version 2.2.0.201802192145 it seems to have returned. Problem appeared after approx 14 hours runtime. There are no TX commands any more after 06:28:58, although commands are still getting through to the binding, and incoming updates received from the motion sensor are still working.

Dongle and devices all show as online. An attempted restart of nrjavaserial and the zigbee binding via karaf lead to devices showing as offline and the dongle as unknown. On a full restart, with version 2.2.0.201802111711, stopping OH as a whole took mere seconds. Previously, and also with the current version, it takes about a minute to stop OH. It seems some component doesn't stop properly any more.

When the binding is still functioning, the restart of the binding is without issue, and also OH stops without hanging

New Text Document (2).txt

AfromD commented 6 years ago

Shall I maybe go back to version 2.2.0.201802111711, and see whether the problem was really fixed? I did a lot of restarting due to trying to get my motion sensor to work, so the runs either weren't very long, or were while is was away and the network probably was pretty quiet. I am currently experimenting with some rules, which did send lots of commands yesterday, so if there is some memory leak, array out of bounds etc, it might have gone quicker because of that. So there is a chance that 2.2.0.201802111711 has the same behaviour, but that it just didn't do enough work to provoke it.

cdjackson commented 6 years ago

I don't believe that the issue will be fixed in the previous version. As said somewhere previously, I believe this issue is related to loading. I need to implement a transaction management layer in the library to manage this.

There is very little change between the two versions you quote. It fixes some issues with Ember retries and some persistence issues, but it really won't have any significant impact on this issue.

AfromD commented 6 years ago

With the 2.3.0.201802261341 binding, this has been running quite stable. Running for days now. Had it running with a light animation rule, which fired off around 40 commands per minute for an hour, and afterwards everything was still ok. So for now, this is running stable enough, and I am confident I don't need regular restarts any more. I think the issue can be closed.

cdjackson commented 6 years ago

Ok, thanks. I'll close this. I think there are still some related issues, but there are a couple of similar issues open as well...

openhab / org.openhab.binding.zigbee

Unable to control devices after 30-50 minutes #86