tlaukkan / zigbee4java

Zigbee API for Java provides simple Java interface to ZigBee network.
Apache License 2.0
144 stars 68 forks source link

Failure to reset dongle #37

Closed cdjackson closed 9 years ago

cdjackson commented 9 years ago

I'm occasionally seeing errors when the binding starts where it can't initialise the dongle. I've not looked into this in detail yet but it seems that the dongle doesn't respond to the reset request within the 5 seconds.

I'll look at this further, but just raising an issue in case anyone else has this problem...

cdjackson commented 9 years ago

So, interestingly, it seems the dongle can take quite a while to initialise after it's first plugged in! By this, I mean 1 minute!

ZigBee API starting up...18:51:33 162  TRACE  Using RESEND_TIMEOUT set as DEFAULT 1000
18:51:33 163  TRACE  Using TIMEOUT set as DEFAULT 5000ms
18:51:33 163  TRACE  Using STARTUP_TIMEOUT set as DEFAULT 5000ms
18:51:33 163  TRACE  Using RESEND_MAX_RETRY set as DEFAULT 3
18:51:33 163  TRACE  Using RESEND_MAX_RETRY set as DEFAULT 3
18:51:33 264  TRACE  Initializing hardware.
18:51:33 264  TRACE  CREATED -> HARDWARE_INITIALIZING
18:51:33 277  DEBUG  Opening portIdentifier portIdentifier '/dev/cu.usbmodem1421'.
18:51:33 280  TRACE  Creating ZToolPacketParser
18:51:33 280  TRACE  Provided InputStream class j.extensions.comm.SerialComm$SerialCommInputStream doesn't provide the mark()/reset() feature, wrapping it up as BufferedInputStream
18:51:33 280  TRACE  ZToolPacketParser parserThread started
18:51:33 280  TRACE  Waiting for asynchronous response message 16768.
18:51:33 284  TRACE  The value 254 (fe) is considered a byte because only the 8 least significant bits are set, but its value is outside signed byte that is between -128 and 127
18:51:33 284  DEBUG  -> SYS_RESET (Packet: length = 1, apiId = 0x41 0x00, full data = 0xfe 0x01 0x41 0x00 0x01 0x41, checksum = 0x41, error = false, errorMessage = null) 
18:51:38 287  TRACE  Timeout 5000 expired and no packet with 16768 received
18:51:38 287  ERROR  Unable to reset dongle
18:51:38 287  TRACE  HARDWARE_INITIALIZING -> CLOSED

ZigBee API starting up ... [FAIL]
18:52:28 455  TRACE  Read Length  byte, val is base10=6,base16=0x06,base2=00000110
18:52:28 456  TRACE  Read API PROFILE_ID_HOME_AUTOMATION MSB  byte, val is base10=65,base16=0x41,base2=01000001
18:52:28 456  TRACE  The value 128 (80) is considered a byte because only the 8 least significant bits are set, but its value is outside signed byte that is between -128 and 127
18:52:28 456  TRACE  Read API PROFILE_ID_HOME_AUTOMATION LSB  byte, val is base10=128,base16=0x80,base2=10000000
18:52:28 456  TRACE  Read Remaining bytes 6  byte, val is base10=0,base16=0x00,base2=00000000
18:52:28 456  TRACE  Read Remaining bytes 5  byte, val is base10=2,base16=0x02,base2=00000010
18:52:28 456  TRACE  Read Remaining bytes 4  byte, val is base10=0,base16=0x00,base2=00000000
18:52:28 457  TRACE  Read Remaining bytes 3  byte, val is base10=2,base16=0x02,base2=00000010
18:52:28 457  TRACE  Read Remaining bytes 2  byte, val is base10=6,base16=0x06,base2=00000110
18:52:28 457  TRACE  Read Remaining bytes 1  byte, val is base10=3,base16=0x03,base2=00000011
18:52:28 457  TRACE  The value 194 (c2) is considered a byte because only the 8 least significant bits are set, but its value is outside signed byte that is between -128 and 127
18:52:28 457  TRACE  Read Checksum  byte, val is base10=194,base16=0xc2,base2=11000010
18:52:28 457  TRACE  The value 128 (80) is considered a byte because only the 8 least significant bits are set, but its value is outside signed byte that is between -128 and 127
18:52:28 457  TRACE  The value 254 (fe) is considered a byte because only the 8 least significant bits are set, but its value is outside signed byte that is between -128 and 127
18:52:28 457  TRACE  The value 128 (80) is considered a byte because only the 8 least significant bits are set, but its value is outside signed byte that is between -128 and 127
18:52:28 457  TRACE  The value 194 (c2) is considered a byte because only the 8 least significant bits are set, but its value is outside signed byte that is between -128 and 127
18:52:28 457  TRACE  The value 194 (c2) is considered a byte because only the 8 least significant bits are set, but its value is outside signed byte that is between -128 and 127
18:52:28 457  TRACE  Response is class org.bubblecloud.zigbee.network.packet.system.SYS_RESET_RESPONSE -> Packet: length = 6, apiId = 0x41 0x80, full data = 0xfe 0x06 0x41 0x80 0x00 0x02 0x00 0x02 0x06 0x03 0xc2, checksum = 0xc2, error = false, errorMessage = null
18:52:28 457  TRACE  The value 128 (80) is considered a byte because only the 8 least significant bits are set, but its value is outside signed byte that is between -128 and 127
18:52:28 458  TRACE  The value 254 (fe) is considered a byte because only the 8 least significant bits are set, but its value is outside signed byte that is between -128 and 127
18:52:28 458  TRACE  The value 128 (80) is considered a byte because only the 8 least significant bits are set, but its value is outside signed byte that is between -128 and 127
18:52:28 458  TRACE  The value 194 (c2) is considered a byte because only the 8 least significant bits are set, but its value is outside signed byte that is between -128 and 127
18:52:28 458  TRACE  The value 194 (c2) is considered a byte because only the 8 least significant bits are set, but its value is outside signed byte that is between -128 and 127
18:52:28 458  DEBUG  <-- SYS_RESET_RESPONSE (Packet: length = 6, apiId = 0x41 0x80, full data = 0xfe 0x06 0x41 0x80 0x00 0x02 0x00 0x02 0x06 0x03 0xc2, checksum = 0xc2, error = false, errorMessage = null)

This is 50 seconds to start, and it seems reproducible if the stick is plugged in and then the app started immediately, then it takes a minute or so to respond. At this point the LED goes out if there is no communication, and it comes back on when the API resets it.

After the first initialisation, it doesn't have this delay.

This is with the official TI dongle - I assume it does the same thing with other stick using the 2531, but thought I'd mention it...

I think the stack needs to be able to handle this case...

andreacioni commented 9 years ago

Hi Chris, I've have your same problem at the first startup of the USB dongle, if you solve it do you think to do a release with this fix? Thanks

presslab-us commented 9 years ago

This is a strange problem. According to API documentation a SYS_RESET_REQ works as follows: This command will reset the device by using a hardware reset (i.e. watchdog reset) if ‘Type’ is zero. Otherwise a soft reset (i.e. a jump to the reset vector) vice is effected.

Type is set to 1 in zigbee4java, so it should jump to the reset vector on every startup of zigbee4java. So apparently a soft reset does not cause the problem because it only times out after the dongle is first plugged in. I would be curious to hear if setting Type to 0 causes the problem on every zigbee4java startup. It makes me wonder if the problem is hardware related; maybe the oscillator is taking a long time to start up.

cdjackson commented 9 years ago

Yes, but I see that on first start, the dongle takes 60 seconds before it will respond to a command after it is first plugged in - this is repeatable (although I've not tried sending any other command admittedly). If I just plug the dongle in but not start the stack, the green LED on the dongle stays on for the 60 seconds, then goes out - after this, it's good to go and it will respond immediately. With the default implementation (ie 5 second timeout) you see that there is a response to the reset after the 60 second timeout - so I guess it's been sitting in the dongles buffer waiting for something.

If no-one objects, then I propose to merge the fix to this since it can't do any harm - it's just increasing the delay....

andreacioni commented 9 years ago

We have temporary solved just by adding a simple retry logic that loop 30 times, every 5 seconds, over the startup method of zigbee api

cdjackson commented 9 years ago

Ok - there's also my proposed fix in the PR referenced above that you can try...

presslab-us commented 9 years ago

Yeah I don't see any problem increasing the timeout in the code. I am just wondering why the dongle behaves like that to begin with, as the soft reset (that happens every zigbee4java startup) does not cause the 60 second wait problem. I don't see this wait problem on my setup, and I am using the official TI firmware. However my hardware is not official, it's the WeBee dongle which is a clone of the TI one. That's why I was suspecting the hardware... I have seen marginal designs with crystal oscillators that just barely start up.

What OS are you using? Maybe it is something in the driver stack. I am using Debian 8 and I see maybe 5 seconds delay before it responds.

cdjackson commented 9 years ago

I’m using the official TI software and hardware. This happens on all my machines - Linux (Debian), Mac OS X and BeagleBone - running Java 6 and 7 across the different systems.

As above - I don’t think it’s the fact that the stick isn’t responding to the soft reset - I think the stick doesn’t respond to anything for 60 seconds after it’s powered on - at least that’s my observation. If I wait the 60 seconds before starting the stack, then the dongle responds immediately to the soft reset, so it’s not a simple case of the soft reset causing the delay.

tlaukkan commented 9 years ago

How many devices you have? Lib makes full network discovery on first startup or if saved state is not available before console starts to accept commands.

Br, Tommi On May 22, 2015 7:16 PM, "presslab-us" notifications@github.com wrote:

Yeah I don't see any problem increasing the timeout in the code. I am just wondering why the dongle behaves like that to begin with, as the soft reset (that happens every zigbee4java startup) does not cause the 60 second wait problem. I don't see this wait problem on my setup, and I am using the official TI firmware. However my hardware is not official, it's the WeBee dongle which is a clone of the TI one. That's why I was suspecting the hardware... I have seen marginal designs with crystal oscillators that just barely start up.

What OS are you using? Maybe it is something in the driver stack. I am using Debian 8 and I see maybe 5 seconds delay before it responds.

— Reply to this email directly or view it on GitHub https://github.com/tlaukkan/zigbee4java/issues/37#issuecomment-104702234 .

cdjackson commented 9 years ago

I’ve got 2 devices at the moment, but this is ‘hanging’ before the library does any discovery. The reset is the first packet that’s sent and it doesn’t respond until the 60 seconds is up - then everything works fine (and quickly).

cdjackson commented 9 years ago

I should also add that I know others have the same problem - some guys I’m in contact with in the in US have the same issue and adding the extra delay (as per my PR) solved it...

tlaukkan commented 9 years ago

Hi

Ok. Has it been verified on the serial interface level that the problem is on the stick side? I.e. after first packet sent the lib is properly reading the serial port and after 60 secs first byte arrives from port? If so it is either bug on the hardware / software of the stick or intended behaviour when setting up new network to ensure that no one else is on the channel in sleeping state or something similar.

Br, Tommi On May 22, 2015 7:29 PM, "Chris Jackson" notifications@github.com wrote:

I should also add that I know others have the same problem - some guys I’m in contact with in the in US have the same issue and adding the extra delay (as per my PR) solved it...

— Reply to this email directly or view it on GitHub https://github.com/tlaukkan/zigbee4java/issues/37#issuecomment-104704872 .

cdjackson commented 9 years ago

Yes - as much as I can verify it anyway…

There’s an LED on the stick - if I plug in the stick, it is not possible to talk to the stick until this LED goes out (which takes exactly 60 seconds). I can’t find anything in the doc about this, but maybe the stick is doing some sort of scan or something, but it’s not associated with a new network since this happens every time it’s plugged in - and the network is well established.

tlaukkan commented 9 years ago

Might need to drop this question to TI forum unless someone is keen on digging into the stack source.

Br, Tommi

cdjackson commented 9 years ago

Good idea - I’ve not looked there previously… I’ll see if I can find somewhere to post as it would be nice to understand it...

Is there any harm in merging the PR? I don’t see any issue since if people don’t want it then the value can be set to 5 seconds… This PR just splits out the timeout on startup so that it can be a different value than all other timeouts...

tlaukkan commented 9 years ago

Hi, I commented on the PR just a minute ago that merge is go from my side.

cdjackson commented 9 years ago

Ok, this is at least partly understood now....

This has been discussed on the TI forum -: https://e2e.ti.com/support/wireless_connectivity/f/158/t/160948

I found the solution. The serial boot loader is waiting for 60 seconds before jumping to the ZNP application image, as documented in Serial Boot Loader document. Sending a magic byte 0xef can force the sbl to skip the waiting. However, the version in Home-1.2.1 changed the magic byte from 0x10 to 0xef, and the document did not update this. That's why I could not get it work before. Also the response to the magic byte has been changed from 0x00 to SYS_RESET_IND. again no document update.

I just looked in the bootloader doc for the 2530, and that says the magic byte is 0x07 - the 2531 bootloader... http://processors.wiki.ti.com/index.php/CC253x_Serial_Boot_Loader

I suspect that some of the 'non TI' sticks will implement this differently. so that's probably why @presslab-us doesn't see the issue (maybe it's a 5 second delay in that version).

I'll merge the PR - we might want to look at another way to speed this up, but as it seems to be dependant on the stick firmware it might not be 100% possible...

cdjackson commented 9 years ago

This issue is understood, and with the latest update should allow up to 60 seconds for a stick to initialise.

presslab-us commented 9 years ago

I guess jumping to the reset vector doesn't rerun the bootloader. On my device I have flashed everything (including the TI bootloader) so I'm not sure why it's acting differently. Specifically I am using the CC2531ZNP-Pro-Secure_LinkKeyJoin.hex firmware from Z-Stack home 1.2.2.

Maybe the best solution is to send the magic bytes (maybe all variants) to kick it out of the bootloader? Would this have any downside to devices that don't need it? It just seems like waiting for 60 seconds would be a pain.

cdjackson commented 9 years ago

I guess jumping to the reset vector doesn't rerun the bootloader. On my device I have flashed everything (including the TI bootloader) so I'm not sure why it's acting differently. Specifically I am using the CC2531ZNP-Pro-Secure_LinkKeyJoin.hex firmware.

I’m also using this firmware - but of course that’s not the boot loader…

Maybe the best solution is to send the magic bytes (maybe all variants) to kick it out of the bootloader? Would this have any downside to devices that don't need it? It just seems like waiting for 60 seconds would be a pain.

I am considering this - but ‘all variants’ can’t protect against any changes in future (TI seem to change it periodically for some stupid reason!). So I think it’s impossible to protect against 100%, but it doesn’t matter too much with the timeout fix I’m implemented. The 60 second wait is only after the device is powered on so it’s not actually a problem in most instances.

presslab-us commented 9 years ago

CC2531ZNP-Pro-Secure_LinkKeyJoin.hex containes the bootloader, look in the HEX file. You will see the bootloader at 0x0-0x1FF. I have compiled the bootloader Projects\zstack\Utilities\BootLoad\CC2531 and the HEX output is identical.

presslab-us commented 9 years ago

Oh and I'm not suggesting to change the increased timeout, I see no real downside to that. Just that adding the magic bytes could fix the problem rather than work around it.

cdjackson commented 9 years ago

Just that adding the magic bytes could fix the problem rather than work around it.

Yep - definitely agree. I’ll take a look at that when I get a chance, but I guess somewhere different implementations might use different values… I’m sure I didn’t have this problem with the older version of the software…

I actually think 1 minute is WAY to long - I don’t know what TI were thinking :)

cdjackson commented 9 years ago

0xef seems to be the magic number on my dongle... Sending this kicks straight through the delay...