Closed by kennethlimcp 9 years ago
Okay, I'm suspecting a handshake race condition, with the core sending the send_time_request too soon during/after the handshake: https://github.com/spark/core-firmware/blob/9b01d795ecaaf657399dc4888f30f3898d91d74a/src/spark_utilities.cpp#L544
I'm not able to reproduce the counter race condition here, can someone who can reproduce the issue test this theory?
The issue is very persistent for me, even with the Tinker firmware. I posted a video in that thread, showing the behaviour.
What do I need to do to test your theory? Insert a delay somewhere? I (blindly) tried inserting 10, 100 and 1000 ms delays before and after the call to `send_time_request()`. I also tried various delays (using for loops) at each stage of `SparkProtocol::handshake()` itself. None of it seemed to help at all.
(That process did at least confirm that handshake is returning with `err=0`, as expected. Shrug.)
EDIT: I've since also tried various delays in the server-side protocol state machine (`spark_protocol.cpp`). Nothing seems to help. I'm guessing therefore that it's not a race condition ... though I too cannot see what else it could be. Frustrating, this one :-/
Thanks.
P.S.: I have a JTAG debugger, but I have no idea how to debug code outside the core-firmware folder, like in `core-communication/`, for example. :-(
Just for fun, I tried an old, presumably slower and certainly different WiFi access point. No change.
I then tried connecting the spark-server host via WiFi as well, instead of Gb Ethernet, to introduce more latency. Still no change.
The problem is still very, very persistent for me, even with stock standard Tinker firmware, even with introduced delays in the code at both client and server, and even with different local network configurations and latencies. I see no changes whatsoever in system behaviour throughout all of that. Remember that I do get a good connection, eventually. So my configuration (keys, etc.) must surely be sound.
Wow. Is this bizarre or what? :-/
Hmm. Well, the counter error in the server logs could indicate a race condition, but if you're not seeing that, then it's less clear what the issue might be. I think I'll modify the spark-protocol module to include some optional extremely verbose logging, and then you can switch that on and we can see exactly what messages are being exchanged. :)
@gruvin,
It's probably still low latency :p I helped a guy troubleshoot a few weeks back with spark-server hosted on a Mac platform he rented for a month.
I never had that issue with user code not running at all ;)
@kennethlimcp .... well, maybe. But I've inserted delays all over the place in the code on both ends of the link and I cannot get anything to change (except when I make the delay too long and get timeouts, obviously.)
@dmiddlecamp ... I've already done some of that, starting with uncommenting the existing debug log messages. As far as I can see (given that I'm only judging what is correct by reading the code, which is somehow faulty), all the correct messages are actually being sent, and in the correct order. Nevertheless, whilst the server is happy that the handshake completed successfully, the 'core is hard faulting straight afterwards.
Low network latency doesn't really explain why a connection is finally established without error after about 3 minutes 30 s of 'core self-reboots, almost every time, as far as I have witnessed so far.
@dmiddlecamp ... So, what more can I do to debug this end, since the bug is ultra persistent for me? I have a JTAG debugger and a fair understanding of C++ and JS. (For the record, I'm a 43 year old, self-taught electronics and computer hobbyist since 1978. I soldered my first Z80 based computer together in 1987, age 17. Had to build my own EPROM programmer and all that good stuff. Alas, no formal training and feeling it! LOL)
If you wanted, I could set up a remote SSH shell for you, using Skype or something to watch the core. I could even hook up a USB interface to allow you to remotely hard reset the core, put it in DFU mode, etc, if you want to go that far. (Yes, I'm serious.)
EDIT: I've now done all that. An ssh account on my system exists and a Sparkcore is ready for remote reset/dfu mode control, using a couple of shell scripts. Build environment is installed and ready for you. PM me for login and usage details, if/when you're keen. Skype is set to auto-answer with video (of my Sparkcore).
nice, thanks! :) I'll try to see if we can get someone to remote in and check it out. Thanks again.
I have this problem too. I've found that restarting the local server encourages the spark to connect. Once connected it seems stable.
Hey guys! It seems like the latest spark-tinker in the spark-cli would create the SOS behavior. Can someone flash it and verify?
These two tinker files should work well with the local cloud:
https://github.com/spark/core-firmware/releases/tag/spark_3 and flash the core-firmware.bin in the build folder and observe what happens.
Also, try this: https://github.com/spark/spark-cli/blob/10f65d1eba553df807457a5b90388704ea14b2bf/js/binaries/spark_tinker.bin
If this is indeed true, we can now focus on reviewing which new additions of code from firmware version 0.20 onwards are causing issues.
The core-firmware.bin from the spark_3 worked first time on a local cloud :)
Binaries in spark_4, 5 and 6 worked fine too. spark_7 has problems.
```
git checkout -f spark_XXX
(cd build/; sudo dfu-util -d 1d50:607f -a 0 -s 0x08005000:leave -D core-firmware.bin)
```
I tried to bisect between 6 and 7, but it's quite hard to know where to check out core-common, I'm getting build failures, and it's gone midnight...
It's really weird that the firmware has some effect on the connection. We should really turn on debug and see the logs for what's going on...
I have also tested binaries from spark_4/5/6, which were fine. The one from spark_7 is not.
NOTE HOWEVER that if I locally compile from spark_3/4/5/6, the resulting binary is NOT OK. We have also just had confirmation that the build tools at Spark were recently updated (between spark_6 and spark_7) -- most likely to the same version I have, which I believe to be the latest available. @Dave (community forums handle) is currently working to resolve the cause, while others try to provide clues.
My compiler version: `gcc version 4.8.3 20131129 (release) [ARM/embedded-4_8-branch revision 205641] (GNU Tools for ARM Embedded Processors)`
There's a thread here with (a tonne) more info. But it's quite noisy over there, with a lot of observational data and guessing going on (mostly by me. ;-)
@kennethlimcp ... Thanks to that other post of yours, I think I may be able to get debug mode working now. Will try to work on that later today.
@gruvin, the Debug bug fix should be in spark_7 already. You can attempt to compile locally for spark_7 and watch the debug log perhaps?
@kennethlimcp ... yup. That worked. Hasn't shown us much, except perhaps that an unexpected `SOCKET_STATUS_INACTIVE` occurs before the hard fault.
I tried inserting calls to `DEBUG(...)` in `core-communication-lib/src/spark_protocol.cpp`. But the DEBUG macro is not available there and I don't know how to make it so. (I tried.) :-(
Here's the DEBUG output I do see, anyway ...
```
0000000001:<DEBUG> int main() (159):Hello from Spark!
0000003944:<DEBUG> int Spark_Connect() (750):sparkSocket Now =-1
0000003945:<DEBUG> int Spark_Disconnect() (822):
0000003998:<DEBUG> set_socket_active_status (838):Sd=0, Status SOCKET_STATUS_ACTIVE
0000004086:<DEBUG> int Spark_Connect() (757):socketed sparkSocket=0
0000004158:<DEBUG> int Spark_Connect() (812):connect
0000004224:<DEBUG> int Spark_Connect() (814):connected connect=0
0000004286:<DEBUG> int Spark_Receive(unsigned char*, int) (490):bytes_received 40
0000004465:<DEBUG> int Spark_Receive(unsigned char*, int) (490):bytes_received 384
0000004840:<DEBUG> set_socket_active_status (838):Sd=1, Status SOCKET_STATUS_ACTIVE
0000004863:<DEBUG> set_socket_active_status (838):Sd=1, Status SOCKET_STATUS_INACTIVE
0000004951:<PANIC> void HardFault_Handler() (88):HardFault
```
This is from tag `spark_8` (latest master, stock Tinker firmware), built using:

```
make DEBUG_BUILD=y
```
The corresponding server log screen shows ...

```
Connection from: 192.168.1.136, connId: 112
on ready { coreID: '53ff6d065067544853360587',
  ip: '192.168.1.136',
  product_id: 65535,
  firmware_version: 65535,
  cache_key: undefined }
Core online!
```
I am trying to get some DEBUG output from a successful 'core connect. In DEBUG mode, the firmware does not auto-reboot after a hard fault, so it's proving tedious. I'll keep trying for a while and let you know here if I succeed.
hmm... I'm not at all an expert in this :D but to me it sounds like the socket got closed too early? I have no clue..
@gruvin, can you format the debug output above?
Got it, finally. Here's the DEBUG output from a successful connect. Clearly, things go a lot further, though apparently not in quite the straight line expected ...
```
0000000001:<DEBUG> int main() (159):Hello from Spark!
0000002940:<DEBUG> int Spark_Connect() (750):sparkSocket Now =-1
0000002941:<DEBUG> int Spark_Disconnect() (822):
0000002994:<DEBUG> set_socket_active_status (838):Sd=0, Status SOCKET_STATUS_ACTIVE
0000003082:<DEBUG> int Spark_Connect() (757):socketed sparkSocket=0
0000003154:<DEBUG> int Spark_Connect() (812):connect
0000011210:<ERROR> hci_event_handler (258):Timeout now 11210 start 3210 elapsed 8000 cc3000__event_timeout_ms 8000
0000011263:<ERROR> hci_event_handler (259):Timeout waiting on tSLInformation.usRxEventOpcode 0x1007
0000011368:<DEBUG> int Spark_Connect() (814):connected connect=-1
0000011438:<DEBUG> int Internet_Test() (703):socket
0000011493:<DEBUG> set_socket_active_status (838):Sd=1, Status SOCKET_STATUS_ACTIVE
0000011582:<DEBUG> int Internet_Test() (705):socketed testSocket=1
0000011652:<DEBUG> int Internet_Test() (727):connect
0000011758:<DEBUG> int Internet_Test() (729):connected testResult=0
0000011780:<DEBUG> int Internet_Test() (739):Close
0000011880:<DEBUG> set_socket_active_status (838):Sd=1, Status SOCKET_STATUS_INACTIVE
0000011925:<DEBUG> int Internet_Test() (741):Closed rv=0
0000015006:<DEBUG> int Spark_Connect() (750):sparkSocket Now =0
0000015007:<DEBUG> int Spark_Disconnect() (822):
0000015058:<DEBUG> int Spark_Disconnect() (831):Close
0000015116:<DEBUG> set_socket_active_status (838):Sd=0, Status SOCKET_STATUS_INACTIVE
0000015206:<DEBUG> int Spark_Disconnect() (833):Closed retVal=0
0000015274:<DEBUG> set_socket_active_status (838):Sd=0, Status SOCKET_STATUS_ACTIVE
0000015363:<DEBUG> int Spark_Connect() (757):socketed sparkSocket=0
0000015435:<DEBUG> int Spark_Connect() (812):connect
0000015494:<DEBUG> int Spark_Connect() (814):connected connect=0
0000015562:<DEBUG> int Spark_Receive(unsigned char*, int) (490):bytes_received 40
0000015741:<DEBUG> int Spark_Receive(unsigned char*, int) (490):bytes_received 384
0000016116:<DEBUG> set_socket_active_status (838):Sd=1, Status SOCKET_STATUS_ACTIVE
0000016138:<DEBUG> set_socket_active_status (838):Sd=1, Status SOCKET_STATUS_INACTIVE
0000016230:<DEBUG> int Spark_Receive(unsigned char*, int) (490):bytes_received 2
0000016313:<DEBUG> int Spark_Receive(unsigned char*, int) (490):bytes_received 16
0000016401:<DEBUG> int Spark_Receive(unsigned char*, int) (490):bytes_received 2
0000016484:<DEBUG> int Spark_Receive(unsigned char*, int) (490):bytes_received 16
```
To a beginner like me, it looks like an improperly closed socket... @technobly, jump in! :)
It's odd that the working connection is the one that had the most reported problems getting there, starting with ...

```
0000011210:<ERROR> hci_event_handler (258):Timeout now 11210 start 3210 elapsed 8000 cc3000__event_timeout_ms 8000
0000011263:<ERROR> hci_event_handler (259):Timeout waiting on tSLInformation.usRxEventOpcode 0x1007
```
And the final two lines in the working DEBUG output, just before the hard fault in the non-working one, are identical:

```
0000016116:<DEBUG> set_socket_active_status (838):Sd=1, Status SOCKET_STATUS_ACTIVE
0000016138:<DEBUG> set_socket_active_status (838):Sd=1, Status SOCKET_STATUS_INACTIVE
```
Beats the coffee beans out of me. Seems to be either a weird race condition or something screwy with the new version of the compiler. A curly one, by any measure.
I just found a race condition with current versions of the core firmware that is exacerbated on the local server, and I believe this server-side workaround should help significantly while we're waiting on the firmware fix to roll out to everyone. I'm hoping this will fix this issue as well. I'll test it some more and then push to npm when it's ready. Please feel free to grab the master branch of spark-protocol in the meantime if you'd like to help test. :)
Thanks! David
@dmiddlecamp, I might need you to describe in more detail the race condition that you found and patched with https://github.com/spark/spark-protocol/commit/d615ae1eff4d3ea23c0420cee048ef74147d3743.
I'm not sure how to test for this race condition, so more information would be required.
As of now, I have tested that if this fix (https://github.com/spark/core-firmware/commit/5e18860a47372fe81a433ef4847eced90c1e4d80) is not used, we still see SOS flashes even with the new patch you added.
Hmm, it's possible that the core firmware is simply sending too much too fast during the startup / handshake procedure with the addition of the UDP multicasts, and that's causing it to crash.
The spark-protocol patch fixes a race condition that caused premature messages from the core to not be handled in time on the server, which caused the session with that core to be dropped.
I only saw the `BOOM early data` error once out of the few connections during testing. Probably the Spark Cloud has tons of users while I am testing on just one, so it might not be obvious.
Maybe the broadcast is a separate issue from this but it's still hard to tell.
Let me know how i can help further!
I haven't been able to reproduce this issue with current versions of firmware and spark-server. If anyone is still having trouble with this please feel free to reopen the issue.
It seems that, besides the tinker app, running other firmware results in SOS panic red flashes. Looks like 2 flashes: non-maskable interrupt fault.
Flashing via OTA or DFU resulted in the same behavior.
I will load the code in via DFU, test on the LC, then switch to the Spark Cloud and see what the result is like.