particle-iot / spark-server

UNMAINTAINED - An API compatible open source server for interacting with devices speaking the spark-protocol
https://www.particle.io/
GNU Affero General Public License v3.0

red flashes with user firmware #18

Closed kennethlimcp closed 9 years ago

kennethlimcp commented 10 years ago

It seems that, apart from the Tinker app, running any other firmware results in SOS panic red flashes.

Looks like 2 flashes: non-maskable interrupt fault.

Flashing via OTA or DFU resulted in the same behavior.

I will load the code via DFU, test on the local cloud (LC), then switch to the Spark Cloud and see what the result looks like.

dmiddlecamp commented 10 years ago

Okay, I'm suspecting a handshake race condition if the core sends the send_time_request too soon during / right after the handshake - https://github.com/spark/core-firmware/blob/9b01d795ecaaf657399dc4888f30f3898d91d74a/src/spark_utilities.cpp#L544

dmiddlecamp commented 10 years ago

I'm not able to reproduce the counter race condition here; can someone who can reproduce the issue test this theory?

gruvin commented 9 years ago

The issue is very persistent for me, even with the Tinker firmware. I posted a video in that thread, showing the behaviour.

What do I need to do to test your theory? Insert a delay somewhere? I (blindly) tried inserting 10, 100 and 1000ms delays before and after the call to send_time_request(). I also tried various delays (using for loops) at each stage of SparkProtocol::handshake() itself. None of it seemed to help at all.
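
Roughly what I mean by that, as a sketch only (crude_delay_ms is a made-up helper for illustration, not the actual edit I made, and the loop count per millisecond is an uncalibrated guess for a ~72 MHz Cortex-M3):

#include <stdint.h>

// Made-up busy-wait helper, only to illustrate the experiment.
static void crude_delay_ms(uint32_t ms)
{
    for (volatile uint32_t i = 0; i < ms * 12000u; ++i) { }
}

// Hypothetical placement around the call discussed above:
//
//     crude_delay_ms(100);      // tried 10, 100 and 1000 ms here
//     send_time_request();
//     crude_delay_ms(100);      // ...and here as well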

(That process did at least confirm that handshake is returning with err=0, as expected. shrug)

EDIT: I've since also tried various delays in the server-side protocol state machine (spark_protocol.cpp). Nothing seems to help. I'm guessing therefore that it's not a race condition ... though I too cannot see what else it could be. Frustrating, this one :-/

Thanks.

P.S: I have a JTAG debugger, but I have no idea how to debug code outside the core-firmware folder -- like in core-communication/, for ex. :-(

gruvin commented 9 years ago

Just for fun, I tried an old, presumably slower and certainly different WiFi access point. No change.

I then tried connecting the spark-server host via WiFi as well, instead of Gb Ethernet, to introduce more latency. Still no change.

The problem is still very, very persistent for me, even with stock standard Tinker firmware -- even with introduced delays in the code at both client and server, and even with different local network configurations and latencies. I see no change whatsoever in system behaviour throughout all of that. Remember that I do get a good connection, eventually. So my configuration (keys, etc.) must surely be sound.

Wow. Is this bizarre or what? :-/

dmiddlecamp commented 9 years ago

Hmm. Well, the counter error in the server logs could indicate a race condition, but if you're not seeing that, then it's less clear what the issue might be. I think I'll modify the spark-protocol module to include some optional extremely verbose logging, and then you can switch that on and we can see exactly what messages are being exchanged. :)

kennethlimcp commented 9 years ago

@gruvin,

It's probably still low latency :p I helped a guy troubleshoot a few weeks back with spark-server hosted on a Mac platform he rented for a month.

I never had that issue with user code not running at all ;)

gruvin commented 9 years ago

@kennethlimcp .... well, maybe. But I've inserted delays all over the place in the code on both ends of the link and I cannot get anything to change (except when I make the delay too long and get timeouts, obviously.)

@dmiddlecamp ... I've already done some of that, starting with uncommenting the existing debug log messages. As far as I can see (given that I'm only inferring what is correct by reading the code, which is somehow faulty), all the correct messages are actually being sent, and in the correct order. Nevertheless, while the server is happy that the handshake completed successfully, the 'core hard faults straight afterwards.

Low network latency doesn't really explain why a connection is eventually established without error, after about 3 minutes 30 s of 'core self-reboots, almost every time, as far as I have witnessed so far.

gruvin commented 9 years ago

@dmiddlecamp ... So, what more can I do to debug this end, since the bug is ultra persistent for me? I have a JTAG debugger and a fair understanding of C++ and JS. (For the record, I'm a 43-year-old, self-taught electronics and computer hobbyist since 1978. I soldered my first Z80-based computer together in 1987, age 17. Had to build my own EPROM programmer and all that good stuff. Alas, no formal training and feeling it! LOL)

If you wanted, I could set up a remote SSH shell for you, using Skype or something to watch the core. I could even hook up a USB interface to allow you to remotely hard reset the core, put it in DFU mode, etc, if you want to go that far. (Yes, I'm serious.)

EDIT: I've now done all that. An ssh account on my system exists and a Sparkcore is ready for remote reset/dfu mode control, using a couple of shell scripts. Build environment is installed and ready for you. PM me for login and usage details, if/when you're keen. Skype is set to auto-answer with video (of my Sparkcore).

dmiddlecamp commented 9 years ago

nice, thanks! :) I'll try to see if we can get someone to remote in and check it out. Thanks again.

lbt commented 9 years ago

I have this problem too. I've found that restarting the local server encourages the spark to connect. Once connected it seems stable.

kennethlimcp commented 9 years ago

Hey guys! It seems like the latest spark-tinker in the spark-cli creates the SOS behavior. Can someone flash it and verify?

These two Tinker files should work well with the local cloud:

https://github.com/spark/core-firmware/releases/tag/spark_3 - flash the core-firmware.bin in the build folder and observe what happens.

Also, try this: https://github.com/spark/spark-cli/blob/10f65d1eba553df807457a5b90388704ea14b2bf/js/binaries/spark_tinker.bin

If this is indeed true, we can focus on reviewing which new additions of code from firmware version 0.20 onwards are causing issues.

lbt commented 9 years ago

The core-firmware.bin from the spark_3 worked first time on a local cloud :)

lbt commented 9 years ago

Binaries in spark_4, 5 and 6 worked fine too. spark_7 has problems.

git checkout -f spark_XXX
(cd build/; sudo dfu-util -d 1d50:607f -a 0 -s 0x08005000:leave -D core-firmware.bin)

I tried to bisect between 6 and 7, but it's quite hard to know where to check out core-common, I'm getting build failures, and it's gone midnight...

kennethlimcp commented 9 years ago

It's really weird that the firmware has some effect on the connection. We should really turn on debug and see the logs for what's going on...

gruvin commented 9 years ago

I have also tested binaries from spark_4/5/6, which were fine. The one from spark_7 is not.

NOTE HOWEVER that if I locally compile from spark_3/4/5/6, the resulting binary is NOT OK. We have also just had confirmation that the build tools at Spark were recently updated (between spark_6 and spark_7) -- most likely to the same version I have, which I believe to be the latest available. @Dave (community forums handle) is currently working to resolve the cause, while others try to provide clues.

My compiler version: gcc version 4.8.3 20131129 (release) [ARM/embedded-4_8-branch revision 205641] (GNU Tools for ARM Embedded Processors)

There's a thread here with (a tonne) more info. But it's quite noisy over there, with a lot of observational data and guessing going on (mostly by me. ;-)

@kennethlimcp ... Thanks to that other post of yours, I think I may be able to get debug mode working now. Will try to work on that later today.

kennethlimcp commented 9 years ago

@gruvin, the Debug bug fix should be in spark_7 already. You can attempt to compile locally for spark_7 and watch the debug log perhaps?

gruvin commented 9 years ago

@kennethlimcp ... yup. That worked. Hasn't shown us much, except perhaps that an unexpected SOCKET_STATUS_INACTIVE occurs before the hard fault.

I tried inserting calls to DEBUG(...) in core-communication-lib/src/spark_protocol.cpp. But the DEBUG macro is not available there and I don't know how to make it so. (I tried.) :-(
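
What I was hoping to find was something along these lines -- a purely hypothetical shim, since I don't know how the firmware's real debug plumbing is exported into the communication lib; every name below (comm_set_debug_hook, COMM_DEBUG, etc.) is made up for illustration:

#include <stdio.h>

// Hypothetical: the library keeps a function pointer that the application
// (core-firmware, where the real debug output lives) installs at startup.
typedef void (*comm_debug_hook_t)(const char* msg);
static comm_debug_hook_t comm_debug_hook = 0;

void comm_set_debug_hook(comm_debug_hook_t hook) { comm_debug_hook = hook; }

// Hypothetical printf-style macro usable inside spark_protocol.cpp
// (##__VA_ARGS__ is a GCC extension, fine for arm-none-eabi-gcc).
#define COMM_DEBUG(fmt, ...)                                      \
    do {                                                          \
        if (comm_debug_hook) {                                    \
            char _buf[128];                                       \
            snprintf(_buf, sizeof(_buf), fmt, ##__VA_ARGS__);     \
            comm_debug_hook(_buf);                                \
        }                                                         \
    } while (0)

// Usage example: COMM_DEBUG("handshake err=%d", err);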

Here's the DEBUG output I do see, anyway ...

0000000001:<DEBUG> int main() (159):Hello from Spark!
0000003944:<DEBUG> int Spark_Connect() (750):sparkSocket Now =-1
0000003945:<DEBUG> int Spark_Disconnect() (822):
0000003998:<DEBUG> set_socket_active_status (838):Sd=0, Status SOCKET_STATUS_ACTIVE
0000004086:<DEBUG> int Spark_Connect() (757):socketed sparkSocket=0
0000004158:<DEBUG> int Spark_Connect() (812):connect
0000004224:<DEBUG> int Spark_Connect() (814):connected connect=0
0000004286:<DEBUG> int Spark_Receive(unsigned char*, int) (490):bytes_received 40
0000004465:<DEBUG> int Spark_Receive(unsigned char*, int) (490):bytes_received 384
0000004840:<DEBUG> set_socket_active_status (838):Sd=1, Status SOCKET_STATUS_ACTIVE
0000004863:<DEBUG> set_socket_active_status (838):Sd=1, Status SOCKET_STATUS_INACTIVE
0000004951:<PANIC> void HardFault_Handler() (88):HardFault

This is from tag:spark_8 (latest master, stock Tinker firmware), built using ... make DEBUG_BUILD=y

The corresponding server log screen shows ...

Connection from: 192.168.1.136, connId: 112
on ready { coreID: '53ff6d065067544853360587',
  ip: '192.168.1.136',
  product_id: 65535,
  firmware_version: 65535,
  cache_key: undefined }
Core online!

I am trying to get some DEBUG output from a successful 'core connect. In DEBUG mode, the firmware does not auto-reboot after a hard fault, so it's proving tedious. I'll keep trying for a while and let you know here if I succeed.

kennethlimcp commented 9 years ago

Hmm... I'm not at all an expert in this :D but it leads me to:

https://github.com/spark/core-firmware/blob/21a75028092a032bee260b944525908421476c20/src/spark_wlan.cpp#L315

sounds like the socket got closed too early? I have no clue..

@gruvin, can you format the debug output above?

gruvin commented 9 years ago

Got it, finally. Here's the DEBUG output from a successful connect. Clearly, things go a lot further, though apparently not in quite the straight line expected ...

0000000001:<DEBUG> int main() (159):Hello from Spark!
0000002940:<DEBUG> int Spark_Connect() (750):sparkSocket Now =-1
0000002941:<DEBUG> int Spark_Disconnect() (822):
0000002994:<DEBUG> set_socket_active_status (838):Sd=0, Status SOCKET_STATUS_ACTIVE
0000003082:<DEBUG> int Spark_Connect() (757):socketed sparkSocket=0
0000003154:<DEBUG> int Spark_Connect() (812):connect
0000011210:<ERROR> hci_event_handler (258):Timeout now 11210 start 3210 elapsed 8000 cc3000__event_timeout_ms 8000
0000011263:<ERROR> hci_event_handler (259):Timeout waiting on tSLInformation.usRxEventOpcode 0x1007
0000011368:<DEBUG> int Spark_Connect() (814):connected connect=-1
0000011438:<DEBUG> int Internet_Test() (703):socket
0000011493:<DEBUG> set_socket_active_status (838):Sd=1, Status SOCKET_STATUS_ACTIVE
0000011582:<DEBUG> int Internet_Test() (705):socketed testSocket=1
0000011652:<DEBUG> int Internet_Test() (727):connect
0000011758:<DEBUG> int Internet_Test() (729):connected testResult=0
0000011780:<DEBUG> int Internet_Test() (739):Close
0000011880:<DEBUG> set_socket_active_status (838):Sd=1, Status SOCKET_STATUS_INACTIVE
0000011925:<DEBUG> int Internet_Test() (741):Closed rv=0
0000015006:<DEBUG> int Spark_Connect() (750):sparkSocket Now =0
0000015007:<DEBUG> int Spark_Disconnect() (822):
0000015058:<DEBUG> int Spark_Disconnect() (831):Close
0000015116:<DEBUG> set_socket_active_status (838):Sd=0, Status SOCKET_STATUS_INACTIVE
0000015206:<DEBUG> int Spark_Disconnect() (833):Closed retVal=0
0000015274:<DEBUG> set_socket_active_status (838):Sd=0, Status SOCKET_STATUS_ACTIVE
0000015363:<DEBUG> int Spark_Connect() (757):socketed sparkSocket=0
0000015435:<DEBUG> int Spark_Connect() (812):connect
0000015494:<DEBUG> int Spark_Connect() (814):connected connect=0
0000015562:<DEBUG> int Spark_Receive(unsigned char*, int) (490):bytes_received 40
0000015741:<DEBUG> int Spark_Receive(unsigned char*, int) (490):bytes_received 384
0000016116:<DEBUG> set_socket_active_status (838):Sd=1, Status SOCKET_STATUS_ACTIVE
0000016138:<DEBUG> set_socket_active_status (838):Sd=1, Status SOCKET_STATUS_INACTIVE
0000016230:<DEBUG> int Spark_Receive(unsigned char*, int) (490):bytes_received 2 
0000016313:<DEBUG> int Spark_Receive(unsigned char*, int) (490):bytes_received 16
0000016401:<DEBUG> int Spark_Receive(unsigned char*, int) (490):bytes_received 2 
0000016484:<DEBUG> int Spark_Receive(unsigned char*, int) (490):bytes_received 16

kennethlimcp commented 9 years ago

To a beginner like me it looks like an improperly closed socket... @technobly, jump in! :)

gruvin commented 9 years ago

It's odd that the working connection is the one that had the most reported problems getting there, starting with ...

0000011210:<ERROR> hci_event_handler (258):Timeout now 11210 start 3210 elapsed 8000 cc3000__event_timeout_ms 8000
0000011263:<ERROR> hci_event_handler (259):Timeout waiting on tSLInformation.usRxEventOpcode 0x1007

And the two lines that appear just before the hard fault in the non-working output also show up, identically, in the working one.

0000016116:<DEBUG> set_socket_active_status (838):Sd=1, Status SOCKET_STATUS_ACTIVE
0000016138:<DEBUG> set_socket_active_status (838):Sd=1, Status SOCKET_STATUS_INACTIVE

Beats the coffee beans out of me. Seems to be either a weird race condition or something screwy with the new version of the compiler. A curly one, by any measure.

dmiddlecamp commented 9 years ago

I just found a race condition with current versions of the core firmware that is exacerbated on the local server, and I believe this server-side workaround should help significantly while we're waiting on the firmware fix to roll out to everyone. I'm hoping this will fix this issue as well. I'll test it some more and then push to npm when it's ready. Please feel free to grab the master branch of spark-protocol in the meantime if you'd like to help test. :)

Thanks! David

kennethlimcp commented 9 years ago

@dmiddlecamp, we might need you to describe in more detail the race condition you found and patched with https://github.com/spark/spark-protocol/commit/d615ae1eff4d3ea23c0420cee048ef74147d3743.

I'm not sure how to test for this race condition so more information would be required.

As of now, I have tested that if this fix (https://github.com/spark/core-firmware/commit/5e18860a47372fe81a433ef4847eced90c1e4d80) is not used, we still see SOS flashes even with the new patch you added.

dmiddlecamp commented 9 years ago

Hmm, it's possible that the core firmware is simply sending too much too fast during the startup / handshake procedure with the addition of the UDP multicasts, and that's causing it to crash.

The spark-protocol patch fixes a race condition that caused premature messages from the core to not be handled in time on the server, which caused the session with that core to be dropped.

kennethlimcp commented 9 years ago

I only saw the "BOOM early data" message once out of the few connections during testing. The Spark Cloud probably has tons of users while I am testing with just one core, so it might not be obvious.

Maybe the broadcast is a separate issue from this but it's still hard to tell.

Let me know how I can help further!

KarbonDallas commented 9 years ago

I haven't been able to reproduce this issue with current versions of firmware and spark-server. If anyone is still having trouble with this please feel free to reopen the issue.