phoddie / node-red-mcu

Node-RED for microcontrollers
120 stars 18 forks source link

ESP8266 target crashing on startup after connecting to wifi #83

Closed colinl closed 1 year ago

colinl commented 1 year ago

Using @ralphwetzel/node-red-mcu-plugin installed from git #cf57d3d7f0ebac26285bba41b2148d60cbb32446, building a simple flow for a Wemos D1 Mini (configured in the plugin as a Nodemcu), the target crashes after connecting to the wifi.

In node-red-mcu/setupwifi.js in the code

           case WiFi.gotIP:
                trace(`IP address ${Net.get("IP")}\n`);

                if (!sntp || (Date.now() > 1672722071_000)) {
                    done?.();
                    done = undefined;
                    return;
                }

If I put a breakpoint on the line done?.() it breaks ok and if I step then it moves forwards to the next line, and if I let it run then the flow runs correctly. However, if I put the breakpoint instead on done = undefined then it shows the IP address in the log but then immediately crashes and restarts. At the suggestion of @ralphwetzel in this forum thread I hid the esp entry in @ralphwetzel/node-red-mcu-plugin/node-red-mcu/manifest_core.json so that it used legacy code, and rebuilt, but the behaviour was the same.

This seems to happen if I have any contrib nodes installed, but without those it is ok. This simple example uses node-red-contrib-pid

[{"id":"a74a29ab51318a3a","type":"inject","z":"f92bb64b93bfc46e","name":"","props":[{"p":"payload"},{"p":"topic","vt":"str"}],"repeat":"5","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"","payloadType":"date","_mcu":{"mcu":true},"x":170,"y":140,"wires":[["9db682e1f6a487a8"]]},{"id":"9db682e1f6a487a8","type":"debug","z":"f92bb64b93bfc46e","name":"debug 80","active":false,"tosidebar":true,"console":false,"tostatus":false,"complete":"payload","targetType":"msg","statusVal":"","statusType":"auto","_mcu":{"mcu":true},"x":370,"y":200,"wires":[]},{"id":"6d9adcf8cdc36b27","type":"debug","z":"f92bb64b93bfc46e","name":"debug 81","active":false,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","statusVal":"","statusType":"auto","_mcu":{"mcu":true},"x":440,"y":300,"wires":[]},{"id":"ed0d63aad317e829","type":"inject","z":"f92bb64b93bfc46e","name":"","props":[{"p":"payload"}],"repeat":"2","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"1","payloadType":"num","_mcu":{"mcu":true},"x":90,"y":380,"wires":[["a83bf3122efb505c"]]},{"id":"a83bf3122efb505c","type":"trigger","z":"f92bb64b93bfc46e","name":"","op1":"0","op2":"1","op1type":"num","op2type":"num","duration":"500","extend":false,"overrideDelay":false,"units":"ms","reset":"","bytopic":"all","topic":"topic","outputs":1,"_mcu":{"mcu":true},"x":280,"y":380,"wires":[["6d9adcf8cdc36b27","32dcd5abbb576d04"]]},{"id":"7180d313e18f3785","type":"debug","z":"f92bb64b93bfc46e","name":"debug 82","active":true,"tosidebar":true,"console":false,"tostatus":false,"complete":"false","statusVal":"","statusType":"auto","_mcu":{"mcu":true},"x":440,"y":480,"wires":[]},{"id":"32dcd5abbb576d04","type":"rpi-gpio out","z":"f92bb64b93bfc46e","name":"","pin":"2","set":"","level":"0","freq":"","out":"out","bcm":true,"_mcu":{"mcu":true},"x":450,"y":380,"wires":[]},{"id":"e0f56ffc112362dd","type":"inject","z":"f92bb64b93bfc46e","name":"","props":[{"p":"payload"}],"repeat":"2","crontab":"","once":false,"onceDelay":0.1,"topic":"","payload":"0.5","payloadType":"num","_mcu":{"mcu":true},"x":90,"y":480,"wires":[[]]},{"id":"b74829da6239b15d","type":"PID","z":"f92bb64b93bfc46e","name":"","setpoint":21,"pb":1,"ti":9999,"td":0,"integral_default":0.5,"smooth_factor":3,"max_interval":600,"enable":1,"disabled_op":0,"_mcu":{"mcu":true},"x":220,"y":580,"wires":[[]]}]
phoddie commented 1 year ago

Thanks for the report. The flow itself looks reasonable. I will take a look at what's going on.

phoddie commented 1 year ago

Alas, I am unable to reproduce the problem. The flow runs normally for me. Let's try to sort out what might be different.

To run your flow, I followed the instructions to run it from the command line. It looks like you are running from the MCU plug-in. That should be fine, but since we are looking for what is different, maybe you could try the command line build, if it isn't too much trouble. I used the -p esp/node_mcu to match what your build.

There seems to be enough memory. Here's what I see in xsbug instrumentation at the breakpoint you set. If you see anything significantly different, that might be a hint.

image

You appear to be running with Wi-Fi enabled. This particular flow doesn't actually use Wi-Fi so it can run without that. If you remove the Wi-Fi credentials, does it make a difference?

colinl commented 1 year ago

Firstly, it seems I had some sort of build problem with my test to see if it was ok if I hid the "esp" option in manifest-core, perhaps I had not done the build clean correctly. Now that I repeat that the problem is gone, and I see different code in setupwifi.js. So with that legacy code all is well. I will correct my description of the bug. Returning the current code, I should have said that if I don't provide wifi credentials then again there is no problem. I will add that to the description. Another fact I have determined is that if I power up the ESP independently, then it works fine. If I then use Reconnect to xsbug in the plugin UI, the target restarts and crashes as before. Again, possibly consistent with it being a subtle timing issue. I will have a look at running it from the command line, if I can work out how to do that. I have only used it from the plugin so far.

colinl commented 1 year ago

A bit more information. By making minor changes to the flow it is possible to make the problem appear or disappear. However, adding a 1 sec async wait after the line trace(IP address ${Net.get("IP")}\n); appears to have completely fixed this issue. I am able to make changes to the flow without seeing any problems with this code.

phoddie commented 1 year ago

Thanks for the additional information and congratulations on finding a workaround.

It doesn't obviously look like a timing issue, but I can't explain your results either. Unfortunately, I can't reproduce them either so I'm kind of stuck at the moment.

colinl commented 1 year ago

I am looking at running it independently of the plugin, but I can't work out exactly how to install node-red-mcu stand alone. It doesn't appear to describe that in the readme, or at least I can't see it.

colinl commented 1 year ago

I have worked out how to build and run it directly, rather than using the plugin, but it says Disabling unsupported node type "PID"! There is some mention of how to include contrib nodes in the node-red-mcu docs but it isn't very clear exactly what I have to do. It says to see the lower_case example, and the only reference to that I can find is in the nodes folder. I tried putting a copy of node-red-contrib-pid there and I see that there is also a manifest file in the lower_case example, so I put in the pid folder a manifiest.json containing

{
    "modules": {
        "*": "./PID"
    },
    "preload": "PID"
}

but it still says it is disabling it, so obviously I am missing something. Can you tell me what I am doing wrong please?

phoddie commented 1 year ago

I have worked out how to build and run it directly, rather than using the plugin

Excellent!

but it says Disabling unsupported node type "PID"!

Good. That's what it should do when the node implementation cannot be found.

I'm going to assume that the PID note is this one, since it was written by you. Since there are not external dependencies, it looks like it should work. I'll give that a closer look later and let you know what I find.

phoddie commented 1 year ago

I was able to get the PID node to register. Here's what I did

I ran your test flow above. In it, the PID node isn't connected to anything so it doesn't do much. But... I can set breakpoints on its calls to RED.nodes.registerType, RED.nodes.createNode and this.on('input', function(msg) and they all work. If you can get that far, you should be able to build a flow that does something a little more interesting with the PID node. ;)

colinl commented 1 year ago

Sorry about not including the node name, you should not have had to waste your time working that out. I cannot believe that I left it out of the description. Yes I know the flow doesn't do anything worth while (apart from flash the led so I can see if it is working). I reduced the flow down in order to make the problem easier to diagnose. I can confirm that with the flow running in this environment it does not fail, but I can also re-confirm that the same flow running in the plugin environment does still fail consistently, unless I have the 1 second delay after the wifi connects. So there must be some significant difference between the two environments. I will report back to @ralphwetzel, I don't know whether he is watching this.

phoddie commented 1 year ago

No problem. At least I understand the goal now. ;)

I can try to take a look in the plug-in too. It isn't what I try first, because it isn't as convenient for my usual debugging / development work on the code of Node-RED MCU Edition.

colinl commented 1 year ago

While I have your attention, on another flow, after running for a while I get an exception Panic /home/colinl/esp/esp8266-2.3.0/cores/esp8266/core_esp8266_main.cpp:131 loop_task I looked at the code but it isn't obvious what that implies. When I come across such problems what is the best way to ask for suggestions?

phoddie commented 1 year ago

This repository is a fine place to ask since it is being triggered by the Node-RED MCU Edition runtime. If you suspect a more general Moddable SDK issue, the Moddable SDK repository is the place to ask.

The error you are getting in loop_task is probably a native stack overflow -- the stack guard is corrupted. It may be that your crash on start is also a native stack overflow, which would perhaps explain its intermittent behavior and hard crash.

If you are feeling bold, you can try increasing the native stack. That's a manual edit to esp8266-2.3.0/cores/esp8266/cont.h to change #define CONT_STACKSIZE 4096 to something bigger. You can't make it too big, or other parts of the system will run out of memory. Try 5 or 6 KB:

//#define CONT_STACKSIZE 4096
#define CONT_STACKSIZE 5120
//#define CONT_STACKSIZE 6144

You may need to do a clean build from there to ensure it picks up the changes. The easiest way to do that may be to manually delete $MODDABLE/build/tmp/esp

phoddie commented 1 year ago

I updated the MCU Plugin and tried the flow. Of course, it works just fine. :(

I added some traces to see the native stack use. When connected to Wi-Fi and running the test flow with the PID node installed, the native stack never gets below 1260 bytes free which should be a safe margin.

colinl commented 1 year ago

I am wondering whether the issue is something to do with the way xsbug is invoked, or the way it interacts with the plugin. I am going to do some more tests later today.

colinl commented 1 year ago

I think I am on the right track. If I build and deploy it in the plugin then it goes into the start, attach to Wifi, crash, restart loop. If I then close down the xsub window then after a few seconds the target restarts and runs normally (the flow flashes the LED so I can tell that). So there is nothing wrong with the target code itself, it is to do with how it being run. Also, if I go to the project directory in the plugin and run the mcconfig command from script file that the plugin generates, that is mcconfig -d -x localhost:5004 -m -p esp/nodemcu ssid="***" password="***" then it crashes as before, but if I run mcconfig -d -x localhost:5004 -m -l -t deploy -p esp/nodemcu ssid="***" password="***" so that it does not start xsub, then it runs as it should.

colinl commented 1 year ago

I have moved this to a new issue on the plugin, since it only affects that environment. https://github.com/ralphwetzel/node-red-mcu-plugin/issues/17

phoddie commented 1 year ago

Sounds good. If there's something more I can try, please let me know. Closing this one for now.