Hardware watchdog... how to find the cause?

giig1967g commented 6 years ago

Hi, in the last versions I am experiencing a constant reboot (every other day) with "Hardware watchdog" as the Reboot cause. I have also changed from Static IP to DHCP. How can I find the cause of the Hardware watchdog? What does exactly it means?

TD-er commented 6 years ago

If some piece of code is running for over 6 seconds without calling any delay or yield, it will trigger a hardware watchdog, which performs a reset.

So there is some code in your setup either waiting for that long, or running an "infinite loop". Could you please give more info on your setup? Also do not set the "MessageDelay" too high, nor use "delay" in the rules.

thomastech commented 6 years ago

@TD-er: FYI, the latest crash/reboot I mentioned in #1643 was reported in the GUI as Reset Reason | Hardware Watchdog. This was the test system that had run for a day, then the WiFi went offline for a couple hours, then the board rebooted on its own. Might be related to this issue, or maybe not.

Thomas

giig1967g commented 6 years ago

here my config schermata 2018-08-18 alle 11 42 14

TD-er commented 6 years ago

See my comment here: https://github.com/letscontrolit/ESPEasy/issues/1659#issuecomment-414047835 Looks like my nodes are also "affected", which is good :)

thomastech commented 6 years ago

@TD-er: Yesterday I loaded the ESP_Easy_mega-20180815_test_ESP8266_4096 build on a NodeMCU. It ran great for 18 hours then rebooted. System Info says: Boot : Manual reboot (1), Reset Reason : Hardware Watchdog.

A second duplicate NodeMCU is still running fine. But it has only been running for 17 hrs, so it may face the reboot dance soon.

Thomas

TD-er commented 6 years ago

Would be great to see if it occurs at the same interval, or time of day. Maybe it is some NTP refresh, or something else, who knows.

thomastech commented 6 years ago

@TD-er: That would be great. But so far I have not seen a pattern that indicates it is triggered by run duration or time of day.

My hunch is that it is something related to WiFi, such as a reconnect. But I have tried to torture the WiFi connection (force router offline, create weak RF signal levels) and nothing bad happened. So my hunch seems to be nonsense. Hopefully you find the cause and save us.

Thomas

thomastech commented 6 years ago

It rebooted again after running for about 2 hours. Now reports Manual reboot (2), Reset Reason : Hardware Watchdog.

The second duplicate NodeMCU is still running fine. About 19+ hours so far.

Thomas

TD-er commented 6 years ago

I found an issue with the handling of UDP traffic (when C013 is used). That could cause Exception crashes. (not likely a Watchdog reset) I also added some checks when creating an UDP client for NTP, to see if that may cause infinite waiting. Those can cause a watchdog reset.

When tested, I will make a commit for it.

thomastech commented 6 years ago

@TD-er I've been testing the new builds as they are released. So far none have solved the Watchdog reset. However, the latest ESP_Easy_mega-20180922_dev_ESP8266_4096.bin build seems better than the last release.

The latest firmware has been successfully running on one of my NodeMCU's (current duration 1 day 22 hrs, no reboots). But the second NodeMCU has rebooted several times.

I have a feeling the rebooting is related to the WiFi access since the latest reboot occurred when I accessed the device from my browser. I also noticed a previous reboot occurred at a time when the good "working" device reported a beacon timeout. But I can't replicate the reboots on demand.

SysInfo reports this: Boot: Manual reboot (8) Reset Reason: Hardware Watchdog

Not sure if you are interested in this feedback. But it's been a month+ since my last updated and I thought I'd keep you posted on my findings.

Thomas

TD-er commented 6 years ago

I'm always interested in feedback, especially when there seems to be a bit of improvement :)

thomastech commented 6 years ago

The reboot issue certainly has been a tough nut to crack.

My winning streak with the "working" device just ended. After 48 hours runtime it rebooted. Sysinfo reports Manual reboot (1) Hardware Watchdog

Thomas

TD-er commented 6 years ago

See PR #1834 I just merged a change which moves the address space of the Arduino stack to be on top of the System stack. The latest core library appears to have shifted the Arduino stack to overlap a bit with the System stack to save about 4k of memory. But since we're allocating quite a lot on the System stack, this may have led to an increase in reports of HW Watchdog resets.

So please test with the October 1st build, as soon as that one is ready.

thomastech commented 6 years ago

I've installed ESP_Easy_mega-20181001_dev_ESP8266_4096.bin one two NodeMCU's. I will report back tomorrow (or sooner) if they experience a reboot.

Some initial comments:

The 4K decrease in system free ram is disappointing. If this change to stack memory does not help then please consider reverting back.
Not sure if it is related to the new firmware, but web page access is slow. Sometimes navigating the tabs take several seconds for each new page to populate.

Thanks again for your efforts. Fingers are crossed that this merge helps cure the W-dog reboot issue.

Thomas

TD-er commented 6 years ago

Hmm I had the impression it was loading a bit faster with the changed Arduino stack location. But maybe getting the free stack for statistics is using quite some resources and is called a bit more in some functions

thomastech commented 6 years ago

@TD-er: I don't know why, but the slow page loading has gone away and refresh is OK now. If there are no other reports of it then I'd say it was an isolated issue related to my WiFi router.

Thomas

TD-er commented 6 years ago

Pfiew, I was afraid you had to report a crash/reboot already.

TD-er commented 6 years ago

I was thinking about this.... The webserver has some mechanism to free memory when it is too low. This freeing memory may take some time which makes the webinterface slow down.

thomastech commented 6 years ago

Interesting, I wasn't aware the webserver could do that kind of magic. It may have been involved because during the slow browser response the system load was about 40%. But after the problem went away it settled down to about 30% load.

Now the bad news. One of the test devices rebooted after 10 hours. System info report: Boot: Manual reboot (1) Reset Reason: Software Watchdog

I'll let the other device continue to run until it reboots. Maybe the force is stronger with this one.

Thomas

TD-er commented 6 years ago

It is not that bad, since a software watchdog is different from a hardware watchdog. The software version means it is still doing stuff

thomastech commented 6 years ago

You shouldn't have said that it is not that bad because Murphy is watching us. The second device rebooted at 19 hours due to hardware reset. System info report: Boot: Manual reboot (2) Reset Reason: Hardware Watchdog

BTW, the other device rebooted again a few minutes ago. Another Software reset. System info report: Boot: Manual reboot (3) Reset Reason: Software Watchdog

Thomas

TD-er commented 6 years ago

OK, so that's not the fix :( Can you also/already test using this PR: https://github.com/letscontrolit/ESPEasy/pull/1838 ? I will later this evening add extra settimeout calls to other WiFi client instances, so it is not complete yet.

thomastech commented 6 years ago

No problem, I'll test PR #1838 after it is incorporated in the nightly build.

Thomas

thomastech commented 6 years ago

I'm running both devices on ESP_Easy_mega-20181004_dev_ESP8266_4096.bin. Twelve hours so far, no reboots. Fingers crossed.

Thomas

giig1967g commented 6 years ago

I am running 20181002 with Mikrotik router. No reboots so far. Running for 2 days and 4 hours.

s0170071 commented 6 years ago

@TD-er do you think this could help?
Somebody just needs to figure out how to conveniently link the .elf file from the build date to the exception decoder...

Another thought on this: If it is possible to catch the exception and save stuff, can't we just catch that exception too and do something with it ? Save some text, send an email, ignore it and carry on ?

TD-er commented 6 years ago

@thomastech Just curious, what are the memory and stack stats of that node running the dev build?

@s0170071 That's a very nice library. I think we could try it, to see what's happening. Also I could read the last crash log at boot and write it to SPIFFS. Or just add a 'crash log report' option to send the crash to a server I think it deserves its own issue. (to make it easier to find)

thomastech commented 6 years ago

Just curious, what are the memory and stack stats of that node running the dev build?

Both NodeMCU devices have the same ESPEasy configuration. Here's a System Info snapshot.

Device 1 ("production" NodeMCU): Load 30.60% (LC=10052) Free Mem 10328 (8968 - sendContentBlocking) Free Stack 3584 (144 - LoadTaskSettings)

Device 2 ("Test" NodeMCU): Load: 27.40% (LC=10648) Free Mem: 9920 (8272 - sendContentBlocking) Free Stack: 3520 (144 - LoadTaskSettings)

Thomas

TD-er commented 6 years ago

OK, so the Arduino stack is dangerously low at some point. The lowest in both configurations is 144 bytes. I will increase the Arduino stack to 5k (default = 4k) just to be sure and then start hunting for stack usage.

Also the heap is used quite intensively, so that one should also be looked at.

thomastech commented 6 years ago

I will increase the Arduino stack to 5k (default = 4k) just to be sure and then start hunting for stack usage.

That sounds like a good thing to try.

One device rebooted after 19 hours. Sys Info on rebooted unit: Load: 27.80% (LC=10652) Free Mem: 11064 (9528 - sendContentBlocking) Free Stack 3584 (144 - LoadTaskSettings) Boot: Manual reboot (1) Reset Reason: Hardware Watchdog

The second test unit is still OK, 22 hrs run time so far.

Thomas

s0170071 commented 6 years ago

About the stack: Normally, if you return from somewhere back to the main loop, the stack should only hold local variables defined by the main loop. Anything that comes on top, i.e. drains the stack is then either

interrupt routine stuff, sometimes they can also be nested- but once your back to loop that should be ok.
Core stuff (Wifi) but that goes to the sys stack, afaik.

Did I forget something ?

How can it be then possible to have a leak on the stack ? Its just not how it works.

It just can be a local variable that "grows", i.e. a string or a list.

TD-er commented 6 years ago

Lists and Strings grow only on the heap, even when declared on the stack. And I don't think we have a leak, but we are just using too much in some calls. Either the depth of calls or number of stack allocated variables (or their size) is the problem.

giig1967g commented 6 years ago

I am running 20181002 with Mikrotik router. No reboots so far. Running for 2 days and 4 hours.

Unfortunately one of units just rebooted with Hardware Watchdog when connected to Mikrotik router. The unit connected to ASUS router is still working after several days.

s0170071 commented 6 years ago

@TD-er is this fixed already ? Certainly WiFi related ;-)

giig1967g commented 6 years ago

I am running 20181002 with Mikrotik router. No reboots so far. Running for 2 days and 4 hours.

Unfortunately one of units just rebooted with Hardware Watchdog when connected to Mikrotik router. The unit connected to ASUS router is still working after several days.

The same unit that had HW now does not respond to web server (but works "internally"). This unit is connected to Mikrotik. So definetely recent changes do not solve the wifi connection issues.

giig1967g commented 6 years ago

please see #1640

cs3395202 commented 6 years ago

Stop

On Oct 5, 2018 12:14 PM, "Plebs" notifications@github.com wrote:

please see #1640 https://github.com/letscontrolit/ESPEasy/issues/1640

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/letscontrolit/ESPEasy/issues/1656#issuecomment-427436515, or mute the thread https://github.com/notifications/unsubscribe-auth/Ap1Ge2QPDcw80AmBOZZQTAHP4G4pjLhmks5uh5NxgaJpZM4WAR2m .

TD-er commented 6 years ago

@s0170071

TD-er is this fixed already ? Certainly WiFi related ;-)

Forgot it, will add it now to merge with the stack5k fix.

Domosapiens commented 6 years ago

Sorry to say: The dog is still there. 2 units running the daily build: Wemos D1 mini, running rcwl-0516.

Wemos D1 mini, running a 1602 LCD, rcwl-0516 as Display button and 2# DS18B20.

In the Log: Seems not correlated to the Watchdog.

TD-er commented 6 years ago

I found something very strange this evening.

I have installed the Chrome plugin "JSONView", which will check and format JSON. This makes it very easy to check if JSON is valid.

I set the log level of the web log to "Debug More" (to do a lot of memory allocations) and then loading the JSON via /logjson#

On almost every reload I have corrupted JSON. The output is filled with strings from other webserver requests.

This can also explain some of the issues related to some controllers and perhaps also lead to watchdog resets.

{"Log": {"Entries": [{"timestamp":373907,
"text":"UDP  : 2C:3A:E8:32:90:8B,192.168.1.92,7",
"level":4},
{"timestamp":377450,
"text":"SYS  : 6.00",
"level":2},
{"timestamp":377453,
"text":"EVENT: uptime#uptime=6.00",
"level":2},
{"timestamp":377455,
"text":"EVENT: uptime#uptime=6.00 Processing time:2 milliSeconds",
"level":3},
{"timestamp":377456,
"text":" Domoticz: Sensortype: 1 idx: 209 values: 6.00",
"level":2},
{"timestamp":377487,
"text":"HTTP : C001 connecting to domoticz:8080",
"level":3},
{"timestamp":377494,
"text":"GET /json.htm?type=command&param=udevice&idx=209&nvalue=0&svalue=6.00&rssi=7 HTTP/1.1
^Host: domoticz:8080
^User-Agent: ESP Eas",
"level":3},
{"timestamp":377494,
"text":"/json.htm?type=command&param=udevice&idx=209&nvalue=0&svalue=6.00&rssi=7",
"level":3},
{"timestamp":377504,
"text":"HTTP/1.1 200 OK
",
"level":4},
{"timestamp":377504,
"text":"HTTP : C001 Success! HTTP/1.1 200 OK
",
"level":3},
{"timestamp":377504,
"text":"HTTP : C001 closing connection",
"level":3},
{"timestamp":378549,
"text":"WD   : Uptime 6 ConnectFailures 0 FreeMem 10208",
"level":2},
{"timestamp":378550,
"text":"UDP  : Send Sysinfo message",
"level":4}],
"TTL":1000,
"timeHalfBuffer":2708,
"nrEntries":13,
"SettingsWebLogLevel":4,
"logTimeSpan":4643
}}

As can be seen, it outputs a lot of other return values, like HTTP/1.1 200 OK

uzi18 commented 6 years ago

nothing wrong with this 200 OK. it is just log from C001 with level 4.

TD-er commented 6 years ago

@uzi18 You're right. I found (and fixed) the bug where the left over \r wasn't replaced when converting it to JSON.

I will have a look at the changes lately that so a lot with memory allocations and also try to find allocations on the stack.

s0170071 commented 6 years ago

@TD-er I am currently updating the checkram function to monitor stack too. First output did not yield useful results. If you go hunting for stack usage, how about warpping function calls in a "SPIFFS_CHECK" like define that compares the stack pointer before and after the function call ? That should reveal stack eaters...

TD-er commented 6 years ago

That's a good one. I also want to know more about the heap, for example if it gets really fragmented or not. (e.g. what's the max continuous block)

s0170071 commented 6 years ago

Later, once it is known what requires substantial amounts of stack (e.g. spiffs), the available amount can be checked before making the chall and then ? Reboot instead of just crash ? Fragmentation may be an issue too, but I have seen crashes with 17k heap free and no activity. So its not too likely the bug we're after.

TD-er commented 6 years ago

That last remark sounds reasonable and may save a lot of searching :)

s0170071 commented 6 years ago

If we had heap fragmentation, what would happen if you cannot allocate new memory ? I would assume the pointer returned by new() to be null. Is this correct ? If so, a viable test could be to now and then try to allocate some useful heap (leave 3k free for wifi) and see if that worked. And then just free it again.

TD-er commented 6 years ago

In the staged version of the core lib there is some development on that: https://github.com/esp8266/Arduino/pull/5090/commits

So I can have a look at that and make a test build which will also show the heap statistics when available. Just to get some idea on what's happening.

Allocations with new should indeed return a NULL pointer, but String will fail silently.

s0170071 commented 6 years ago

very good. Seems to be an issue then. About the strings failing silently: if you allocate / new() some memory block, free() it again you should be able to string.reserve() it afterwards without trouble, right ?

Sounds like a wrapper function #define for String.reserve()

TD-er commented 6 years ago

std::string is traditionally (in STL library for C++) a standard container, which does the allocation/deallocation for you. The Arduino String class is loosely based on the same principle, only with some extra's and also some other functions missing. Maybe we can check for the actual capacity of the String after calling a reserve. Not sure yet if those are publicly accessible. But you shouldn't do new and free/delete on the members of String or else members will get out of sync.

letscontrolit / ESPEasy

Hardware watchdog... how to find the cause? #1656