letscontrolit / ESPEasy

Easy MultiSensor device based on ESP8266/ESP32
http://www.espeasy.com
Other
3.26k stars 2.21k forks source link

[OpenHAB MQTT] Crash due to yield panic #1625

Closed TD-er closed 4 years ago

TD-er commented 6 years ago

As reported here: https://github.com/letscontrolit/ESPEasy/issues/1574#issuecomment-410065300

Suggested fix: Make the callback function just register handling of the MQTT as a new job in the scheduler. This will result in all calls made during processing of the MQTT message being called from the loop() and thus be allowed to call yield() or delay()

It's due to a panic reboot. Happened after a MQTT publish to my Openhab home automation system. Here's the log:

11514348 : NEXTION075 : code: |s,i16,sOff
11514353 : EVENT: NEXTION#idx=16.00
11514394 : ACT  : Publish /ESPEZ_Laundry/NEXTION/idx,16
11514408 : Command: publish

Panic C:\Users\Tom\AppData\Local\Arduino15\packages\esp8266\
hardware\esp8266\2.4.1\cores\esp8266\core_esp8266_main.cpp:99 __yield

ctx: cont 
sp: 3fff3e00 end: 3fff4750 offset: 01b0

>>>stack>>>
3fff3fb0:  00000001 00000001 3fff40f4 3fff2419  
3fff3fc0:  00000001 00000001 3fff899c 4025fab8  
3fff3fd0:  00000001 00000001 3fff899c 40241651  
3fff3fe0:  00000c01 00000000 00000000 40100200  
3fff3ff0:  00000000 00000010 00000000 00000000  
3fff4000:  00000000 00000000 3fff8b14 0000000f  
3fff4010:  00000000 3fff8b2c 0000000f 00000000  
3fff4020:  3fff8b44 0000000f 00000000 3fff8b5c  
3fff4030:  0000000f 00000000 3fff8b74 0000000f  
3fff4040:  00000000 00000000 6c627550 00687369  
3fff4050:  00000046 3fff4110 0000000d 402173d7  
3fff4060:  00000600 4c5f1800 00000000 4e2f7901  
3fff4070:  00000000 00000010 00000000 00000000  
3fff4080:  00000000 00000000 00000000 00000000  
3fff4090:  00000000 00000000 50003600 4c5f5a45  
3fff40a0:  646e7561 4e2f7972 49545845 692f4e4f  
3fff40b0:  00007864 00000000 00000000 00000000  
3fff40c0:  00000000 00000000 0000032d 4025ea7b  
3fff40d0:  312c7864 00000036 3fff41dc 4025eaad  
3fff40e0:  3fff41dc 00000000 3fff26ae 00000006  
3fff40f0:  00000001 3fff8afc 0000000f 00000000  
3fff4100:  00000001 00000001 00000001 40241dd5  
3fff4110:  00000c00 00000000 00000000 3fff5300  
3fff4120:  00000000 00000010 00000000 00000000  
3fff4130:  00000000 00000000 3fff8a34 0000000f  
3fff4140:  00000000 3fff8a4c 0000000f 00000000  
3fff4150:  3fff8a7c 0000000f 00000000 3fff8a94  
3fff4160:  0000000f 00000000 3fff8aac 0000000f  
3fff4170:  00000000 00000000 4e5b2c78 49545845  
3fff4180:  69234e4f 0d5d7864 6520200a 6669646e  
3fff4190:  20200a0d 5b206669 5458454e 234e4f49  
3fff41a0:  5d786469 30353d3e 2f202030 6f54202f  
3fff41b0:  20686375 6e657645 40107b10 00000000  
3fff41c0:  000003e8 00000000 3fff87fc 3fff883c  
3fff41d0:  00000000 00000000 00000000 3fff8ac4  
3fff41e0:  0000002f 00000025 3fff89d4 0000002f  
3fff41f0:  00000025 3fff899c 0000002f 00000025  
3fff4200:  3fff87e4 0000000f 00000000 3fff8964  
3fff4210:  0000002f 00000025 3fff891c 0000003f  
3fff4220:  00000025 3fff8a0c 0000001f 0000001b  
3fff4230:  00000001 00000001 00000001 00000001  
3fff4240:  3fff42d0 00000010 3fff4280 00000040  
3fff4250:  00000000 3fff2419 3fff428c 4025ebb0  
3fff4260:  3fff0031 3fff2419 00000001 00afb1f1  
3fff4270:  3fff42d0 3fff2419 00000001 40241f93  
3fff4280:  00000000 00000000 00000000 3fff87cc  
3fff4290:  0000000f 0000000a 3fff879c 0000000f  
3fff42a0:  0000000a 00000030 3fff42d0 4025f058  
3fff42b0:  3fff259e 00000000 00000000 00000005  
3fff42c0:  00000010 00000004 00000000 40242b55  
3fff42d0:  3fff8774 0000001f 00000011 4010020c  
3fff42e0:  3fff23c4 3fff0e58 3fff4310 40100690  
3fff42f0:  00000000 00000000 00000000 00000011  
3fff4300:  3fff1b68 3fff26a8 3fff4610 40244435  
3fff4310:  00000000 00000000 00000000 00000011  
3fff4320:  3fff1b68 3fff0e58 3fff4610 40264db4  
3fff4330:  ae4f0759 3fff0e58 3fff4560 40262844  
3fff4340:  3fff0030 3fff0e58 3fff45a8 00000011  
3fff4350:  3fff1b68 3fff0e58 3fff4610 4024a0c7  
3fff4360:  692c737c 732c3631 0066664f 0083126e  
3fff4370:  3fff5ccc 3fff4670 3fff4478 00000030  
3fff4380:  4021515a 3fff4670 3fff64b4 3fff645c  
3fff4390:  0000072f 0000002f 00000000 00000089  
3fff43a0:  00000000 00000000 0000001f 40104ec9  
3fff43b0:  4000050c 3fff1b00 3fff4470 00000030  
3fff43c0:  40000f83 00000030 00000000 ffffffff  
3fff43d0:  40285942 00000023 3fffbff8 00000e71  
3fff43e0:  000006b0 00000e70 00000001 3fff8a78  
3fff43f0:  000000c2 3fff64dc 00000001 3fff4b5c  
3fff4400:  3fff3640 00000000 00007388 00000030  
3fff4410:  40103e13 3ffef740 50800000 4000050c  
3fff4420:  3fffc278 40101f8c 3fffc200 00000022  
3fff4430:  3fff4440 00000023 3fff84fc 40214afa  
3fff4440:  ffffe660 00000020 00000000 4010053d  
3fff4450:  401004ba 00000023 00000003 3fff4490  
3fff4460:  3fff1b00 3fff4468 3fff4670 40214bbc  
3fff4470:  3fff4470 3fff4470 3fff4478 3fff4478  
3fff4480:  3fff4480 3fff4480 3fff4488 3fff4488  
3fff4490:  3fff4490 3fff4490 3fff4498 3fff4498  
3fff44a0:  3fff44a0 3fff44a0 3fff44a8 3fff44a8  
3fff44b0:  3fff44b0 3fff44b0 3fff44b8 401071ee  
3fff44c0:  3fff44c0 3fff44c0 3fff4610 40210069  
3fff44d0:  3fff44d0 3fff44d0 3fff44d8 40244d60  
3fff44e0:  3fff44e0 3fff44e0 3fff44e8 3fff44e8  
3fff44f0:  3fff44f0 3fff44f0 3fff44f8 3fff44f8  
3fff4500:  00000000 000006c0 000006c0 4010020c  
3fff4510:  3fff3640 000006c0 000006c0 00000046  
3fff4520:  3fff3648 00000000 3fff16d0 40285968  
3fff4530:  00000014 000035d8 000006bb 40207b90  
3fff4540:  00000010 00000010 3fff178c 4028f34f  
3fff4550:  4025d1f4 3fff4610 3fff178c 4025ebb0  
3fff4560:  00000000 00000000 00000000 4020b47c  
3fff4570:  3fff3640 000006bb 000006bb 4010020c  
3fff4580:  00000004 3fff4610 3fff45b0 40100690  
3fff4590:  3fff875c 0000000f 0000000b 3fff8744  
3fff45a0:  0000000f 00000000 3fff8r‚)ŠLþ
ôŒ
JE0¸
†´ˆˆ9!9¯„ù„!ÁÎì)ÿ18Áù526 : INIT : Free RAM:21328
527 : INIT : I2C
527 : INIT : SPI not enabled
541 : NEXTION075 : serial pin config RX:13, TX:15, Plugin Enabled
543 : NEXTION075 : Using software serial
599 : INFO : Plugins: 71 [Normal] [Testing] (ESP82xx Core 2_4_1, NONOS SDK 2.2.1(cfd48f3), LWIP: 2.0.3)
696 : EVENT: System#Wake
745 : WIFI : Set WiFi to STA
TD-er commented 6 years ago

Maybe you could test this PR: https://github.com/letscontrolit/ESPEasy/pull/1627

I just tested it will compile and did not crash my ESP32 node when pressing the test notification. But since it is already close to 2am here, I see an opportunity for those still in day-light side of the globe ;)

thomastech commented 6 years ago

I loaded PR 1627 into two nodeMCU boards. Identical configurations, except only one of them will do MQTT publish.

I'll run them overnight and check their uptime in the morning. If uptime is wrong then we will know that there was at least one reboot.

thomastech commented 6 years ago

13 hours later ... The NodeMCU board that didn't do MQTT is still OK. No resets (uptime correct).

The other board was reset several times. Resets occurred about every 2-4 hours. I'm now running with serial log, Debug Dev. Maybe there will something in the expanded log that helps ID this.

thomastech commented 6 years ago

It's important to note that although both boards have identical software configurations, the one that is working well is missing the the MPU6050 sensor I use to detect machine movement. So later today I will disable the MPU6050 and see if that stops the resets.

thomastech commented 6 years ago

I captured two panics in the log. Maybe there's a bread crumb in there that helps to ID what is going on. See entries 11575248 and 899358. panic2.log

TD-er commented 6 years ago

Is there a possibility the broker is busy or for some other reason not answering?

thomastech commented 6 years ago

That is a good question. Openhab is running correctly. It's installed on a RaspberryPi Model 3+.

I studied the log messages and they seem to suggest that the panic occurred while in misc.ino's rulesProcessingFile(). That's because the last log entry before the crash is "ACT :" which is in rulesProcessingFile(). This log output is printed immediately before the PluginCall().

Also, whenever the openhab controller _C005.ino does a MQTTpublish() it prints a "MQTT :" log entry. But we don't see that in the logs. So I don't know if the publish ever occurred.

Not sure what these debug crumbs are telling us. Do you have any ideas?

For completeness, here's the logs leading up to the panic crash:

11575136 : EVENT: NEXTION#idx=26.00
RuleDebug Processing:rules1.txt
     flags CMI  parse output:
RuleDebug: 001: endon
RuleDebug: 100: washer#ac do
RuleDebug: 101: if [washer#ac]=1
RuleDebug: 101: nextion,page0.va_washerac.val=1
RuleDebug: 101: else
RuleDebug: 101: nextion,page0.va_washerac.val=0
RuleDebug: 101: endif
RuleDebug: 000: endon
RuleDebug: 100: dryermov#detect do
RuleDebug: 101: if [dryermov#detect]=1
RuleDebug: 101: nextion,page0.va_dryerac.val=1
RuleDebug: 101: else
RuleDebug: 101: nextion,page0.va_dryerac.val=0
RuleDebug: 101: endif
RuleDebug: 000: endon
RuleDebug: 110: nextion#idx do
11575183 : Read settings: TaskSettings index: 1
11575189 : Read settings: TaskSettings index: 2
11575192 : Read settings: TaskSettings index: 4
11575196 : DEBUG DEV: Parsed String='  if 26'
11575199 : Read settings: TaskSettings index: 1
11575208 : Read settings: TaskSettings index: 2
11575211 : Read settings: TaskSettings index: 4
11575213 : DEBUG DEV: Parsed String='  if 26>=10 and 26'
RuleDebug: 111: if 26>=10 and 26<=30
11575220 : [if 26>=10 and 26<=30]=true
11575225 : Read settings: TaskSettings index: 1
11575230 : Read settings: TaskSettings index: 2
11575234 : Read settings: TaskSettings index: 4
11575238 : DEBUG DEV: Parsed String='      Publish /%sysname%/NEXTION/idx,26'
RuleDebug: 111: publish /espez_laundry/nextion/idx,26
11575248 : ACT  : Publish /ESPEZ_Laundry/NEXTION/idx,26

Panic C:\Users\Tom\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.4.1\cores\esp8266\core_esp8266_main.cpp:99 __yield

ctx: cont 
sp: 3fff3e70 end: 3fff47c0 offset: 01b0

>>>stack>>>

But it should have gone more like this:

190496 : EVENT: NEXTION#idx=26.00
RuleDebug Processing:rules1.txt
     flags CMI  parse output:
RuleDebug: 001: endon
RuleDebug: 100: washer#ac do
RuleDebug: 101: if [washer#ac]=1
RuleDebug: 101: nextion,page0.va_washerac.val=1
RuleDebug: 101: else
RuleDebug: 101: nextion,page0.va_washerac.val=0
RuleDebug: 101: endif
RuleDebug: 000: endon
RuleDebug: 100: dryermov#detect do
RuleDebug: 101: if [dryermov#detect]=1
RuleDebug: 101: nextion,page0.va_dryerac.val=1
RuleDebug: 101: else
RuleDebug: 101: nextion,page0.va_dryerac.val=0
RuleDebug: 101: endif
RuleDebug: 000: endon
RuleDebug: 110: nextion#idx do
190542 : Read settings: TaskSettings index: 1
190548 : Read settings: TaskSettings index: 2
190551 : Read settings: TaskSettings index: 4
190555 : DEBUG DEV: Parsed String='  if 26'
190558 : Read settings: TaskSettings index: 1
190563 : Read settings: TaskSettings index: 2
190567 : Read settings: TaskSettings index: 4
190571 : DEBUG DEV: Parsed String='  if 26>=10 and 26'
RuleDebug: 111: if 26>=10 and 26<=30
190578 : [if 26>=10 and 26<=30]=true
190585 : Read settings: TaskSettings index: 1
190588 : Read settings: TaskSettings index: 2
190591 : Read settings: TaskSettings index: 4
190595 : DEBUG DEV: Parsed String='      Publish /%sysname%/NEXTION/idx,26'
RuleDebug: 111: publish /espez_laundry/nextion/idx,26
190604 : ACT  : Publish /ESPEZ_Laundry/NEXTION/idx,26
190622 : Command: publish
RuleDebug: 111: endif
190628 : Read settings: TaskSettings index: 1
190631 : Read settings: TaskSettings index: 2
190633 : Read settings: TaskSettings index: 4
190636 : DEBUG DEV: Parsed String='  if 26'
RuleDebug: 111: if 26>=500
190638 : [if 26>=500]=false
190641 : Read settings: TaskSettings index: 1
190645 : Read settings: TaskSettings index: 2
190650 : Read settings: TaskSettings index: 4
190653 : DEBUG DEV: Parsed String='      Publish /%sysname%/NEXTION/idx,26'
RuleDebug: 111: publish /espez_laundry/nextion/idx,26
190664 : Read settings: TaskSettings index: 1
190668 : Read settings: TaskSettings index: 2
190672 : Read settings: TaskSettings index: 4
190676 : DEBUG DEV: Parsed String='      Publish /%sysname%/NEXTION/value,0'
RuleDebug: 111: publish /espez_laundry/nextion/value,0
RuleDebug: 111: endif
RuleDebug: 000: endon
RuleDebug: 100: nextion#idx=98 do
RuleDebug: 101: nextion,page7.t_wifi_ssid.txt="ssid: %ssid%"
RuleDebug: 101: nextion,page7.t_ip.txt="ip: %ip%"
RuleDebug: 101: nextion,page7.t_signal.txt="rssi: [rssi#signal]dbm"
RuleDebug: 101: nextion,page7.t_date.txt="date %sysmonth%:%sysday%:%sysyears%"
RuleDebug: 101: nextion,page7.t_time.txt="time %syshour%:%sysmin%:%syssec% "
RuleDebug: 101: nextion,page7.t_uptime.txt="uptime [runtime#days]days"
190729 : EVENT: NEXTION#idx=26.00 Processing time:233 milliSeconds
190735 : EVENT: NEXTION#value=0.00
RuleDebug Processing:rules1.txt
     flags CMI  parse output:
RuleDebug: 001: endon
RuleDebug: 100: washer#ac do
RuleDebug: 101: if [washer#ac]=1
RuleDebug: 101: nextion,page0.va_washerac.val=1
RuleDebug: 101: else
RuleDebug: 101: nextion,page0.va_washerac.val=0
RuleDebug: 101: endif
RuleDebug: 000: endon
RuleDebug: 100: dryermov#detect do
RuleDebug: 101: if [dryermov#detect]=1
RuleDebug: 101: nextion,page0.va_dryerac.val=1
RuleDebug: 101: else
RuleDebug: 101: nextion,page0.va_dryerac.val=0
RuleDebug: 101: endif
RuleDebug: 000: endon
RuleDebug: 100: nextion#idx do
RuleDebug: 101: if [nextion#idx]>=10 and [nextion#idx]<=30
RuleDebug: 101: publish /%sysname%/nextion/idx,[nextion#idx]
RuleDebug: 101: endif
RuleDebug: 101: if [nextion#idx]>=500
RuleDebug: 101: publish /%sysname%/nextion/idx,[nextion#idx]
RuleDebug: 101: publish /%sysname%/nextion/value,[nextion#value]
RuleDebug: 101: endif
RuleDebug: 000: endon
RuleDebug: 100: nextion#idx=98 do
RuleDebug: 101: nextion,page7.t_wifi_ssid.txt="ssid: %ssid%"
RuleDebug: 101: nextion,page7.t_ip.txt="ip: %ip%"
RuleDebug: 101: nextion,page7.t_signal.txt="rssi: [rssi#signal]dbm"
RuleDebug: 101: nextion,page7.t_date.txt="date %sysmonth%:%sysday%:%sysyears%"
RuleDebug: 101: nextion,page7.t_time.txt="time %syshour%:%sysmin%:%syssec% "
RuleDebug: 101: nextion,page7.t_uptime.txt="uptime [runtime#days]days"
190859 : EVENT: NEXTION#value=0.00 Processing time:124 milliSeconds
190865 : Read settings: ControllerSettings index: 0
190927 : MQTT : /ESPEZ_Laundry/NEXTION/idx 26
190928 : MQTT : /ESPEZ_Laundry/NEXTION/value 0
190930 : NEXTION075 : Pipe Command Sent: |s,i26,sOff26.00
190950 : Read settings: TaskSettings index: 1
190954 : SW   : State 0.00
190956 : EVENT: WASHER#ac=0.00
RuleDebug Processing:rules1.txt
     flags CMI  parse output:
TD-er commented 6 years ago

How do you send the publish commands? Are they commands from a rule?

thomastech commented 6 years ago

Yes the publish actions are in the ESPEasy rule file. To further debug this I've commented them out and restarted my tests. If the reboot appears then the problem is not caused by MQTT publish.

Here's the rule excerpt:

on NEXTION#idx do
  if [NEXTION#idx]>=10 and [NEXTION#idx]<=30
      Publish /%sysname%/NEXTION/idx,[NEXTION#idx]
  endif
  if [NEXTION#idx]>=500  // Touch Events
      Publish /%sysname%/NEXTION/idx,[NEXTION#idx]
      Publish /%sysname%/NEXTION/value,[NEXTION#value]
  endif
endon

I don't use Touch events (idx>=500), so only the first if statement applies to this issue.

thomastech commented 6 years ago

It's been running trouble free for four hours after removing the publish statements from ESPEasy's rule file. So I am back to testing with MQTT publish.

I looked at the core and found the panic function. There's no Serial.flush before the exception, so my theory is that we lose the last buffered serial log messages at the panic reboot. So there is a chance the logs are incomplete before the reboot. Just a theory.

I edited _C005.ino and added a couple more addLog statements with Serial.flush on both sides of the MQTTpublish(). So hopefully that helps push out the debug messages before a panic reboot.

I've restore the rule file's publish statements and I am now restarting the testing.

I shut down my home control system (openhab) to see what would happen to the unanswered MQTT. EasyESP doesn't care, no problems appeared.

Lastly, I've increased the Advanced->Tools->Controller Message Interval from 100mS to 250mS.

I'll collect new logs for a few hours. Maybe the new _C005.ino addLog messages will provide more debug insight.

TD-er commented 6 years ago

This "Controller Message Interval" variable is being used for multiple things and one of them is actually a blocking wait for that period. It does freeze the system for that amount of time. It is also used in places where another variable would make more sense instead of re-use.

So that's one thing I would like to change. It looks like it is meant to be used to give the MQTT broker some time to process requests, but I guess using a buffer at the controller side would make more sense.

This yield panic crash occurs when a yield or delay (which calls yield) is being used in a function that's being called from somewhere else. Normally all functions are being called from either setup() or loop(). But callback functions are called from the core, so their scope does not include access to the yield function to run background tasks.

Yesterday I split the rules processing function to make it more readable. See: https://github.com/TD-er/ESPEasy/commit/be7724d8c7c15582438af34877f100d63ad8efbe Today I start looking at ExecuteCommand to see if there is something that can be handled asynchronous via the scheduler to make it being called from the loop() again.

thomastech commented 6 years ago

@TD-er: Thanks for the explanation. I was aware that you were working on the Controller Message Interval.

My test run last night was interesting. Only one reboot, whereas I would have expected several. Could have been blind luck, or something about my altered test conditions contributed to stability.

But the overnight test's panic reboot provided another clue. Here's the log excerpt:

13901389 : Before MQTT
13901393 : After MQTT
13901395 : MQTT : /ESPEZ_Laundry/RSSI/signal -79
13905306 : NEXTION075 : code: |s,i26,sOff
13905308 : Read settings: TaskSettings index: 4
13905312 : EVENT: NEXTION#idx=26.00
RuleDebug Processing:rules1.txt
     flags CMI  parse output:
RuleDebug: 001: endon
RuleDebug: 100: washer#ac do
RuleDebug: 101: if [washer#ac]=1
RuleDebug: 101: nextion,page0.va_washerac.val=1
RuleDebug: 101: else
RuleDebug: 101: nextion,page0.va_washerac.val=0
RuleDebug: 101: endif
RuleDebug: 000: endon
RuleDebug: 100: dryermov#detect do
RuleDebug: 101: if [dryermov#detect]=1
RuleDebug: 101: nextion,page0.va_dryerac.val=1
RuleDebug: 101: else
RuleDebug: 101: nextion,page0.va_dryerac.val=0
RuleDebug: 101: endif
RuleDebug: 000: endon
RuleDebug: 110: nextion#idx do
13905359 : Read settings: TaskSettings index: 1
13905368 : Read settings: TaskSettings index: 2
13905370 : Read settings: TaskSettings index: 4
13905372 : DEBUG DEV: Parsed String='  if 26'
13905375 : Read settings: TaskSettings index: 1
13905381 : Read settings: TaskSettings index: 2
13905385 : Read settings: TaskSettings index: 4
13905389 : DEBUG DEV: Parsed String='  if 26>=10 and 26'
RuleDebug: 111: if 26>=10 and 26<=30
13905396 : [if 26>=10 and 26<=30]=true
13905401 : Read settings: TaskSettings index: 1
13905406 : Read settings: TaskSettings index: 2
13905410 : Read settings: TaskSettings index: 4
13905414 : DEBUG DEV: Parsed String='      Publish /%sysname%/NEXTION/idx,26'
RuleDebug: 111: publish /espez_laundry/nextion/idx,26
13905424 : ACT  : Publish /ESPEZ_Laundry/NEXTION/idx,26
13905441 : Command: publish

Panic C:\Users\Tom\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.4.1\cores\esp8266\core_esp8266_main.cpp:99 __yield

ctx: cont 
sp: 3fff3e70 end: 3fff47c0 offset: 01b0

>>>stack>>>

The "Before MQTT" and "After MQTT" are addLog message wrappers I placed before and after the MQTTpublish() in _C005.ini. I only mention this because you are probably wondering what they are.

The interesting part is the Command: publish log message immediately before the panic reboot. It is log message that occurs inside misc.ino's ExecuteCommand().

And it's even more interesting is that you are starting to look at ExecuteCommand. So hopefully this rabbit hole journey is taking us in the right direction.

TD-er commented 6 years ago

This afternoon I added some commit just to make the commands return a String stating their return status. That has nothing really to do with this issue, but it gave me time to look at all the commands to see if there is something that cannot be done asynchronous. So I will now try to add an entry to schedule these command executions as a scheduled job. This will change the scheduled job a bit to keep the extra text line and the events in sync, but that's just a minor change.

Hopefully I find enough time this evening to implement it so you can test it. This time zone difference between us is nice to get some tests done :)

thomastech commented 6 years ago

Let me know when your changes are complete and need me to try them out.

TD-er commented 6 years ago

I did change some this morning, but there are still some issues with that, which I will fix.... now ;)

TD-er commented 6 years ago

I just pushed a fix to the existing pull request. Could you (stress) test it?

In short:

Still to do:

thomastech commented 6 years ago

I have it running now. Everything is working fine. I'll keep an eye out for reboots or other odd behavior and report back tomorrow (or sooner if something goes wrong tonight).

FWIW, the updated test branch doesn't seem to have the latest src files. For example, it did not have my recent Nextion.ino. I didn't check any other files for "freshness."

Thank you for your efforts to make the code more stable. It is appreciated.

thomastech commented 6 years ago

After running for about one hour it rebooted. I'll run it again with serial log enabled and capture the exception.

yondom commented 6 years ago

Sent from Yahoo Mail on Android

On Mon, Aug 6, 2018 at 8:25 PM, Thomasnotifications@github.com wrote:
After running for about one hour it rebooted. I'll run it again with serial log enabled and capture the exception.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the ythread.

TD-er commented 6 years ago

@thomastech Could you also test without actively using the publish method in the Nextion plugin? And maybe stress test it also, then I can merge this code and we can have a look at your Nextion code. I didn't have a proper look at it yet.

thomastech commented 6 years ago

@TD-er: My overnight run had 100% success. No reboots, still running (~12 hours). That's a record! I will continue to let it run and leave it alone.

I have a second NodeMCU running without publish in the rules. It has not rebooted, 15 hours so far. So the latest PR has not caused any new stability issues.

TD-er commented 6 years ago

And the 12-hour running version is using Nextion code that was involved before when it crashed?

thomastech commented 6 years ago

Yes, same Nextion code.

During this test I shut down my PC and left the device to run without anything messing with it, including me. Even when I don't touch it, there's plenty of MQTT actions being executed since the device constantly publishes status messages every 20 seconds.

I'm hoping that yesterday's reboot was unrelated to the rule's publish panic problem. But it's too early to know anything. It's best to watch and wait, then check the serial log next time it reboots.

In the meantime, it would be great if you reviewed the Nextion code. Although the panic reboot occurs while in the rules (not in Nextion), maybe something in the Nextion plugin is involved.

TD-er commented 6 years ago

It looks like it is the only plugin calling the ExecuteCommand function directly.

thomastech commented 6 years ago

@TD-er: I have to agree that ExecuteCommand is involved with this issue. If the Nextion module is the only one that directly calls it then that certainly makes it (Nextion) look like a bad guy.

The "12-hour" NodeMCU is still running OK. I've been randomly interacting with it to stress test the code and perhaps provoke a reboot.

thomastech commented 6 years ago

While writing that last comment it did a reboot. This time it was Exception (3) rather than a panic. But still a rule file publish, as before.

Here's a log excerpt leading up to the crash:

44981799 : MQTT : /ESPEZ_Laundry/RSSI/signal -78
44982911 : NEXTION075 : code: |s,i26,sOff
44982916 : EVENT: NEXTION#idx=26.00
44982947 : [if 26>=10 and 26<=30]=true
44982958 : ACT  : Publish /ESPEZ_Laundry/NEXTION/idx,26
44982984 : [if 26>=500]=false
44983015 : EVENT: NEXTION#idx=26.00 Processing time:99 milliSeconds
44983017 : EVENT: NEXTION#value=0.00
44983041 : EVENT: NEXTION#value=0.00 Processing time:24 milliSeconds
44983045 : NEXTION075 : Pipe Command Sent: |s,i26,sOff26.00
44983053 : Command: publish
44989810 : NEXTION075 : code: |s,i98,sOff
44989813 : EVENT: NEXTION#idx=98.00
44989849 : [if 98>=10 and 98<=30]=false
44989871 : [if 98>=500]=false
44989900 : ACT  : NEXTION,page7.t_wifi_ssid.txt="SSID: Wizard1_2G4"
44989951 : NEXTION075 : WRITE, Command is page7.t_wifi_ssid.txt="SSID: Wizard1_2G4"
44989956 : ACT  : NEXTION,page7.t_ip.txt="IP: 192.168.1.217"
44989998 : NEXTION075 : WRITE, Command is page7.t_ip.txt="IP: 192.168.1.217"
44990014 : ACT  : NEXTION,page7.t_signal.txt="RSSI: -78dBm"
44990055 : NEXTION075 : WRITE, Command is page7.t_signal.txt="RSSI: -78dBm"
44990061 : ACT  : NEXTION,page7.t_date.txt="Date 08:07:18"
44990102 : NEXTION075 : WRITE, Command is page7.t_date.txt="Date 08:07:18"
44990106 : ACT  : NEXTION,page7.t_time.txt="Time 08:35:16 "
44990147 : NEXTION075 : WRITE, Command is page7.t_time.txt="Time 08:35:16 "
44990162 : ACT  : NEXTION,page7.t_uptime.txt="Uptime 749.0days"
44990207 : NEXTION075 : WRITE, Command is page7.t_uptime.txt="Uptime 749.0days"
44990209 : EVENT: NEXTION#idx=98.00 Processing time:396 milliSeconds
44990211 : EVENT: NEXTION#value=0.00
44990236 : EVENT: NEXTION#value=0.00 Processing time:24 milliSeconds
44990239 : NEXTION075 : Pipe Command Sent: |s,i98,sOff98.00
44991053 : NEXTION075 : code: |s,i98,sOff
44991056 : EVENT: NEXTION#idx=98.00
44991088 : [if 98>=10 and 98<=30]=false
44991110 : [if 98>=500]=false
44991140 : ACT  : NEXTION,page7.t_wifi_ssid.txt="SSID: Wizard1_2G4"
44991190 : NEXTION075 : WRITE, Command is page7.t_wifi_ssid.txt="SSID: Wizard1_2G4"
44991196 : ACT  : NEXTION,page7.t_ip.txt="IP: 192.168.1.217"
44991239 : NEXTION075 : WRITE, Command is page7.t_ip.txt="IP: 192.168.1.217"
44991254 : ACT  : NEXTION,page7.t_signal.txt="RSSI: -78dBm"
44991299 : NEXTION075 : WRITE, Command is page7.t_signal.txt="RSSI: -78dBm"
44991304 : ACT  : NEXTION,page7.t_date.txt="Date 08:07:18"
44991344 : NEXTION075 : WRITE, Command is page7.t_date.txt="Date 08:07:18"
44991349 : ACT  : NEXTION,page7.t_time.txt="Time 08:35:18 "
44991390 : NEXTION075 : WRITE, Command is page7.t_time.txt="Time 08:35:18 "
44991404 : ACT  : NEXTION,page7.t_uptime.txt="Uptime 749.0days"
44991450 : NEXTION075 : WRITE, Command is page7.t_uptime.txt="Uptime 749.0days"
44991452 : EVENT: NEXTION#idx=98.00 Processing time:396 milliSeconds
44991454 : EVENT: NEXTION#value=0.00
44991481 : EVENT: NEXTION#value=0.00 Processing time:27 milliSeconds
44991484 : NEXTION075 : Pipe Command Sent: |s,i98,sOff98.00
44991501 : SW   : State 0.00
44991504 : EVENT: WASHER#ac=0.00
44991515 : [if 0=1]=false
44991518 : else = true
44991520 : ACT  : NEXTION,page0.va_WasherAC.val=0
44991552 : NEXTION075 : WRITE, Command is page0.va_WasherAC.val=0
44991572 : EVENT: WASHER#ac=0.00 Processing time:68 milliSeconds
44992044 : MQTT : /ESPEZ_Laundry/WASHER/ac 0
45001103 : SW   : State 0.00
45001106 : EVENT: WASHER#ac=0.00
45001117 : [if 0=1]=false
45001120 : else = true
45001121 : ACT  : NEXTION,page0.va_WasherAC.val=0
45001154 : NEXTION075 : WRITE, Command is page0.va_WasherAC.val=0
45001172 : EVENT: WASHER#ac=0.00 Processing time:67 milliSeconds
45001235 : MQTT : /ESPEZ_Laundry/WASHER/ac 0
45001274 : NEXTION075 : Cmd Statement Line-1 Sent: page0.j1.val=45
45001288 : SYS  : 750.00
45001292 : EVENT: RUNTIME#days=750.00
45001319 : EVENT: RUNTIME#days=750.00 Processing time:27 milliSeconds
45001748 : SYS  : -76.00
45001752 : EVENT: RSSI#signal=-76.00
45001776 : EVENT: RSSI#signal=-76.00 Processing time:24 milliSeconds
45002300 : MQTT : /ESPEZ_Laundry/RSSI/signal -76
45002309 : NEXTION075 : code: |s,i16,sOff
45002314 : EVENT: NEXTION#idx=16.00
45002348 : [if 16>=10 and 16<=30]=true
45002359 : ACT  : Publish /ESPEZ_Laundry/NEXTION/idx,16
45002389 : [if 16>=500]=false
45002425 : EVENT: NEXTION#idx=16.00 Processing time:111 milliSeconds
45002426 : EVENT: NEXTION#value=0.00
45002451 : EVENT: NEXTION#value=0.00 Processing time:25 milliSeconds
45002807 : NEXTION075 : Pipe Command Sent: |s,i16,sOff16.00
45002816 : WD   : Uptime 750 ConnectFailures 0 FreeMem 14064
45002823 : Command: publish
45002901 : LoopStats: shortestLoop: 34 longestLoop: 1011174 avgLoopDuration: 59.82 loopCounterMax: 882352 loopCounterLast: 483258 countFindPluginId: 0
45002904 : PluginStats P_1_Switch input - Switch READ                 Count: 3 Avg/min/max 476.00/467/485 usec
45002913 : PluginStats P_1_Switch input - Switch ONCE_A_SECOND        Count: 30 Avg/min/max 18.73/11/30 usec
45002923 : PluginStats P_1_Switch input - Switch TEN_PER_SECOND       Count: 265 Avg/min/max 37.36/24/68 usec
45002933 : PluginStats P_1_Switch input - Switch WRITE                Count: 17 Avg/min/max 527.47/440/662 usec
45002942 : PluginStats P_1_Switch input - Switch EVENT_OUT            Count: 10 Avg/min/max 45.40/31/62 usec
45002952 : PluginStats P_1_Switch input - Switch FIFTY_PER_SECOND     Count: 1280 Avg/min/max 14.63/10/43 usec
45002962 : PluginStats P_26_Generic - System Info READ                 Count: 3 Avg/min/max 519.00/478/573 usec
45002972 : PluginStats P_26_Generic - System Info ONCE_A_SECOND        Count: 60 Avg/min/max 12.72/3/28 usec
45002981 : PluginStats P_26_Generic - System Info TEN_PER_SECOND       Count: 530 Avg/min/max 11.19/3/27 usec
45002991 : PluginStats P_26_Generic - System Info WRITE                Count: 4 Avg/min/max 10.25/3/19 usec
45003000 : PluginStats P_26_Generic - System Info EVENT_OUT            Count: 20 Avg/min/max 10.50/3/19 usec
45003010 : PluginStats P_26_Generic - System Info FIFTY_PER_SECOND     Count: 2560 Avg/min/max 7.72/2/23 usec
45003019 : PluginStats P_44_Gyro - MPU 6050 [TESTING] READ                 Count: 1 Avg/min/max 80.00/80/80 usec
45003029 : PluginStats P_44_Gyro - MPU 6050 [TESTING] ONCE_A_SECOND        Count: 30 Avg/min/max 2123.13/1951/3778 usec
45003040 : PluginStats P_44_Gyro - MPU 6050 [TESTING] TEN_PER_SECOND       Count: 265 Avg/min/max 15.00/11/19 usec
45003050 : PluginStats P_44_Gyro - MPU 6050 [TESTING] WRITE                Count: 17 Avg/min/max 17.71/14/21 usec
45003060 : PluginStats P_44_Gyro - MPU 6050 [TESTING] EVENT_OUT            Count: 10 Avg/min/max 15.10/3/22 usec
45003070 : PluginStats P_44_Gyro - MPU 6050 [TESTING] FIFTY_PER_SECOND     Count: 1280 Avg/min/max 10.84/7/19 usec
45003080 : PluginStats P_71_Display - Nextion [TESTING] READ                 Count: 1 Avg/min/max 21004.00/21004/21004 usec
45003091 : PluginStats P_71_Display - Nextion [TESTING] ONCE_A_SECOND        Count: 30 Avg/min/max 13.33/7/29 usec
45003101 : PluginStats P_71_Display - Nextion [TESTING] TEN_PER_SECOND       Count: 265 Avg/min/max 5784.80/83/502502 usec
45003111 : PluginStats P_71_Display - Nextion [TESTING] WRITE                Count: 17 Avg/min/max 33955.77/572/46868 usec
45003122 : PluginStats P_71_Display - Nextion [TESTING] EVENT_OUT            Count: 10 Avg/min/max 23.10/11/31 usec
45003132 : PluginStats P_71_Display - Nextion [TESTING] FIFTY_PER_SECOND     Count: 1280 Avg/min/max 4.32/3/27 usec
45003142 : Load File stats: Count: 144 Avg/min/max 2990.17/810/55416 usec
45003149 : Plugin call 50 p/s   stats: Count: 1280 Avg/min/max 1988.25/1879/2330 usec
45003156 : Plugin call 10 p/s   stats: Count: 265 Avg/min/max 7776.78/2020/504623 usec
45003164 : Plugin call 10 p/s U stats: Count: 265 Avg/min/max 3536.26/3126/4121 usec
45003172 : Plugin call  1 p/s   stats: Count: 30 Avg/min/max 5612.77/4338/33045 usec
45003179 : SensorSendTask()     stats: Count: 8 Avg/min/max 297917.50/3997/558267 usec
45003187 : sendData()           stats: Count: 10 Avg/min/max 381084.50/133337/552572 usec
45003194 : Compute formula      stats: Count: 6 Avg/min/max 693.83/15/1393 usec
45003202 : setNewTimerAt()      stats: Count: 1716 Avg/min/max 204.25/181/1176 usec
45003209 : timeDiff()           stats: Count: 1639431 - CPU cycles per call: 8.46
45003216 : Scheduler stats: (called/tasks/max_length/idle%) 483258/1709/13/68.90
45011103 : SW   : State 0.00
45011106 : EVENT: WASHER#ac=0.00
45011117 : [if 0=1]=false
45011120 : else = true
45011121 : ACT  : NEXTION,page0.va_WasherAC.val=0
45011155 : NEXTION075 : WRITE, Command is page0.va_WasherAC.val=0
45011174 : EVENT: WASHER#ac=0.00 Processing time:69 milliSeconds
45011235 : MQTT : /ESPEZ_Laundry/WASHER/ac 0
45012033 : NEXTION075 : code: |s,i26,sOff
45012039 : EVENT: NEXTION#idx=26.00
45012072 : [if 26>=10 and 26<=30]=true
45012083 : ACT  : Publish /ESPEZ_Laundry/NEXTION/idx,26
45012109 : [if 26>=500]=false
45012143 : EVENT: NEXTION#idx=26.00 Processing time:103 milliSeconds
45012144 : EVENT: NEXTION#value=0.00
45012169 : EVENT: NEXTION#value=0.00 Processing time:25 milliSeconds
45012173 : NEXTION075 : Pipe Command Sent: |s,i26,sOff26.00
45012186 : Command: publish
45021104 : SW   : State 0.00
45021107 : EVENT: WASHER#ac=0.00
45021118 : [if 0=1]=false
45021121 : else = true
45021122 : ACT  : NEXTION,page0.va_WasherAC.val=0
45021155 : NEXTION075 : WRITE, Command is page0.va_WasherAC.val=0
45021174 : EVENT: WASHER#ac=0.00 Processing time:67 milliSeconds
45021236 : MQTT : /ESPEZ_Laundry/WASHER/ac 0
45021297 : SYS  : -78.00
45021301 : EVENT: RSSI#signal=-78.00
45021324 : EVENT: RSSI#signal=-78.00 Processing time:23 milliSeconds
45021798 : MQTT : /ESPEZ_Laundry/RSSI/signal -78
45022010 : NEXTION075 : code: |s,i16,sOff
45022016 : EVENT: NEXTION#idx=16.00
45022052 : [if 16>=10 and 16<=30]=true
45022064 : ACT  : Publish /ESPEZ_Laundry/NEXTION/idx,16

Exception (3):
epc1=0x401003e9 epc2=0x00000000 epc3=0x40000e4a excvaddr=0x4007c1f9 depc=0x00000000

ctx: cont 
sp: 3fff3bc0 end: 3fff4780 offset: 01a0

>>>stack>>>
3fff3d60:  3fff3668 000007f6 000007f6 4010020c  
3fff3d70:  ffffffff 3ffe91bc 3fff3e5c 3fff3668  
3fff3d80:  00000010 00000010 00000000 4010053d  
3fff3d90:  ffffffff 3ffeaca0 3fff3f9c 4021b777  
3fff3da0:  3fff3668 000007fb 000007fb 00000000  
3fff3db0:  00000010 3fff3e1c 3fff3e1c 4025f86f  
3fff3dc0:  3ffeaca0 00000000 3fff3e1c 4025f8bb  
3fff3dd0:  00000022 00000000 3fff3e1c 4025f8ed  
3fff3de0:  00000001 3fff4114 3fff3e1c 4025f93a  
3fff3df0:  3ffeaca0 00000010 00000000 00000000  
3fff3e00:  00000001 3fff4114 3fff3fac 4020caca  
3fff3e10:  00000000 00000001 3fff3e80 00000000  
3fff3e20:  00000000 00000000 00000000 4010053d  
3fff3e30:  00000008 00000000 3fff3e80 00000023  
3fff3e40:  3fff3660 000007ee 000007ee 0000002d  
3fff3e50:  0000000d 3fff4114 3fff3fac 4020cb36  
3fff3e60:  0000000d 3fff4114 3fff40a0 40225962  
3fff3e70:  3fff3668 000007ee 000007ee 4010020c  
3fff3e80:  3fff4114 00000000 3fff3fa8 40100690  
3fff3e90:  00000010 3fff40a0 3fff40a0 4025f814  
3fff3ea0:  3fff4114 3fff40a0 3fff40a0 40236d3c  
3fff3eb0:  00000000 00000000 00000000 00000000  
3fff3ec0:  00000000 00000000 49545845 692f4e4f  
3fff3ed0:  312c7864 00000036 00000060 40000000  
3fff3ee0:  00000000 00000700 00000000 00000000  
3fff3ef0:  3fff8c00 00000000 00000000 40100600  
3fff3f00:  00000000 00000000 00000700 00000007  
3fff3f10:  402a1058 00000006 00000000 3fff3668  
3fff3f20:  00000000 00000010 00000000 0000053d  
3fff3f30:  3fff8c00 00000022 3fff3f00 4025f860  
3fff3f40:  00000000 00000000 3fff3fb0 00000027  
3fff3f50:  c029f1cb 3fff3fb0 3fff3fb0 4025f86f  
3fff3f60:  4029f1cc 0000000b 3fff3fb0 00000027  
3fff3f70:  3fff3668 000007ee 000007ee 4010020c  
3fff3f80:  00000000 00000000 3fff3fbc 40100690  
3fff3f90:  00000000 00000000 3fff40a0 4025f814  
3fff3fa0:  00000000 00000000 00000000 00000000  
3fff3fb0:  00000000 00000000 00000000 00000000  
3fff3fc0:  00000000 00000000 00000000 00000000  
3fff3fd0:  00000000 00000000 00000000 0000002d  
3fff3fe0:  3fff4114 3fff40a0 0000000d 4021780d  
3fff3ff0:  00000600 4c5f1800 00000000 4e2f7901  
3fff4000:  00000000 00000010 00000000 00000000  
3fff4010:  00000000 00000000 3fff8b3c 0000000f  
3fff4020:  00000000 3fff8b54 0000000f 00000000  
3fff4030:  3fff8b6c 0000000f 00000000 3fff8be4  
3fff4040:  0000000f 00000000 3fff8bfc 0000000f  
3fff4050:  00000000 00000000 0000032d 4025f8bb  
3fff4060:  312c7864 00000036 3fff4114 4025f8ed  
3fff4070:  0000000b 00000000 3fff26d2 00000006  
3fff4080:  3fff417c 3fff4188 3fff243d 3fff4300  
3fff4090:  3fff417c 3fff4188 3fff243d 40242a20  
3fff40a0:  00000c00 00000000 00000000 40206c00  
3fff40b0:  00000000 00000010 00000000 00000000  
3fff40c0:  00000000 00000000 3fff8aec 0000000f  
3fff40d0:  00000000 3fff8b84 0000000f 00000000  
3fff40e0:  3fff8b9c 0000000f 00000000 3fff8bb4  
3fff40f0:  0000000f 00000000 3fff8bcc 0000000f  
3fff4100:  00000000 00000000 00000000 00000000  
3fff4110:  00000000 3fff8b04 0000002f 00000025  
3fff4120:  00000000 ffffffff 3fff4170 40100690  
3fff4130:  3fff427e 3fff427d 3fff4260 00000000  
3fff4140:  3fff427e 3fff427d 3fff4260 40242cfc  
3fff4150:  3fff427c 3fff427b 3fff427a 3fff4279  
3fff4160:  3fff4278 00000000 3fff4254 4025f8bb  
3fff4170:  00000000 00000000 00000000 3fff8a7c  
3fff4180:  0000002f 00000025 3fff8a44 0000002f  
3fff4190:  00000025 3fff8954 0000000f 00000000  
3fff41a0:  3fff8a0c 0000002f 00000025 4010020c  
3fff41b0:  3fff427c 3fff4300 00000001 3fff426c  
3fff41c0:  3ffeaca0 3fff0f64 3fff42b0 3ffeaca0  
3fff41d0:  00000009 3fff0f64 3fff42b0 40242f19  
3fff41e0:  3fff427b 3fff427a 3fff4279 3fff4278  
3fff41f0:  234e4f49 5d786469 20200a0d 69646e65  
3fff4200:  200a0d66 20666920 58454e5b 4e4f4954  
3fff4210:  78646923 353d3e5d 20203030 54202f2f  
3fff4220:  6863756f 65764520 2073746e 6e655328  
3fff4230:  40107b10 00000000 000003e8 4025ea81  
3fff4240:  3fff896c 3fff89ac 00000000 00000000  
3fff4250:  00000000 00000000 00000000 00000000  
3fff4260:  3fff89c4 0000003f 00000025 3fff8ac4  
3fff4270:  0000001f 0000001b 01010101 40010101  
3fff4280:  00000040 3fff4300 00000001 4025f9f0  
3fff4290:  3fff0031 3fff243d 00000001 02aefb40  
3fff42a0:  3fff4300 3fff243d 00000001 4024309b  
3fff42b0:  00000000 00000000 00000000 3fff893c  
3fff42c0:  0000000f 0000000a 3fff8924 0000000f  
3fff42d0:  0000000a 00000030 3fff4300 4025fe98  
3fff42e0:  3fff25c2 00000000 00000000 00000005  
3fff42f0:  00000010 00000004 00000000 40243c85  
3fff4300:  3fff853c 0000001f 00000011 4010020c  
3fff4310:  3fff23e8 3fff0e78 3fff4340 40100690  
3fff4320:  00000000 00000000 00000000 00000011  
3fff4330:  3fff1b8c 3fff26cc 3fff4640 40245559  
3fff4340:  00000000 00000000 00000000 00000011  
3fff4350:  3fff1b8c 3fff0e78 3fff4640 40265c0c  
3fff4360:  7b856b8a 3fff0e78 3fff4590 4026369c  
3fff4370:  3fff0030 3fff0e78 3fff45d8 00000011  
3fff4380:  3fff1b8c 3fff0e78 3fff4640 4024b323  
3fff4390:  692c737c 732c3631 0066664f 4010020c  
3fff43a0:  ffffffff 3ffe91bc 3fff43d0 40100690  
3fff43b0:  3fff3668 00000707 00000707 4010020c  
3fff43c0:  ffffffff 3ffe91bc 3fff44ac 40100690  
3fff43d0:  00000000 00000000 0000001f 40104ec9  
3fff43e0:  4000050c 3ffeaca0 3fff45ec 4021b777  
3fff43f0:  40000f83 00000030 00000000 ffffffff  
3fff4400:  4028679e 00000023 3fffbff8 00000e6b  
3fff4410:  000006db 00000e6a 00000001 3fff8920  
3fff4420:  00000080 3fff61e4 00000001 3fff4b8c  
3fff4430:  3fff3668 00000000 00007358 00000030  
3fff4440:  00000180 3fff26cc 3fff4628 40100690  
3fff4450:  00000000 00000000 3fff5c14 40214fc2  
3fff4460:  ffffd570 3fff26cc 3fff4600 3fff3668  
3fff4470:  00000020 00000020 00000000 3fff44b0  
3fff4480:  3fff1b24 3fff4488 3fff4690 40215084  
3fff4490:  3fff4490 3fff4490 3fff4498 3fff4498  
3fff44a0:  3fff44a0 3fff44a0 3fff44a8 3fff44a8  
3fff44b0:  3fff44b0 3fff44b0 3fff44b8 3fff44b8  
3fff44c0:  3fff44c0 3fff44c0 3fff44c8 3fff44c8  
3fff44d0:  3fff44d0 3fff44d0 3fff44d8 3fff44d8  
3fff44e0:  3fff44e0 3fff44e0 3fff44e8 401071ee  
3fff44f0:  3fff44f0 3fff44f0 3fff4640 4021053d  
3fff4500:  3fff4500 3fff4500 3fff4508 40245e9a  
3fff4510:  3fff4510 3fff4510 3fff4518 3fff4518  
3fff4520:  3fff4520 3fff4520 3fff4528 3fff4528  
3fff4530:  00000000 0000070c 0000070c 4010020c  
3fff4540:  3fff3668 0000070c 0000070c 00000004  
3fff4550:  3fff3670 00000000 3fff16f8 402867c4  
3fff4560:  00000014 00003838 00000707 40207b90  
3fff4570:  00000010 00000010 3fff1760 402901ab  
3fff4580:  4025e044 3fff4640 3fff1760 4025f9f0  
3fff4590:  00000000 00000000 00000000 4020b7b8  
3fff45a0:  3fff3668 00000707 00000707 4010020c  
3fff45b0:  3fff192c 3fff4640 3fff45e0 40100690  
3fff45c0:  3fff8524 0000000f 0000000b 3fff850c  
3fff45d0:  0000000f 00000000 3fff84f4 0000000f  
3fff45e0:  00000001 3fff84dc 0000000f 00000003  
3fff45f0:  3fff84c4 0000000f 00000002 3fff859c  
3fff4600:  0000001f 0000001e 00000000 4025f814  
3fff4610:  3fff43cf c000bc71 4029b2f8 4020bd3a  
3fff4620:  00000000 00000000 00000000 00000004  
3fff4630:  3fff192c 3fff4640 00000005 40217965  
3fff4640:  00000400 3fff1000 00000000 3fff4605  
3fff4650:  00000000 00000000 00000000 00000000  
3fff4660:  00000000 3fff4604 3fff82a4 0000000f  
3fff4670:  00000000 3fff8254 0000000f 00000000  
3fff4680:  3fff8634 0000000f 00000000 3fff82d4  
3fff4690:  0000000f 00000000 3fff8584 0000000f  
3fff46a0:  00000000 00000000 00000565 4020d7df  
3fff46b0:  02aefb98 3fff1b24 3fff6308 402156b8  
3fff46c0:  00000046 7b855899 3fff26d0 00000030  
3fff46d0:  7b8552fd 0000000b 02aefb98 02aefb34  
3fff46e0:  7b8553cb 3fff192c 3ffea808 40243689  
3fff46f0:  00000001 3fff1b04 00000002 40215781  
3fff4700:  02aefb98 3fff1b04 3fff5ce4 02aefb34  
3fff4710:  00000001 3fff1b04 3ffea808 40250793  
3fff4720:  00000000 00000000 3fff5ce4 4025087e  
3fff4730:  3fffdad0 00000000 3fff1914 3fff3750  
3fff4740:  3fffdad0 00000000 3fff1914 40250d95  
3fff4750:  00000000 00000000 00000001 4026088d  
3fff4760:  3fffdad0 00000000 3fff374a 402608b8  
3fff4770:  feefeffe feefeffe 3fff3760 40100710  
<<<stack<<<

 ets Jan  8 2013,rst cause:2, boot mode:(3,7)

load 0x4010f000, len 1384, room 16 
tail 8
chksum 0x2d
csum 0x2d
v614f7c32
~ld
ªU105 : 

INIT : Booting version: (custom) (ESP82xx Core 2_4_1, NONOS SDK 2.2.1(cfd48f3), LWIP: 2.0.3)
106 : INIT : Warm boot #2 - Restart Reason: Exception
110 : FS   : Mounting...
184 : FS   : Mount successful, used 77057 bytes of 2949250
227 : CRC  : No program memory checksum found. Check output of crc2.py
276 : CRC  : SecuritySettings CRC   ...OK 
553 : INIT : Free RAM:21280
554 : INIT : I2C
554 : INIT : SPI not enabled
568 : NEXTION075 : serial pin config RX:13, TX:15, Plugin Enabled
568 : NEXTION075 : Using software serial
581 : INFO : Plugins: 71 [Normal] [Testing] (ESP82xx Core 2_4_1, NONOS SDK 2.2.1(cfd48f3), LWIP: 2.0.3)
582 : EVENT: System#Wake
607 : WIFI : Set WiFi to STA
TD-er commented 6 years ago

Do you also happen to trace/log the memory usage?

thomastech commented 6 years ago

@TD-er: No, but I can do that. Do you want me to use the Arduino IDE monitor and install this : https://github.com/me-no-dev/EspExceptionDecoder

TD-er commented 6 years ago

That would be great. I probably can't decode your stack trace, since it is your own build

thomastech commented 6 years ago

If you can't decode it then won't the stack trace be a waste of time? If you mean you want me to decipher it then we're gonna need a bigger boat.

Stepping back a bit, this latest reboot is a bit out of the ordinary. That's because the most common crash on my build is a Panic with Restart Reason: Software Watchdog. But the latest crash/reboot (using your new PR) was an Exception 3 with Restart Reason: Exception. I've seen those before while debugging this problem, but they have been uncommon.

Google pointed to some discussions that suggested that sometimes power supply issues can cause this kind of crash. I've moved the NodeMCU off my PC's USB port and it's now running on the dedicated supply that I normally use.

Let's give it some time to see what happens. If the crash still occurs then the trace/log memory usage is next.

TD-er commented 6 years ago

Power supply can indeed cause these issues. But also a failed memory allocation, that's why I asked for tracking memory usage to see if it is decreasing over time

thomastech commented 6 years ago

@TD-er: Does ESPEasy have a built-in feature for tracking memory usage?

Or do I need to use a special tool like the one I mentioned earlier? Do you have a better recommendation that works with Arduino IDE?

TD-er commented 6 years ago

You can add a general sysinfo plugin and set it to log memory usage.

Just a espeasy plugin ;)

thomastech commented 6 years ago

Just a espeasy plugin ;)

That's the ticket. Installed and I'm now monitoring memory. Only 14K bytes and dips to 12K on occasion; Wiki says >20K is ideal.

thomastech commented 6 years ago

There was a reboot while I was watching the memory. Never got below 12K.

I'm moving it back to my PC so I can gather the next reboot's information with serial log.

TD-er commented 6 years ago

This recommendation of 20k is from before the new webserver we're now using. About a year ago, the entire web page was rendered in memory, which needed a String containing the entire page (up-to 11 k) Now we render the entire page in 400 bytes of memory :) (@s0170071 made this happen)

The number of included plugins also has impact on the amount of free memory. The left-over free memory also decreased over time with newer core libraries.

With "normal" builds, you get around 15k free. The events I buffer for the commands are also quite huge, which I should also have a look at. And we're still holding a lot in memory about the present plugins. A lot of that information can be loaded from flash at runtime. (flash is about as fast as RAM when reading on ESP8266)

thomastech commented 6 years ago

@TD-er: Thanks for the clarification on memory requirements. My conclusion is that the publish related crash/reboot isn't a dirty power or low memory issue.

Other than performing tests, I'm not much use to you. If you need me to do anything special then please let me know.

TD-er commented 6 years ago

Could you make it less hot over here? That would be great. :) They were pouring salt on the roads (something you normally see in winter), to protect them from melting.

So if you want to help... please do that, will you? ;)

TD-er commented 6 years ago

I made a build of the latest merges, maybe you could test one of those? It will be the same as tomorrow's, unless I will merge something else too ;)

thomastech commented 6 years ago

My USA state is on fire (literally). And also suffering from high temperature weather. So hot salty roads seem like a dream paradise to me.

Regarding your Generic HTTP Advanced crash #1536, my home automation system sometimes responds back to ESPEasy's publish actions using HTTP. It also uses HTTP messages to ESPEasy during remotely activated features that use Alexa voice actions.

I think it is important to mention that during the most recent crash I was testing some functions that involved HTTP command messages to ESPEasy. It was part of my stress testing.

EDIT: During last night's successful 12-hour run I didn't do any actions that would have involved a HTTP.

Does any of this sound like it could be related to your #1536. Is it possible there are multiple reasons for the crash/reboots and we're chasing all the symptoms like a cat chasing a laser pointer?

I made a build of the latest merges, maybe you could test one of those?

I will test your new build.

TD-er commented 6 years ago

@thomastech Hmm fires is indeed a bit worse. We (fortunately) only had a few and today a big one. @Grovkillen also had a lot of (really big) fires in Sweden. (they bombed one to fight the fire)

And I think the issue reported in #1536 indeed may also need to have a look at.

thomastech commented 6 years ago

Looks like we're all suffering in similar ways; Fixing software bugs is a welcome distraction.

I've flashed with your ESP_Easy_mega-20180804-11-g7783cf4_test_ESP8266_4096.bin. It reverted to AP mode and I had to reload config and rules. The "20180804" date in the name seems odd, so hopefully I used the correct file.

I'm doing the usual testing. If there's anything special you want me to do then let me know.

TD-er commented 6 years ago

I built it on my "Bash on Windows" environment using almost the same script as being used during the nightly builds. That one bases the name on the last tag and I just fetched the repo. 20180804 is the last tag.

thomastech commented 6 years ago

@TD-er: The latest test run lasted a little over 4 hours before a crash/reboot. This time it was Restart Reason: Hardware Watchdog See attached log, time 60189 for the start of the crash.

At the beginning of the test the memory was >14K bytes. It slowly decreased and towards the end it was hovering around 10K. The memory values are in the log file as entries marked with EVENT: FreeRam#ram. So besides the reboot issue there appears to be a memory leak.

log3.zip

thomastech commented 6 years ago

@TD-er: After the previous reboot I restarted the test. This time I disabled all logs, switched Nextion from softserial to UART serial, avoided all HTTP access to ESPEasy, and ran on a dedicated power supply. It's been running for >13 hours without any issues. The memory leak is gone too.

My second NodeMCU test system has been running for 40 hours. No memory leak. Log is disabled, Nextion is UART serial, MQTT is disabled, and no HTTP. This system is really not doing much activity, other than breathing. Not an ideal reference since it isn't being stressed. But still useful info.

I have not experienced a "Panic" reboot since installing the scheduler PR firmware. Going down the list of possible culprits to the recent reboots, I recall reading a post about stability issues caused by enabling serial log. Is that still a concern?

TD-er commented 6 years ago

Serial log may still act strange, when there is no reader. I am not sure what's happening then. It may be something stalling and thus causing a software/hardware watchdog reboot, or just filling memory. I don't know.

thomastech commented 6 years ago

@TD-er: That's good to know. I'm still suspicious of Serial log, so I'll do further investigation.

In the meantime I've turned Nextion Soft Serial back on. It's been running for about an hour. No reboots or memory leaks yet. I'll let it run for a few more hours and if all is well I'll move down the suspect list.

Lastly, you mentioned that Nextion is the only plugin that directly calls ExecuteCommand(). I had forgotten that early-on in my Nextion hardware serial update I removed that function because it appeared unnecessary. Was removing ExecuteCommand() a bad idea and perhaps the reason for all this trouble? Or accidental genius? :)

TD-er commented 6 years ago

Was removing ExecuteCommand() a bad idea and perhaps the reason for all this trouble? Or accidental genius? :)

I guess the best answer to that question is: "yes" ;)

thomastech commented 6 years ago

I guess the best answer to that question is: "yes" ;)

I see what you did there. I will ask my Magic 8-ball for a more decisive opinion.

The test system is still running (~17 hours) and no memory leaks. Yet.

thomastech commented 6 years ago

@TD-er: The test system is still running OK. Next I will turn on Serial log to see if it contributes to reboots or memory leaks.

I did an experiment with serial log on my second test device. It found evidence that the ESP8266's serial Tx buffer is being over-filled, which is something that can cause code blocking. My research (google) found that this is a situation that can cause a ESP8266 crash.

The experiment was simple. I added a Serial.availableForWrite() to the addToLog() function (in misc.ino). It checks to see how much room is left in the Tx buffer. If space is insufficient it trims the log string so that it can fit in the buffer (without causing a "full" condition).

This is the test code (EDIT: updated):

void addToLog(byte logLevel, const char *line)
{
  if (loglevelActiveFor(LOG_TO_SERIAL, logLevel)) {
    Serial.print(millis());
    Serial.print(F(" : "));
    if(strlen(line)+20 >  (uint16_t)(Serial.availableForWrite())) {
        uint16_t snip = (uint16_t)(Serial.availableForWrite());
        if(snip <22) snip = 22;
        String linestr = line;
        Serial.print(linestr.substring(0,snip-22));
        Serial.println(F(",snip!"));             // Sorry, but no room for your entire log message.
    }
    else {
        Serial.print(F("["));
        Serial.print(String(Serial.availableForWrite()));
        Serial.print(F("], "));
        Serial.println(line);
    }
  }
  if (loglevelActiveFor(LOG_TO_SYSLOG, logLevel)) {
    syslog(logLevel, line);
  }
  if (loglevelActiveFor(LOG_TO_WEBLOG, logLevel)) {
    Logging.add(logLevel, line);
  }

#ifdef FEATURE_SD
  if (loglevelActiveFor(LOG_TO_SDCARD, logLevel)) {
    File logFile = SD.open("log.dat", FILE_WRITE);
    if (logFile)
      logFile.println(line);
    logFile.close();
  }
#endif
}

I'm seeing periodic log entries that are being trimmed. So that means the native log message would have filled the buffer and caused a temporary block.

The log messages from the plugins are not a problem. The offending messages are from the system (PluginStats, Plugin call, etc.). Here's some examples, where trimmed messages end with snip!

20834 : [120], NEXTION075 : Cmd Statement Line-1 Sent: page0.j1.val=45
20856 : [120], SYS  : -79.00
20860 : [120], EVENT: RSSI#signal=-79.00
20880 : [120], EVENT: RSSI#signal=-79.00 Processing time:19 milliSeconds
20958 : [120], MQTT : /ESPEZ_Laundry/RSSI/signal -79
21094 : [120], SYS  : 15040.00
21096 : [116], EVENT: FreeRam#ram=15040.00
21117 : [120], EVENT: FreeRam#ram=15040.00 Processing time:21 milliSeconds
25449 : Current Time Zone:  DST time start: 2018-03-25 02:00:00 offset: -420 minSTD time start: 2018-10-28 02:00:00 offset: -480 min,snip!
25451 : EVENT: Time#Initialized,snip!
25473 : [120], EVENT: Time#Initialized Processing time:23 milliSeconds
25478 : [99], EVENT: Clock#Time=Wed,13:55
25498 : [120], EVENT: Clock#Time=Wed,13:55 Processing time:20 milliSeconds
26040 : [120], SYS  : 15152.00
26043 : [116], EVENT: FreeRam#ram=15152.00
26063 : [120], EVENT: FreeRam#ram=15152.00 Processing time:19 milliSeconds
30668 : [120], SW   : State 0.00
30670 : [114], EVENT: WASHER#ac=0.00
30678 : [120], [if 0=1]=false
30681 : [120], else = true
30682 : [109], ACT  : NEXTION1,page0.va_WasherAC.val=0
30714 : [120], NEXTION075 : WRITE, Command is page0.va_WasherAC.val=0
30732 : [120], EVENT: WASHER#ac=0.00 Processing time:62 milliSeconds
30793 : [120], MQTT : /ESPEZ_Laundry/WASHER/ac 0
31095 : [120], SYS  : 15040.00
31097 : [116], EVENT: FreeRam#ram=15040.00
31117 : [120], EVENT: FreeRam#ram=15040.00 Processing time:20 milliSeconds
31971 : [120], WD   : Uptime 1 ConnectFailures 0 FreeMem 15320
32863 : LoopStats: shortestLoop: 36 longestLoop: 1757031 avgLoopDuration: 54.40 loopCounterMax: 833333 loopCounterLast: 527198 countFindPluginId: 0,snip!
32865 : PluginStats P_1_Switch input - Switch READ                 Count: 3 Avg/min/max 592.67/591/594 usec,snip!
32875 : PluginStats P_1_Switch input - Switch ONCE_A_SECOND        Count: 30 Avg/min/max 22.80/15/66 usec,snip!
32885 : PluginStats P_1_Switch input - Switch TEN_PER_SECOND       Count: 270 Avg/min/max 42.77/32/68 usec,snip!
32895 : PluginStats P_1_Switch input - Switch WRITE                Count: 3 Avg/min/max 515.00/514/517 usec,snip!
32905 : PluginStats P_1_Switch input - Switch EVENT_OUT            Count: 10 Avg/min/max 45.50/35/51 usec,snip!
32915 : PluginStats P_1_Switch input - Switch FIFTY_PER_SECOND     Count: 1319 Avg/min/max 19.82/14/45 usec,snip!
32925 : PluginStats P_26_Generic - System Info READ                 Count: 7 Avg/min/max 698.14/552/748 usec,snip!
32935 : PluginStats P_26_Generic - System Info ONCE_A_SECOND        Count: 90 Avg/min/max 7.30/3/23 usec,snip!
32944 : PluginStats P_26_Generic - System Info TEN_PER_SECOND       Count: 810 Avg/min/max 7.13/3/23 usec,snip!
32954 : PluginStats P_26_Generic - System Info EVENT_OUT            Count: 30 Avg/min/max 7.43/2/19 usec,snip!
32964 : PluginStats P_26_Generic - System Info FIFTY_PER_SECOND     Count: 3957 Avg/min/max 6.05/2/23 usec,snip!
32974 : PluginStats P_71_Display - Nextion [TESTING] READ                 Count: 1 Avg/min/max 21051.00/21051/21051 usec,snip!
32985 : PluginStats P_71_Display - Nextion [TESTING] ONCE_A_SECOND        Count: 30 Avg/min/max 8.60/4/23 usec,snip!
32995 : PluginStats P_71_Display - Nextion [TESTING] TEN_PER_SECOND       Count: 270 Avg/min/max 88.16/78/124 usec,snip!
33006 : PluginStats P_71_Display - Nextion [TESTING] WRITE                Count: 3 Avg/min/max 29707.33/29694/29719 usec,snip!
33017 : PluginStats P_71_Display - Nextion [TESTING] EVENT_OUT            Count: 10 Avg/min/max 27.20/26/31 usec,snip!
33027 : PluginStats P_71_Display - Nextion [TESTING] FIFTY_PER_SECOND     Count: 1319 Avg/min/max 4.45/2/31 usec,snip!
33038 : Load File stats: Count: 20 Avg/min/max 22521.10/854/56695 usec,snip!
33044 : Plugin call 50 p/s   stats: Count: 1319 Avg/min/max 1864.93/1486/3664 usec,snip!
33052 : Plugin call 10 p/s   stats: Count: 270 Avg/min/max 1946.29/1615/2321 usec,snip!
33060 : Plugin call 10 p/s U stats: Count: 270 Avg/min/max 3395.21/2899/4695 usec,snip!
33068 : Plugin call  1 p/s   stats: Count: 30 Avg/min/max 64210.20/2101/1756522 usec,snip!
33076 : SensorSendTask()     stats: Count: 11 Avg/min/max 78003.37/24929/131677 usec,snip!
33084 : sendData()           stats: Count: 10 Avg/min/max 63515.10/24490/127186 usec,snip!
33092 : Compute formula      stats: Count: 10 Avg/min/max 139.00/10/1295 usec,snip!
33099 : setNewTimerAt()      stats: Count: 1759 Avg/min/max 187.72/158/283 usec,snip!
33106 : timeDiff()           stats: Count: 1666387 - CPU cycles per call: 8.44,snip!
33114 : Scheduler stats: (called/tasks/max_length/idle%) 527198/1751/12/74.60,snip!
36040 : [120], SYS  : 15152.00
36042 : [116], EVENT: FreeRam#ram=15152.00
36064 : [120], EVENT: FreeRam#ram=15152.00 Processing time:22 milliSeconds
40668 : [120], SW   : State 0.00
40671 : [114], EVENT: WASHER#ac=0.00
40679 : [120], [if 0=1]=false
40681 : [120], else = true
40683 : [109], ACT  : NEXTION1,page0.va_WasherAC.val=0
40717 : [120], NEXTION075 : WRITE, Command is page0.va_WasherAC.val=0
40734 : [120], EVENT: WASHER#ac=0.00 Processing time:63 milliSeconds
40795 : [120], MQTT : /ESPEZ_Laundry/WASHER/ac 0
40834 : [120], NEXTION075 : Cmd Statement Line-1 Sent: page0.j1.val=45
40857 : [120], SYS  : -77.00
40860 : [120], EVENT: RSSI#signal=-77.00
40882 : [120], EVENT: RSSI#signal=-77.00 Processing time:21 milliSeconds