letscontrolit / ESPEasy

Easy MultiSensor device based on ESP8266/ESP32
http://www.espeasy.com

Manual reboot () due to hardware watchdog in release mega-20180922 #1774

Closed micropet closed 5 years ago

micropet commented 6 years ago

The current version still has the same problem as the versions of the last few weeks. I use ESP_Easy_mega-20180922_test_ESP8266_4096.bin

The units without sensors run only slightly longer: between one and four hours.

Units with sensors (BME280 BH1750 PIR MH-Z19 TVOC CCS811 PMS7003) reboot after a few minutes. The running time is between 3 minutes and 30 minutes. It is always different.

TD-er commented 6 years ago

Do they have all these sensors at the same node?

micropet commented 6 years ago

Yes

TD-er commented 6 years ago

I just flashed a node with the SDS011 with this firmware, to see if serial I/O may be causing these issues. I will now flash another node with the BME280 and MH-Z19 to see if those are enough to get the same behavior. I noticed the BME280 plugin sometimes has a long time logged in my statistics, so that may be one of the culprits. Maybe you could disable that one as a test to see if it improves stability.

micropet commented 6 years ago

Because the PMS7003 still does not work, most units have an SDS021. So:

BME280 BH1750 PIR MH-Z19 TVOC CCS811 SDS021

So disable the BME280? Good, I'll do that.

micropet commented 6 years ago

But we still have a general problem, because even the units without connected sensors do not run long.

I believe that deactivating sensors does not help us.

First of all, the units would have to run without sensors for days or weeks.

Then you can gradually add sensors.

TD-er commented 6 years ago

On the other hand, the reboot intervals you report are far shorter than anyone else's. So we should start digging somewhere.

Grovkillen commented 6 years ago

Yes, power is always a tricky issue to evaluate. I use 5V USB UPS on some of my units and they never reboot.

So could you give us info about the setup you're using?

v-a-d-e-r commented 6 years ago

My 16 nodes have all been running fine with all current changes for 1 day and 16 hours now. Not a single reboot! :-) And I have all kinds of sensors in use, using GPIO, I2C and HW serial....

micropet commented 6 years ago

OK, I will check that. That is difficult. There are currently 15 units running on different power supplies.

Each unit has its own power supply. Each Wemos D1 has an 1800 μF capacitor on the 3.3 V rail and an 1800 μF capacitor on the 5 V rail.

I have always bought high-quality power supplies (e.g. Aukey 2.4A with 48W power supply adapter, AUKEY USB C charger with 46W power, Volutz 60 Watt 12A 5V).

Grovkillen commented 6 years ago

Please do, but if they all are reporting reboots it may very well be a network issue as well?

Grovkillen commented 6 years ago

And just a curious question. Do they ALL have these capacitors? What if you remove those on one? As a test?

TD-er commented 6 years ago

The power sounds OK to me. Also, power problems would not likely result in watchdog issues. Maybe I could lower the core library to 1.7.3, since the number of reported watchdog reboots has increased a lot after the update to 1.8.0. Not that they were not reported before, but there are a lot more reports of those reboots than before.

It could still be WiFi related.

micropet commented 6 years ago

I do not believe it is a network problem. Currently, 51 WLAN devices are registered on both Unifi Access Points.

There are about 20 Wemos D1 units on the network running old, simple software that I wrote myself.

There are several LED drivers with 3-6 100 Watt LEDs connected to these units. I have been using this for years to switch the light in the apartment via PIR.

These units run for months without rebooting.

So I use the same hardware for ESPEasy.

@Grovkillen Yes. All have these capacitors, including my own units.

TD-er commented 6 years ago

What core lib do these other nodes use? Could be 2.3.x or older even?

And WiFi related doesn't mean it is a problem in your access point. It can also be something in the core libraries.

micropet commented 6 years ago

@TD-er

No idea. The version is already several years old.

Because they work, I have not changed anything.

TD-er commented 6 years ago

I have just been looking at the uptime of my nodes. One of them has been running for 42 days now and is running ESP_Easy_mega-20180513_normal_ESP8266_4096.bin

So I will look into what core lib that was. It is core 2.4.1

micropet commented 6 years ago

Now we have 2.4.2?

TD-er commented 6 years ago

Yep, so simply changing to 2.4.1 in the platformio.ini could help. I can make a build with that for you to try if you like. What build version do you need? (normal/test and flash size)

micropet commented 6 years ago

Thank you, Gijs. That is not necessary. I can change the core version myself and compile with platformio.

micropet commented 6 years ago

It may be a coincidence, but the unit with the BME280 disabled has been running for 5 hours now.

TD-er commented 6 years ago

That's also good news. I have a lot of those lying around, so that makes it easy for me to test. I also have those PMSx003, but then I have to fix that plugin first ;)

micropet commented 6 years ago

Yes, the PMSx003 plugin only worked for me for a few minutes. After that, no more data arrives.

The BME280 is quite important, I think. With temperature, pressure and humidity in one sensor, I can't find an alternative at a good price.

TD-er commented 6 years ago

It sure is and it is probably one of the more popular plugins. So I will have a look at it, to see why it appears to take up to 1.5 seconds sometimes. At least that's what my statistics claim.

micropet commented 6 years ago

Wow, that's a lot of time.

TD-er commented 6 years ago

Just one of such lines in my stats dump: 5132780 : PluginStats P_27_Environment - BMx280 ONCE_A_SECOND Count: 30 Avg/min/max 53395.60/306/1588655 usec

So that is quite close to the (software) watchdog timeout and maybe the recent versions changed the hardware watchdog timeout to match the 2 sec.
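To make the comparison concrete, here is a minimal Python sketch that parses a stats line like the one above and relates the worst-case time to the watchdog timeout. The line layout is inferred only from the sample in this thread, and the 2 second timeout is an assumption taken from the comment, not a confirmed firmware constant. The names `parse_stats_line` and `watchdog_fraction` are made up for this illustration.

```python
import re

# Layout inferred from the sample line in this thread; other firmware
# versions may format their statistics differently.
STATS_RE = re.compile(
    r"PluginStats (?P<plugin>.+?) (?P<event>[A-Z_]+) Count: (?P<count>\d+) "
    r"Avg/min/max (?P<avg>[\d.]+)/(?P<min>\d+)/(?P<max>\d+) usec"
)

# Assumption: a 2 second timeout, taken from the comment above.
ASSUMED_WATCHDOG_USEC = 2_000_000


def parse_stats_line(line):
    """Return the fields of one PluginStats line as a dict, or None."""
    match = STATS_RE.search(line)
    if match is None:
        return None
    return {
        "plugin": match.group("plugin"),
        "event": match.group("event"),
        "count": int(match.group("count")),
        "avg_usec": float(match.group("avg")),
        "min_usec": int(match.group("min")),
        "max_usec": int(match.group("max")),
    }


def watchdog_fraction(stats, timeout_usec=ASSUMED_WATCHDOG_USEC):
    """Fraction of the assumed watchdog timeout used by the slowest call."""
    return stats["max_usec"] / timeout_usec


line = ("5132780 : PluginStats P_27_Environment - BMx280 ONCE_A_SECOND "
        "Count: 30 Avg/min/max 53395.60/306/1588655 usec")
stats = parse_stats_line(line)
print(stats["plugin"], stats["max_usec"], round(watchdog_fraction(stats), 2))
```

With the sample line, the slowest call uses roughly 0.79 of the assumed 2 second budget, which leaves little room for anything else running in the same loop iteration.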

micropet commented 6 years ago

That may well be possible. My unit is now running 6 hours 27 minutes without BME280. :)

Can you perhaps adjust something in the library (filter standby ...)?

TD-er commented 6 years ago

I found the bug in the BME280 plugin. Should be fixed in #1779

Just curious, is it still running fine with BME280 disabled?

The line I posted earlier now shows: 561184 : PluginStats P_27_Environment - BMx280 ONCE_A_SECOND Count: 30 Avg/min/max 542.13/385/3036 usec

micropet commented 6 years ago

It rebooted an hour ago, but it had been online for about 7 hours.

TD-er commented 6 years ago

You could check the latest merge with BME280 enabled

ShardanX commented 6 years ago

For testing here:

Now up and running for 6 hours, I'll watch and message again tomorrow. Hope that helps. Anything I should look at?

micropet commented 6 years ago

> You could check the latest merge with BME280 enabled

Just flashed it.

TD-er commented 6 years ago

@ShardanX I guess the problems with the BME280 manifest with more devices connected which may send data to the node. For example sensors sending data via serial without being requested for data. The BME280 was blocking for 1.5 seconds. This could give issues when buffers are being filled and lots of interrupts triggered which could extend this blocking time even more.

ShardanX commented 6 years ago

> @ShardanX I guess the problems with the BME280 manifest with more devices connected which may send data to the node. For example sensors sending data via serial without being requested for data. The BME280 was blocking for 1.5 seconds. This could give issues when buffers are being filled and lots of interrupts triggered which could extend this blocking time even more.

Just added some more I²C devices and restarted both nodes. As was said in the thread, it should reboot after some hours even without sensors, but so far it has been quiet here.

TD-er commented 6 years ago

I2C devices are "pull", so you have to request them to send you data. Some devices on serial also have a pull-like way of communicating (Modbus/Mbus), but others just push the data. That last group may interfere with the processes running on the CPU and thus extend processing time, or cause buffers to overflow when not dealt with in due time.
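The overflow risk for "push" devices can be estimated with simple arithmetic. The Python sketch below is only an illustration: the function name is made up, and the buffer size and data rates in the examples are assumed values, not measurements from any node in this thread.

```python
def bytes_lost_while_blocked(bytes_per_second, blocked_seconds, buffer_size):
    """Estimate how many pushed bytes no longer fit in the receive buffer.

    Assumes the sender keeps pushing at a constant rate and nothing reads
    the buffer while the CPU is blocked.
    """
    arrived = int(bytes_per_second * blocked_seconds)
    return max(0, arrived - buffer_size)


# Assumed values for illustration only:
# - 960 bytes/s is a continuous stream at 9600 baud (10 bits per byte)
# - 256 bytes is a hypothetical receive buffer size
# - 1.5 s is the blocking time mentioned for the BME280 plugin
print(bytes_lost_while_blocked(960, 1.5, 256))       # continuous stream
print(bytes_lost_while_blocked(32, 1.5, 256))        # one 32-byte frame per second
print(bytes_lost_while_blocked(960, 0.003036, 256))  # after the fix (3036 usec max)
```

Under these assumptions only a sensor that streams continuously would lose data during a 1.5 second block. A sensor sending one short frame per second would still fit in the buffer, so lost bytes alone may not explain every reboot.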

s0170071 commented 6 years ago

Just a note: I had WD issues recently when I was setting the serial baudrate to 9600 with debug messages on... @micropet your sensors are all I2C?

micropet commented 6 years ago

@s0170071 Not all. Only: BME280 BH1750 CCS811

TD-er commented 6 years ago

@micropet What is the "reboot interval" now, with the changed BME280 code?

micropet commented 6 years ago

ca. 5 hours

TD-er commented 6 years ago

So that's at least a factor 10 improvement from the start of this issue ;) Is the reboot reason always the same (Hardware Watchdog), or do you also get other reasons?

micropet commented 6 years ago

It is always the same: Hardware Watchdog. And yes, a factor of 10 to 15 better.

TD-er commented 6 years ago

Just a few of these iterations and we can call it stable, or at least very hard to reproduce ;)

micropet commented 6 years ago

very hard to reproduce - That's the way it is.

micropet commented 6 years ago

A big advantage I see in ESPEasy is that the unit is functional again after a reboot.

This is not always the case with my own builds. I had to briefly disconnect them from the supply voltage to get them working again.

Maybe there will be time for things like the PMS7003 or, very important to me, inverting the PCA9685 values. Currently, at a value of 0 the LED is at full power, and at 4092 the LED is off.

ShardanX commented 6 years ago

FYI: Both of my nodes (see above) are running for about 48h without problem now.

micropet commented 6 years ago

I have 7 reboots since this morning. (Node with Sensors)

s0170071 commented 6 years ago

1795 :-)

micropet commented 6 years ago

Yes that would be nice. 1 or 2 days would be enough. ;)

s0170071 commented 6 years ago

I found something in my rules that may have caused these frequent crashes. It was a comment with a leading space:

on RelayOn do
 gpio,12,1
 // gpio,13,1
endon

There was a message in the logs about an unrecognized command; I don't remember exactly what it was. I can check this evening though. Runtime was always less than 4 hours. Since I removed the comment line the log message is gone and the unit is now approaching 12 hours. Anyone else using rules?

Grovkillen commented 6 years ago

I use rules and comments but never on a single line and never with a space.... I must test!

Grovkillen commented 6 years ago

I get this in the log but no crash:

465062 : Command:
465062 : Command unknown: ""

I use this rule to test:

on Rules#Timer=1 do
  Publish %sysname%/IP,%ip%
  Publish %sysname%/MAC,%mac%
  Publish %sysname%/AP,%bssid%
  // TaskValueSet 4,1,2
  // gpio,13,1
  TimerSet,1,10
endon
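The `Command unknown: ""` log entry suggests that an indented comment line ends up being handed to the command handler as an empty command. As a minimal sketch of the handling one would expect, the Python snippet below trims each rule line before checking for a comment and skips lines that are empty afterwards. This is a hypothetical illustration, not ESPEasy's actual rules parser, and `executable_commands` is a made-up name.

```python
def executable_commands(rule_text):
    """Return the rule lines that should reach the command handler.

    Hypothetical preprocessing step: whitespace is trimmed first, so a
    comment with leading spaces is recognized as a comment, and blank
    lines are never passed on as an empty command.
    """
    commands = []
    for raw_line in rule_text.splitlines():
        line = raw_line.strip()
        # Only whole-line comments are handled here, so commands that
        # contain "//" elsewhere (for example a URL) stay untouched.
        if not line or line.startswith("//"):
            continue
        commands.append(line)
    return commands


rule = """on RelayOn do
 gpio,12,1
 // gpio,13,1
endon
"""
print(executable_commands(rule))
```

Run against the rule posted earlier in the thread, only the `on`, `gpio,12,1` and `endon` lines remain, so no empty command would be produced.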