Closed micropet closed 5 years ago
Do they have all these sensors at the same node?
Yes
I just flashed a node with the SDS011 with this firmware, to see if serial I/O may be causing these issues. I will now flash another node with the BME280 and MH-Z19 to see if those are enough to get the same behavior. I noticed the BME280 plugin sometimes has a long time logged in my statistics, so that may be one of the culprits. Maybe you could disable that one as a test to see if it improves stability.
Because the PMS7003 still does not work, most units have an SDS021. So:
BME280 BH1750 Pir MH-Z19 TVoc CSS811 SDS021
So disable the BME280? Good, I'll do that.
But we still have a general problem, because even the units without connected sensors do not run long.
I believe that deactivating sensors does not help us.
First of all, the units would have to run without sensors for days or weeks.
Then you can gradually add sensors.
On the other hand, the reboot intervals you report are way shorter than anyone else's. So we should start to dig down somewhere.
Yes, power is always a tricky issue to evaluate. I use 5V USB UPS on some of my units and they never reboot.
So could you give us info about the setup you're using?
My 16 nodes are all running fine now with all current changes for 1 day and 16 hours. No single reboot! :-) And I have all kind of sensors in use with usage of GPIO, I2C and HW serial....
OK. I'll check that. That is difficult. There are currently 15 units running on different power supplies.
Each unit has its own power supply. Each Wemos D1 has an 1800 μF capacitor on the 3.3 V rail and an 1800 μF capacitor on the 5 V rail.
I have always bought high-quality power supplies. (eg Aukey 2.4A with 48W power supply adapter, AUKEY USB C charger with 46W power, Volutz 60 Watt 12A 5V)
Please do, but if they all are reporting reboots it may very well be a network issue as well?
And just a curious question. Do they ALL have these capacitors? What if you remove those on one? As a test?
The power sounds OK to me. Also, power problems would not likely result in watchdog issues. Maybe I could lower the core library to 1.7.3, since the number of reported watchdog reboots has increased a lot after the update to 1.8.0. Not that they were not reported before, but there are a lot more reports of those reboots than before.
It could still be WiFi related.
I do not believe in a network problem. Currently, 51 WLAN devices are registered on both Unifi Access Points.
On the network there are about 20 Wemos D1 units running old, simple software that I wrote myself.
There are several LED drivers with 3-6 100 Watt LEDs connected to these units. I have been using this for years to switch the light in the apartment via PIR.
These units run for months without rebooting.
So it is the same hardware I use for ESPEasy.
@Grovkillen Yes. All have these capacitors, including my own units.
What core lib do these other nodes use? Could be 2.3.x or older even?
And WiFi-related doesn't mean it is a problem in your access point. It can also be something in the core libraries.
@TD-er
No idea. The version is already several years old.
Because they work, I have not changed anything.
I have just been looking at the uptime of my nodes. One of them has been running for 42 days now, with ESP_Easy_mega-20180513_normal_ESP8266_4096.bin
So I will look into what core lib that was. It is core 2.4.1
Now we have 2.4.2?
Yep, so simply changing to 2.4.1 in the platformio.ini could help. I can make a build with that for you to try if you like. What build version do you need? (normal/test and flash size)
Thank you Gijs. Is not necessary. I can change the core version myself and compile with platformio.
It may be a coincidence, but the unit with the BME280 disabled has been running for 5 hours now.
That's also good news. I have a lot of those lying around, so that makes it easy for me to test. I also have those PMSx003, but then I have to fix that plugin first ;)
Yes, the PMSx003 plugin only worked for me for a few minutes. Then no more data came.
The BME280 is quite important, I think. With temperature, pressure and humidity in one sensor, I find no alternatives at a good price.
It sure is, and it is probably one of the more popular plugins. So I will have a look at it, to see why it appears to take up to 1.5 seconds sometimes. At least that's what my statistics claim.
Wow, that's a lot of time.
Just one of such lines in my stats dump:
5132780 : PluginStats P_27_Environment - BMx280 ONCE_A_SECOND Count: 30 Avg/min/max 53395.60/306/1588655 usec
So that is quite close to the (software) watchdog timeout and maybe the recent versions changed the hardware watchdog timeout to match the 2 sec.
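To make the "quite close to the watchdog timeout" remark concrete, here is a minimal sketch that parses one PluginStats line from the dump and compares the max time to a timeout. The 2 second value is taken from the discussion above and is an assumption, not a verified core setting; the helper names `parse_stats` and `watchdog_fraction` are made up for this illustration.

```python
import re

# Matches one PluginStats line in the format shown in the stats dump above.
STATS_RE = re.compile(
    r"PluginStats\s+(?P<name>.+?)\s+Count:\s*(?P<count>\d+)\s+"
    r"Avg/min/max\s+(?P<avg>[\d.]+)/(?P<min>[\d.]+)/(?P<max>[\d.]+)\s+usec"
)

# Assumption: watchdog timeout of about 2 seconds, as mentioned in the thread.
WATCHDOG_TIMEOUT_USEC = 2_000_000


def parse_stats(line):
    """Return the timing fields of a PluginStats line, or None if no match."""
    m = STATS_RE.search(line)
    if m is None:
        return None
    return {
        "name": m.group("name"),
        "count": int(m.group("count")),
        "avg_usec": float(m.group("avg")),
        "min_usec": float(m.group("min")),
        "max_usec": float(m.group("max")),
    }


def watchdog_fraction(stats, timeout_usec=WATCHDOG_TIMEOUT_USEC):
    """Fraction of the (assumed) watchdog timeout used by the slowest call."""
    return stats["max_usec"] / timeout_usec


line = ("5132780 : PluginStats P_27_Environment - BMx280 ONCE_A_SECOND "
        "Count: 30 Avg/min/max 53395.60/306/1588655 usec")
stats = parse_stats(line)
print(f"{stats['name']}: max {stats['max_usec'] / 1e6:.2f} s, "
      f"{watchdog_fraction(stats):.0%} of the assumed timeout")
```

Any other blocking work in the same loop iteration adds to that max, which is why a single plugin using most of the budget is already a concern.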
That may well be possible. My unit is now running 6 hours 27 minutes without BME280. :)
Can you perhaps adjust something in the library (filter standby ...)?
I found the bug in the BME280 plugin. Should be fixed in #1779
Just curious, is it still running fine with BME280 disabled?
The line I posted earlier now shows:
561184 : PluginStats P_27_Environment - BMx280 ONCE_A_SECOND Count: 30 Avg/min/max 542.13/385/3036 usec
Booted an hour ago. But it was about 7 hours online.
You could check the latest merge with BME280 enabled
For testing here:
Now up and running for 6 hours, I'll watch and message again tomorrow. Hope that helps. Anything I should look at?
You could check the latest merge with BME280 enabled
Just flashed.
@ShardanX I guess the problems with the BME280 manifest with more devices connected which may send data to the node. For example sensors sending data via serial without being requested for data. The BME280 was blocking for 1.5 seconds. This could give issues when buffers are being filled and lots of interrupts triggered which could extend this blocking time even more.
Just added some more I²C devices and restarted both nodes. It was said in the thread that a node should reboot after some hours even without sensors, but up to now it has been quiet here.
I2C devices are "pull", so you have to request them to send you data. Some devices on serial also have a pull-like way of communicating (Modbus/Mbus), but others just push the data. That last group may interfere with the processes running on the CPU and thus extend processing time, or cause buffers to overflow when not dealt with in due time.
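As a rough back-of-the-envelope sketch of the buffer overflow risk: the number of bytes a push-type serial device can deliver while the CPU is blocked, compared with the receive buffer size. The 256 byte buffer and the assumption that the sender transmits continuously are both assumptions for illustration (a worst case); real sensors usually send short frames at intervals.

```python
def bytes_received(baud, blocked_seconds, bits_per_byte=10):
    """Bytes arriving on the wire while the CPU is blocked.

    Assumes 8N1 framing (1 start + 8 data + 1 stop bit = 10 bits per byte)
    and a sender that transmits without pause (worst case).
    """
    return int(baud * blocked_seconds / bits_per_byte)


def overflowed_bytes(baud, blocked_seconds, buffer_size):
    """Bytes that no longer fit in the receive buffer, never negative."""
    return max(0, bytes_received(baud, blocked_seconds) - buffer_size)


# Assumption: a 256 byte receive buffer; the real size depends on the core.
RX_BUFFER_SIZE = 256

# 1.5 s of blocking, as measured for the BME280 plugin, at 9600 baud.
arrived = bytes_received(9600, 1.5)
lost = overflowed_bytes(9600, 1.5, RX_BUFFER_SIZE)
print(f"arrived: {arrived} bytes, lost: {lost} bytes")

# After the fix the plugin blocks for roughly 3 ms at most.
print(f"lost after fix: {overflowed_bytes(9600, 0.003, RX_BUFFER_SIZE)} bytes")
```

Under these assumptions a 1.5 second stall is several times larger than the buffer can absorb, while a few milliseconds of blocking is harmless.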
Just a note: I had WD issues recently when I was setting the serial baudrate to 9600 with debug messages on... @micropet your sensors are all I2C?
@s0170071 Not all. Only: BME280 BH1750 CSS811
@micropet What is the "reboot interval" now, with the changed BME280 code?
ca. 5 hours
So that's at least a factor 10 improvement from the start of this issue ;) Is the reboot reason always the same (Hardware Watchdog), or do you also get other reasons?
It is always the same: Hardware Watchdog And yes factor 10 - 15 better.
Just a few of these iterations and we can call it stable, or at least very hard to reproduce ;)
very hard to reproduce - That's the way it is.
A big advantage I see in ESPEasy is that the unit is functional again after a reboot.
This is not always the case with my own builds. I had to briefly disconnect them from the supply voltage before they worked again.
Maybe there will be time for things like the PMS7003 or, very important to me, inverting the PCA9685 values. Currently, at a value of 0 the LED is at full power, and at 4092 the LED is off.
FYI: Both of my nodes (see above) are running for about 48h without problem now.
I have 7 reboots since this morning. (Node with Sensors)
Yes that would be nice. 1 or 2 days would be enough. ;)
I found something in my rules that may have caused these frequent crashes. It was a comment with a leading space:
on RelayOn do
gpio,12,1
// gpio,13,1
endon
There was a message in the logs about a command that was not recognized; I don't exactly remember what it was. I can check this evening though. Runtime was always less than 4 hours. Since I removed the comment line, the log message is gone and the unit is now approaching 12 hours. Anyone else using rules?
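To illustrate how a leading space can turn a comment into an "unknown command", here is a purely hypothetical sketch. It is not ESPEasy's actual rules parser, and the function names are made up; it only shows the difference between checking for `//` at the very start of the line and checking after stripping leading whitespace.

```python
def is_comment_naive(line):
    """Treats a line as a comment only if '//' is the very first thing."""
    return line.startswith("//")


def is_comment_tolerant(line):
    """Ignores leading whitespace before looking for the '//' marker."""
    return line.lstrip().startswith("//")


def commands(rule_lines, is_comment):
    """Return the lines that would be handed on for execution as commands."""
    return [ln.strip() for ln in rule_lines if ln.strip() and not is_comment(ln)]


# Body of the RelayOn rule above, with the leading space before the comment.
rule_body = ["gpio,12,1", " // gpio,13,1"]

print(commands(rule_body, is_comment_naive))     # comment leaks through
print(commands(rule_body, is_comment_tolerant))  # comment is skipped
```

With the naive check the indented comment is passed on as if it were a command, which is the kind of situation that would produce a "not recognized" log message.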
I use rules and comments but never on a single line and never with a space.... I must test!
I get this in the log but no crash:
465062 : Command:
465062 : Command unknown: ""
I use this rule to test:
on Rules#Timer=1 do
Publish %sysname%/IP,%ip%
Publish %sysname%/MAC,%mac%
Publish %sysname%/AP,%bssid%
// TaskValueSet 4,1,2
// gpio,13,1
TimerSet,1,10
endon
The current version still has the same problem as the versions of the last few weeks. I use ESP_Easy_mega-20180922_test_ESP8266_4096.bin
The units without sensors run only slightly longer, between one and four hours.
Units with sensors (BME280 BH1750 Pir MH-Z19 TVoc CSS811 PMS7003) reboot after a few minutes. The running time is between 3 minutes and 30 minutes. It is always different.