Closed giig1967g closed 5 years ago
Do all have UDP enabled?
no. not all. Some do, others don't.
The ones that are rebooting/freeze... do they use UDP?
in the last 4 days, two units had UDP and one hadn't.
Ok so my idea that a crowded UDP channel could be causing hickups is not valid.
Try to disable logging to serial port. There seems to be an issue with the current core libs to reconnect to the fall-back AP, that one is on my radar :) There may also be some issues with network connections, which may be related to the recently reported UDP issues. Those have been changed only recently by the way.
serial port log disabled, but kept serial port enabled on all units. will keep 2 units with UDP and the others without. will see.
Here the mosquitto log of 2 units that freeze or reboot. Unit ESPT3: UDP enabled, MQTT IMPORT, static IP, OLED, Dummy Unit ESPT6: UDP enabled, MQTT IMPORT, static IP, LCD, Dummy
Some times in the mosquitto log of the broker (Openhab2 on raspberry Pi) I see disconnect error messages, but most of the times it doesn't mean that it froze.
from mosquitto.log:
**UNIT ESPT6 DIDN'T FROZE
Mon May 14 22:35:56 CEST 2018
1526330156: Client ESPClient_60:01:94:8E:BA:C9 has exceeded timeout, disconnecting.
1526330156: Socket error on client ESPClient_60:01:94:8E:BA:C9, disconnecting.
1526330161: Client ESPT6-Import has exceeded timeout, disconnecting.
1526330161: Socket error on client ESPT6-Import , disconnecting.
1526330186: New connection from 192.168.1.206 on port 1883.
1526330186: New client connected from 192.168.1.206 as ESPT6-Import (c1, k15, u'openhabian').
1526330187: New connection from 192.168.1.206 on port 1883.
1526330187: New client connected from 192.168.1.206 as ESPClient_60:01:94:8E:BA:C9 (c1, k15, u'openhabian').
** UNIT ESPT6 FROZE at this exact time: Tue May 15 03:35:12 CEST 2018**
1526348112: Client ESPT6-Import has exceeded timeout, disconnecting.
1526348112: Socket error on client ESPT6-Import , disconnecting.
1526348116: Client ESPClient_60:01:94:8E:BA:C9 has exceeded timeout, disconnecting.
1526348116: Socket error on client ESPClient_60:01:94:8E:BA:C9, disconnecting.
**UNIT ESPT3 DIDN'T FROZE
Tue May 15 15:05:16 CEST 2018
1526389516: Client ESPT3-Import has exceeded timeout, disconnecting.
1526389516: Socket error on client ESPT3-Import , disconnecting.
1526389523: Client ESPClient_DC:4F:22:1D:82:43 has exceeded timeout, disconnecting.
1526389523: Socket error on client ESPClient_DC:4F:22:1D:82:43, disconnecting.
1526389524: New connection from 192.168.1.203 on port 1883.
1526389524: New client connected from 192.168.1.203 as ESPT3-Import (c1, k15, u'openhabian').
1526389525: New connection from 192.168.1.203 on port 1883.
1526389525: New client connected from 192.168.1.203 as ESPClient_DC:4F:22:1D:82:43 (c1, k15, u'openhabian').
They all use MQTT import? That one really should get a proper fix, since it really is a hack.
These two yes, they use MQTT Import. But another unit without MQTTimport and without UDP froze three days ago. This is puzzling me
This is the syslog of unit ESPT3 just before the MQTT IMPORT disconnect occurred. What is "sendcontent free"?
2018-05-15 15:04:59 Kernel.Debug 192.168.1.203 ESPT3 EspEasy: EVENT: Clock#Time=Tue,15:05 Processing time:38 milliSeconds
2018-05-15 15:05:08 Kernel.Debug 192.168.1.203 ESPT3 EspEasy: LoadFromFile: config.dat index: 4096 datasize: 472
2018-05-15 15:05:08 Kernel.Debug 192.168.1.203 ESPT3 EspEasy: LoadFromFile: config.dat index: 5120 datasize: 512
2018-05-15 15:05:09 Kernel.Debug 192.168.1.203 ESPT3 EspEasy: sendcontent free: 13728 chunk size:400
2018-05-15 15:05:09 Kernel.Debug 192.168.1.203 ESPT3 EspEasy: sendcontent free: 12216 chunk size:400
2018-05-15 15:05:09 Kernel.Debug 192.168.1.203 ESPT3 EspEasy: sendcontent free: 11376 chunk size:400
2018-05-15 15:05:09 Kernel.Debug 192.168.1.203 ESPT3 EspEasy: sendcontent free: 12384 chunk size:400
2018-05-15 15:05:09 Kernel.Debug 192.168.1.203 ESPT3 EspEasy: sendcontent free: 11544 chunk size:400
2018-05-15 15:05:09 Kernel.Debug 192.168.1.203 ESPT3 EspEasy: sendcontent free: 11376 chunk size:400
2018-05-15 15:05:09 Kernel.Debug 192.168.1.203 ESPT3 EspEasy: sendcontent free: 11400 chunk size:400
2018-05-15 15:05:09 Kernel.Debug 192.168.1.203 ESPT3 EspEasy: sendcontent free: 12384 chunk size:400
2018-05-15 15:05:09 Kernel.Debug 192.168.1.203 ESPT3 EspEasy: sendcontent free: 11544 chunk size:400
2018-05-15 15:05:09 Kernel.Debug 192.168.1.203 ESPT3 EspEasy: sendcontent free: 11376 chunk size:400
2018-05-15 15:05:09 Kernel.Debug 192.168.1.203 ESPT3 EspEasy: sendcontent free: 12384 chunk size:400
2018-05-15 15:05:20 Kernel.Debug 192.168.1.203 ESPT3 EspEasy: sendcontent free: 11184 chunk size:400
2018-05-15 15:05:20 Kernel.Debug 192.168.1.203 ESPT3 EspEasy: sendcontent free: 12552 chunk size:400
2018-05-15 15:05:20 Kernel.Debug 192.168.1.203 ESPT3 EspEasy: sendcontent free: 11712 chunk size:400
2018-05-15 15:05:20 Kernel.Debug 192.168.1.203 ESPT3 EspEasy: sendcontent free: 12184 chunk size:400
2018-05-15 15:05:20 Kernel.Debug 192.168.1.203 ESPT3 EspEasy: sendcontent free: 12832 chunk size:400
2018-05-15 15:05:20 Kernel.Debug 192.168.1.203 ESPT3 EspEasy: sendcontent free: 13352 chunk size:400
2018-05-15 15:05:20 Kernel.Debug 192.168.1.203 ESPT3 EspEasy: sendcontent free: 13184 chunk size:400
2018-05-15 15:05:20 Kernel.Debug 192.168.1.203 ESPT3 EspEasy: sendcontent free: 13832 chunk size:219
2018-05-15 15:05:20 Kernel.Debug 192.168.1.203 ESPT3 EspEasy: sendcontent free: 13664 chunk size:0
2018-05-15 15:05:20 Kernel.Notice 192.168.1.203 ESPT3 EspEasy: EVENT: MQTTimport#Disconnected
2018-05-15 15:05:20 Kernel.Debug 192.168.1.203 ESPT3 EspEasy: EVENT: MQTTimport#Disconnected Processing time:38 milliSeconds
2018-05-15 15:05:20 Kernel.Error 192.168.1.203 ESPT3 EspEasy: IMPT : MQTT 037 Connection lost
2018-05-15 15:05:23 Kernel.Notice 192.168.1.203 ESPT3 EspEasy: IMPT : Connected to MQTT broker with Client ID=ESPT3-Import
2018-05-15 15:05:23 Kernel.Notice 192.168.1.203 ESPT3 EspEasy: EVENT: MQTTimport#Connected
This "sendcontent" is debug information to track memory usage per function. It looks like there was some kind of disconnect, but it was not logged what caused the disconnect. A few days ago I changed the WiFi client usage as a work-around for some bug in the core library. Maybe you could try the last version?
loaded the latest version.
Unit ESPT6 just froze. It happened while I was typing. I noticed a quick flash of the led on the board. LCD is still displaying the old values. No reaction, no ping, no AP. This unit has UDP enabled.
Had that unit been rebooted/reset after being flashed? There is a bug (longstanding bug) the ESP8266 crashes when rebooting after it has been flashed. You have to perform a cold boot, or press reset once after flashing and then it will reboot just fine.
Still leaves the question why it reboots.
removed UDP and serial log. Cold reboot. Will keep monitoring.
I don't think that UDP or serial log is to blame: look at this unit with firmware 20180330: UDP enabled and serial log enabled: running for 47 days!
One of my units even has "Uptime: 80 days 2 hours 17 minutes" But the inter-ESPeasy communications have been changed quite recently. And I think I changed to core 2.4.x after 20180330.
Not saying it is an UDP problem, just that dismissing it as a possibility is a bit too early.
@TD-er I got one up for 90 days and 14 hours. This version of the FW seems quite stable. @giig1967g things to check:
I got one node with 9 DS18B20 hooked up with a 5V USB UPS and it's running for as long as I do not update it. So I suspect the power supply is very important. Small spikes (up or down) might cause the ESP to reboot?
I have disabled UDP and serial log. Installed 20180516 firmware. Units have been running for 3.5 days without problems. So maybe it's solved.
I can't believe it. I was so happy.... :( My unit rebooted just now...after 3.5 days of work:
Unit | 6
-- | --
Local Time | 2018-05-20 10:35:15
Uptime | 0 days 0 hours 5 minutes
Load | 8% (LC=7029)
Free Mem | 15592 (14792 - sendWebPage)
Boot | Manual reboot (2)
Reset Reason | Exception
Build
20102 - Mega
Libraries
ESP82xx Core 2_4_1, NONOS SDK 2.2.1(cfd48f3), LWIP: 2.0.3
GIT version
mega-20180516
Plugins
47 [Normal]
Build time
May 16 2018 02:12:17
Binary filename
ESP_Easy_mega-20180516_normal_ESP8285_1024.bin
What plugins do you have enabled? An exception is quite often related to some programming error.
Dummy, LCD, SystemInfo, MQTTImport.
Have you done the factory reset?
-------- Ursprüngliche Nachricht -------- Von: Plebs notifications@github.com Gesendet: 20. Mai 2018 10:52:40 MESZ An: letscontrolit/ESPEasy ESPEasy@noreply.github.com CC: s0170071 drk@posteo.de, Comment comment@noreply.github.com Betreff: Re: [letscontrolit/ESPEasy] What could cause a unit to freeze? (#1436)
Dummy, LCD, SystemInfo, MQTTImport.
-- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/letscontrolit/ESPEasy/issues/1436#issuecomment-390466966
no. You mean that I should reprogram every unit after every software upgrade?
No. Only if you updated it and have problems afterwards.
No you shouldn't have to do that with a official stable release but now we're trying to pinpoint bugs and one way is to try a factory reset and see if behavior changes. If it does we get a hint where to look.
ok. will do.
Socket Errors still occurr every often (Every 1-2 days). Only a restart of the ESP resolve it. Webserver is still up, but MQTT is dead.
Maybe try to switch to another MQTT library? Example: https://github.com/marvinroger/async-mqtt-client
@Fusseldieb What version are you running? What core version, what LWIP version, what ESPeasy version?
Even Tasmota is now switching to this version of MQTT client. See https://github.com/arendst/Sonoff-Tasmota/commit/04493965e4c14b43c53c5baef5798a8cdabb2272
I know this is an old issue, but it is one of the oldest I could find regarding instability + MQTT And it is also one of the first ones after this commit: https://github.com/TD-er/ESPEasy/commit/0e8b333f482fd12f7a7c584b875015e2efe7e490
So let me explain my thoughts here...
I came across this issue on PubSubClient: https://github.com/knolleary/pubsubclient/issues/589
This is about the destruction of a WiFiClient
. On May 13th, I made a commit to re-init the WiFiClient as a workaround suggested here: https://github.com/esp8266/Arduino/issues/4497#issuecomment-373023864
For the MQTT import, a new WiFiClient is created when there is no actual connection, but the client is not actively being set to the MQTT client.
So I would suggest to do something like this:
if (!connected) {
// workaround see: https://github.com/esp8266/Arduino/issues/4497#issuecomment-373023864
if (MQTTclient_037 != NULL) {
espclient_037 = WiFiClient();
espclient_037.setTimeout(CONTROLLER_CLIENTTIMEOUT_DFLT);
MQTTclient_037->setClient(espclient_037);
}
++reconnectCount;
addLog(LOG_LEVEL_ERROR, F("IMPT : MQTT 037 Connection lost"));
}
(the last line is new + check for null pointer)
Maybe someone can already test this? Or someone has some ideas on this?
This is no longer an issue, so it can be closed.
Hi, the issue is tricky. I have 10+ units running. Units with software prior to april 3rd have been working for months. Units with recent software are stable but sometimes randomly one or another freeze or reboot itself.
When the freeze happens, the unit just stop responding to webserver, ping, switch buttons. No update to LCD, OLED, etc. Also it doesn't switch to AP. Simply it's switched on (LCD displays latest content) but dead. This is the worst situation because you expect it to react to external triggers or send sensor update but it doesn't and the only solution when you notice it is to reboot physically the unit. And this could be very difficult if the units are in hidden places. If it matters, some units have MQTT Import plugin, others Encoder plugin, others LCD or OLED, others use InterESPeasy UDP. So no fixed pattern. All have static IP, if it matters.
Now, how can this be debugged? In which circustamces a unit could freeze?
The reboot issue is less problematic, as it recovers itself, but still it's annoying. Latest occurrence was at exactly noon (12:00:00) during idle operation without any timer set.
Any idea?