letscontrolit / ESPEasy

Easy MultiSensor device based on ESP8266/ESP32
http://www.espeasy.com
Other
3.29k stars 2.22k forks source link

What could cause a unit to freeze? #1436

Closed giig1967g closed 5 years ago

giig1967g commented 6 years ago

Hi, the issue is tricky. I have 10+ units running. Units with software prior to april 3rd have been working for months. Units with recent software are stable but sometimes randomly one or another freeze or reboot itself.

When the freeze happens, the unit just stop responding to webserver, ping, switch buttons. No update to LCD, OLED, etc. Also it doesn't switch to AP. Simply it's switched on (LCD displays latest content) but dead. This is the worst situation because you expect it to react to external triggers or send sensor update but it doesn't and the only solution when you notice it is to reboot physically the unit. And this could be very difficult if the units are in hidden places. If it matters, some units have MQTT Import plugin, others Encoder plugin, others LCD or OLED, others use InterESPeasy UDP. So no fixed pattern. All have static IP, if it matters.

Now, how can this be debugged? In which circustamces a unit could freeze?

The reboot issue is less problematic, as it recovers itself, but still it's annoying. Latest occurrence was at exactly noon (12:00:00) during idle operation without any timer set.

Any idea?

Grovkillen commented 6 years ago

Do all have UDP enabled?

giig1967g commented 6 years ago

no. not all. Some do, others don't.

Grovkillen commented 6 years ago

The ones that are rebooting/freeze... do they use UDP?

giig1967g commented 6 years ago

in the last 4 days, two units had UDP and one hadn't.

Grovkillen commented 6 years ago

Ok so my idea that a crowded UDP channel could be causing hickups is not valid.

TD-er commented 6 years ago

Try to disable logging to serial port. There seems to be an issue with the current core libs to reconnect to the fall-back AP, that one is on my radar :) There may also be some issues with network connections, which may be related to the recently reported UDP issues. Those have been changed only recently by the way.

giig1967g commented 6 years ago

serial port log disabled, but kept serial port enabled on all units. will keep 2 units with UDP and the others without. will see.

giig1967g commented 6 years ago

Here the mosquitto log of 2 units that freeze or reboot. Unit ESPT3: UDP enabled, MQTT IMPORT, static IP, OLED, Dummy Unit ESPT6: UDP enabled, MQTT IMPORT, static IP, LCD, Dummy

Some times in the mosquitto log of the broker (Openhab2 on raspberry Pi) I see disconnect error messages, but most of the times it doesn't mean that it froze.

from mosquitto.log:

**UNIT ESPT6 DIDN'T FROZE
Mon May 14 22:35:56 CEST 2018
1526330156: Client ESPClient_60:01:94:8E:BA:C9 has exceeded timeout, disconnecting.
1526330156: Socket error on client ESPClient_60:01:94:8E:BA:C9, disconnecting.
1526330161: Client ESPT6-Import         has exceeded timeout, disconnecting.
1526330161: Socket error on client ESPT6-Import        , disconnecting.
1526330186: New connection from 192.168.1.206 on port 1883.
1526330186: New client connected from 192.168.1.206 as ESPT6-Import         (c1, k15, u'openhabian').
1526330187: New connection from 192.168.1.206 on port 1883.
1526330187: New client connected from 192.168.1.206 as ESPClient_60:01:94:8E:BA:C9 (c1, k15, u'openhabian').

** UNIT ESPT6 FROZE at this exact time: Tue May 15 03:35:12 CEST 2018**
1526348112: Client ESPT6-Import         has exceeded timeout, disconnecting.
1526348112: Socket error on client ESPT6-Import        , disconnecting.
1526348116: Client ESPClient_60:01:94:8E:BA:C9 has exceeded timeout, disconnecting.
1526348116: Socket error on client ESPClient_60:01:94:8E:BA:C9, disconnecting.

**UNIT ESPT3 DIDN'T FROZE
Tue May 15 15:05:16 CEST 2018
1526389516: Client ESPT3-Import         has exceeded timeout, disconnecting.
1526389516: Socket error on client ESPT3-Import        , disconnecting.
1526389523: Client ESPClient_DC:4F:22:1D:82:43 has exceeded timeout, disconnecting.
1526389523: Socket error on client ESPClient_DC:4F:22:1D:82:43, disconnecting.
1526389524: New connection from 192.168.1.203 on port 1883.
1526389524: New client connected from 192.168.1.203 as ESPT3-Import         (c1, k15, u'openhabian').
1526389525: New connection from 192.168.1.203 on port 1883.
1526389525: New client connected from 192.168.1.203 as ESPClient_DC:4F:22:1D:82:43 (c1, k15, u'openhabian').
TD-er commented 6 years ago

They all use MQTT import? That one really should get a proper fix, since it really is a hack.

giig1967g commented 6 years ago

These two yes, they use MQTT Import. But another unit without MQTTimport and without UDP froze three days ago. This is puzzling me

giig1967g commented 6 years ago

This is the syslog of unit ESPT3 just before the MQTT IMPORT disconnect occurred. What is "sendcontent free"?

2018-05-15 15:04:59 Kernel.Debug    192.168.1.203   ESPT3 EspEasy: EVENT: Clock#Time=Tue,15:05 Processing time:38 milliSeconds
2018-05-15 15:05:08 Kernel.Debug    192.168.1.203   ESPT3 EspEasy: LoadFromFile: config.dat index: 4096 datasize: 472
2018-05-15 15:05:08 Kernel.Debug    192.168.1.203   ESPT3 EspEasy: LoadFromFile: config.dat index: 5120 datasize: 512
2018-05-15 15:05:09 Kernel.Debug    192.168.1.203   ESPT3 EspEasy: sendcontent free: 13728 chunk size:400
2018-05-15 15:05:09 Kernel.Debug    192.168.1.203   ESPT3 EspEasy: sendcontent free: 12216 chunk size:400
2018-05-15 15:05:09 Kernel.Debug    192.168.1.203   ESPT3 EspEasy: sendcontent free: 11376 chunk size:400
2018-05-15 15:05:09 Kernel.Debug    192.168.1.203   ESPT3 EspEasy: sendcontent free: 12384 chunk size:400
2018-05-15 15:05:09 Kernel.Debug    192.168.1.203   ESPT3 EspEasy: sendcontent free: 11544 chunk size:400
2018-05-15 15:05:09 Kernel.Debug    192.168.1.203   ESPT3 EspEasy: sendcontent free: 11376 chunk size:400
2018-05-15 15:05:09 Kernel.Debug    192.168.1.203   ESPT3 EspEasy: sendcontent free: 11400 chunk size:400
2018-05-15 15:05:09 Kernel.Debug    192.168.1.203   ESPT3 EspEasy: sendcontent free: 12384 chunk size:400
2018-05-15 15:05:09 Kernel.Debug    192.168.1.203   ESPT3 EspEasy: sendcontent free: 11544 chunk size:400
2018-05-15 15:05:09 Kernel.Debug    192.168.1.203   ESPT3 EspEasy: sendcontent free: 11376 chunk size:400
2018-05-15 15:05:09 Kernel.Debug    192.168.1.203   ESPT3 EspEasy: sendcontent free: 12384 chunk size:400
2018-05-15 15:05:20 Kernel.Debug    192.168.1.203   ESPT3 EspEasy: sendcontent free: 11184 chunk size:400
2018-05-15 15:05:20 Kernel.Debug    192.168.1.203   ESPT3 EspEasy: sendcontent free: 12552 chunk size:400
2018-05-15 15:05:20 Kernel.Debug    192.168.1.203   ESPT3 EspEasy: sendcontent free: 11712 chunk size:400
2018-05-15 15:05:20 Kernel.Debug    192.168.1.203   ESPT3 EspEasy: sendcontent free: 12184 chunk size:400
2018-05-15 15:05:20 Kernel.Debug    192.168.1.203   ESPT3 EspEasy: sendcontent free: 12832 chunk size:400
2018-05-15 15:05:20 Kernel.Debug    192.168.1.203   ESPT3 EspEasy: sendcontent free: 13352 chunk size:400
2018-05-15 15:05:20 Kernel.Debug    192.168.1.203   ESPT3 EspEasy: sendcontent free: 13184 chunk size:400
2018-05-15 15:05:20 Kernel.Debug    192.168.1.203   ESPT3 EspEasy: sendcontent free: 13832 chunk size:219
2018-05-15 15:05:20 Kernel.Debug    192.168.1.203   ESPT3 EspEasy: sendcontent free: 13664 chunk size:0
2018-05-15 15:05:20 Kernel.Notice   192.168.1.203   ESPT3 EspEasy: EVENT: MQTTimport#Disconnected
2018-05-15 15:05:20 Kernel.Debug    192.168.1.203   ESPT3 EspEasy: EVENT: MQTTimport#Disconnected Processing time:38 milliSeconds
2018-05-15 15:05:20 Kernel.Error    192.168.1.203   ESPT3 EspEasy: IMPT : MQTT 037 Connection lost
2018-05-15 15:05:23 Kernel.Notice   192.168.1.203   ESPT3 EspEasy: IMPT : Connected to MQTT broker with Client ID=ESPT3-Import        
2018-05-15 15:05:23 Kernel.Notice   192.168.1.203   ESPT3 EspEasy: EVENT: MQTTimport#Connected
TD-er commented 6 years ago

This "sendcontent" is debug information to track memory usage per function. It looks like there was some kind of disconnect, but it was not logged what caused the disconnect. A few days ago I changed the WiFi client usage as a work-around for some bug in the core library. Maybe you could try the last version?

giig1967g commented 6 years ago

loaded the latest version.

giig1967g commented 6 years ago

Unit ESPT6 just froze. It happened while I was typing. I noticed a quick flash of the led on the board. LCD is still displaying the old values. No reaction, no ping, no AP. This unit has UDP enabled.

TD-er commented 6 years ago

Had that unit been rebooted/reset after being flashed? There is a bug (longstanding bug) the ESP8266 crashes when rebooting after it has been flashed. You have to perform a cold boot, or press reset once after flashing and then it will reboot just fine.

Still leaves the question why it reboots.

giig1967g commented 6 years ago

removed UDP and serial log. Cold reboot. Will keep monitoring.

giig1967g commented 6 years ago

I don't think that UDP or serial log is to blame: look at this unit with firmware 20180330: UDP enabled and serial log enabled: running for 47 days!

image

TD-er commented 6 years ago

One of my units even has "Uptime: 80 days 2 hours 17 minutes" But the inter-ESPeasy communications have been changed quite recently. And I think I changed to core 2.4.x after 20180330.

Not saying it is an UDP problem, just that dismissing it as a possibility is a bit too early.

s0170071 commented 6 years ago

@TD-er I got one up for 90 days and 14 hours. This version of the FW seems quite stable. @giig1967g things to check:

  1. is your wifi uninterrupted and coverage is sufficient ? I experience hanging units which are sitting in a low signal area of my house. They only reconnect if I reset them.
  2. do they have a common power supply ? Have there been outages in your area? I had that too a while ago. Was a faulty junction two streets from here. You have a solar inverter ? It would log this kind of events.
  3. monitor your free memory. Leaks happen.
  4. reset (factory reset) your unit and re configure. I mean it. wipe it. Especially after an upgrade, strange things may happen. #1447 Watch it for a while before you start enabling plugins.
  5. dont' sync / mqtt / send every second.
  6. you got the sleep feature enabled ?
Grovkillen commented 6 years ago

I got one node with 9 DS18B20 hooked up with a 5V USB UPS and it's running for as long as I do not update it. So I suspect the power supply is very important. Small spikes (up or down) might cause the ESP to reboot?

giig1967g commented 6 years ago

I have disabled UDP and serial log. Installed 20180516 firmware. Units have been running for 3.5 days without problems. So maybe it's solved.

giig1967g commented 6 years ago

I can't believe it. I was so happy.... :( My unit rebooted just now...after 3.5 days of work:

Unit | 6
-- | --
Local Time | 2018-05-20 10:35:15
Uptime | 0 days 0 hours 5 minutes
Load | 8% (LC=7029)
Free Mem | 15592 (14792 - sendWebPage)
Boot | Manual reboot (2)
Reset Reason | Exception

Build
20102  - Mega
Libraries
ESP82xx Core 2_4_1, NONOS SDK 2.2.1(cfd48f3), LWIP: 2.0.3
GIT version
mega-20180516
Plugins
47 [Normal]
Build time
May 16 2018 02:12:17
Binary filename
ESP_Easy_mega-20180516_normal_ESP8285_1024.bin
TD-er commented 6 years ago

What plugins do you have enabled? An exception is quite often related to some programming error.

giig1967g commented 6 years ago

Dummy, LCD, SystemInfo, MQTTImport.

image

s0170071 commented 6 years ago

Have you done the factory reset?

-------- Ursprüngliche Nachricht -------- Von: Plebs notifications@github.com Gesendet: 20. Mai 2018 10:52:40 MESZ An: letscontrolit/ESPEasy ESPEasy@noreply.github.com CC: s0170071 drk@posteo.de, Comment comment@noreply.github.com Betreff: Re: [letscontrolit/ESPEasy] What could cause a unit to freeze? (#1436)

Dummy, LCD, SystemInfo, MQTTImport.

image

-- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/letscontrolit/ESPEasy/issues/1436#issuecomment-390466966

giig1967g commented 6 years ago

no. You mean that I should reprogram every unit after every software upgrade?

s0170071 commented 6 years ago

No. Only if you updated it and have problems afterwards.

Grovkillen commented 6 years ago

No you shouldn't have to do that with a official stable release but now we're trying to pinpoint bugs and one way is to try a factory reset and see if behavior changes. If it does we get a hint where to look.

giig1967g commented 6 years ago

ok. will do.

Fusseldieb commented 6 years ago

Socket Errors still occurr every often (Every 1-2 days). Only a restart of the ESP resolve it. Webserver is still up, but MQTT is dead.

Maybe try to switch to another MQTT library? Example: https://github.com/marvinroger/async-mqtt-client

TD-er commented 6 years ago

@Fusseldieb What version are you running? What core version, what LWIP version, what ESPeasy version?

Even Tasmota is now switching to this version of MQTT client. See https://github.com/arendst/Sonoff-Tasmota/commit/04493965e4c14b43c53c5baef5798a8cdabb2272

TD-er commented 5 years ago

I know this is an old issue, but it is one of the oldest I could find regarding instability + MQTT And it is also one of the first ones after this commit: https://github.com/TD-er/ESPEasy/commit/0e8b333f482fd12f7a7c584b875015e2efe7e490

So let me explain my thoughts here... I came across this issue on PubSubClient: https://github.com/knolleary/pubsubclient/issues/589 This is about the destruction of a WiFiClient. On May 13th, I made a commit to re-init the WiFiClient as a workaround suggested here: https://github.com/esp8266/Arduino/issues/4497#issuecomment-373023864

For the MQTT import, a new WiFiClient is created when there is no actual connection, but the client is not actively being set to the MQTT client.

So I would suggest to do something like this:

    if (!connected) {
      // workaround see: https://github.com/esp8266/Arduino/issues/4497#issuecomment-373023864
      if (MQTTclient_037 != NULL) {
        espclient_037 = WiFiClient();
        espclient_037.setTimeout(CONTROLLER_CLIENTTIMEOUT_DFLT);
        MQTTclient_037->setClient(espclient_037);
      }
      ++reconnectCount;
      addLog(LOG_LEVEL_ERROR, F("IMPT : MQTT 037 Connection lost"));
    }

(the last line is new + check for null pointer)

Maybe someone can already test this? Or someone has some ideas on this?

TD-er commented 5 years ago

This is no longer an issue, so it can be closed.