letscontrolit / ESPEasy

Easy MultiSensor device based on ESP8266/ESP32
http://www.espeasy.com
Other
3.26k stars 2.21k forks source link

Losing communication with the mqtt broker causes a delay in processing gpio events. #5075

Closed tabunslonopotamov closed 2 months ago

tabunslonopotamov commented 3 months ago

My espeasy board scripts control the gate drive, including processing stops based on limit switch signals.

By analyzing the logs, I discovered that somewhere between the messages Broker C005 connection failed (1/0) and Broker C005 connection failed (3/0), the procedure for processing the GPIO trailer event occurs with significant delays (more than 1 sec).

On the "Advanced" settings page, the "Enable RTOS Multitasking" option is grayed out (blocked from being enabled).

Tell me what can be done? The motor jams because the limit switch signal is processed with an unacceptable delay.

Is it possible to set a small tcp connection timeout around 30-40ms?

TD-er commented 3 months ago

RTOS multitasking is something I tried a long time ago and back then it wasn't usable as I needed to add quite a lot of locking which is not yet done in the ESPEasy code.

However the WiFi code itself (low level code, part of the SDK) is running in a separate RTOS task on the 2nd core of the ESP32 (when available, otherwise on the 1st core, but still in a separate RTOS task)

Still the Arduino code called to connect some host and also DNS lookup is blocking. Meaning right now there is little I can do to prevent such stalls other than maybe try to run those calls on a separate RTOS task. However that may involve quite a lot of work as I have no clue what issues will come up in this refactor.

But even if this is split into a separate RTOS task, you still have to have real-time behavior which isn't guaranteed when running rules. (or handling other sensors as well)

The best you can do, is change a bit in the wiring.

For example if you enter a limit switch, the motor should stop no matter what. Easiest way when using a DC motor is to place the limit switch in series with the motor and put some diode over it so you can only back off to get out of this limit switch.

Something like this (image found via Google) image

Or if you can't use a diode for whatever reason, you could use something like this: image

tabunslonopotamov commented 3 months ago

Unfortunately, using hardware locks is difficult in my case.

Is it possible to come up with something to reduce the impact of mqtt server unavailability on latency? for example, so that the TCP connection timeout simply occurs after 30-40ms?

TD-er commented 3 months ago

The timeout as set in the controller configuration is used as timeout when trying to attempt a connection. However I have seen that this timeout may actually be longer.

But I do strongly advice you to look into some failsafe hardware solution to turn off the motor. What if the ESP is crashing? Or a wire between the ESP or H-bridge gets loose? Or the link between ESP and limit switch gets broken? Or the H-bridge gets damaged and continuously will send current through the motor?

Having some (emergency) limit switch which will turn off the power to the motor is always good to have. You can also use a combination of a soft limit switch which will tell the ESP to turn off the motor and a hard limit switch which will cut power to the motor. There will always be some delay between hitting the soft limit switch and the motor is actually stopped. (it will not stop immediately when you cut power) So there is already some slack space to move after the soft limit switch. Why not add a hard limit switch (with diode to be able to get out again) where it might otherwise damage the motor or other parts of whatever is driven by the motor?

Not saying it isn't a problem in ESPEasy, or that I will not look into it, but there is no guaranteed minimum response time to act on the switch. So better add something extra for the emergency stop.

tabunslonopotamov commented 3 months ago

The system uses an industrial frequency regulator motor controller with control via a rs485 interface and keep-alive events from ESP32, in the absence of which an emergency stop of the drive is performed.

We can connect hardware limit switches to the drive, but now it’s difficult for me to do this, from the difficulties in disassembling the installation casings to the correct programming of the frequency regulator inputs.

We need at least some temporary solution, because it is impossible to take the equipment out of operation, and due to the onset of hot weather, the drive shutdown time decreased and began to coincide with the next attempt to connect to the broker.

That is, the scripts detect a break in communication with the broker and initiate the opening of the gate, and when the time comes for the limit switch event to fire, there is a delay due to network operations.

And as a result, every minor network failure causes the gate to be tightly blocked. And there are about 1,500 passages through them per day.

Maybe you can somehow increase the interval of attempts to connect to the broker to 90 seconds, for example?

TD-er commented 3 months ago

Is the ESP also processing relative high frequency pulse inputs?

Can you draw schematically what the ESP has as inputs and what as outputs? Also what is the time critical part? Maybe we can find some other way to make it way more realtime responsive.

tabunslonopotamov commented 3 months ago

There are no high frequency signals.

Beam speed 0.25m/s

The inductive hall sensor detects whether the beam extends beyond the working area. The power reserve before a collision is about 0.1m or 0.4 seconds including the full time for processing the event.

TD-er commented 3 months ago

But what has the ESP to do to stop the motor?

tabunslonopotamov commented 3 months ago
on swOpenLimit#State do
  if [swOpenLimit#State]=1 and [gate#state]=15 # if gate state is "opening" and end sensor raise to high
      sp_whex,01101870000204080000005CEB  # write hex command to rs-485
      TaskValueSetAndRun 10,2,13  # set gate state to "open"
  endif
tonhuisman commented 3 months ago

Comments should be prefixed with //, not #, that might confuse the rules parser... 🤔

Not sure where you get support for that sp_whex command, as that's not supported by ESPEasy.

TD-er commented 3 months ago

Are you sure this is about the MQTT reconnect? Maybe you can enable collecting timing stats and then let it run for a while to see what has the biggest impact. I do think it might be a WiFi scan when you need to reconnect to WiFi after a while. Such a scan may take upto 6 seconds, so it is more likely to cause said delays. However if this is what is causing these issues, then that's fixable. Still needs some coding, but it can be made non blocking or at least hardly blocking (scanning per channel for example)

And if the WiFI connection is so bad at that location, why not considering to use Ethernet? You can even look into PoE which makes powering also easier.

tabunslonopotamov commented 3 months ago

Not sure where you get support for that sp_whex command, as that's not supported by ESPEasy.

There are no problems with this command. The problem is that the raise gpio event begins to be processed with a delay of up to 5 seconds when trying to reconnect to mqtt.

sp_whex it shot alias for "serialproxy_writehex" from _P087_SerialProxy.ino which, in newer builds, may have been changed to serialproxy_writemix

      if (cmd.equalsIgnoreCase(F("serialproxy_writehex")) || cmd.equalsIgnoreCase(F("sp_whex"))) {
        P087_data_struct *P087_data =
          static_cast<P087_data_struct *>(getPluginTaskData(event->TaskIndex));

        if ((nullptr != P087_data)) {
          String param1 = parseStringKeepCase(string, 2);
          parseSystemVariables(param1, false);
          P087_data->sendStringhex(param1);
          addLogMove(LOG_LEVEL_INFO, param1);
          success = true;
        }
      }
tabunslonopotamov commented 3 months ago

Are you sure this is about the MQTT reconnect?

gate_scheme1

I have attached the ESP32 network layout diagram.

ESP32 is located in the village. Control over the script is implemented as follows: 1) detailed log of connections, disconnections and publications on the mqtt server 2) detailed log of WiFi and DHCP operation on the router

3) full syslog from espeasy with maximum debugging level. Including all GPIOs are registered as devices and their status changes are published in mqtt

4) Visual observations of esp32 operation by recording cameras

5) eyewitness observations

all data sources are synchronized via ntp

Based on the collected observations, the problem occurs when the MQTT server loses availability, while wifi and internet work stably.

Previously, in colder weather, the motor rotated a little slower and the attempt to restore communication with mqtt did not overlap the time interval at the end of the movement. Now in the summer the engine began to work faster and began to coincide.

uzi18 commented 3 months ago

What is your RSSI ? Do you have external wifi antenna? What is exact build of ESPEasy you use? Please show controller configuration / mqtt configuration page. What/Where is the code of sp_whex plugin/command? (check this: https://espeasy.readthedocs.io/en/latest/Plugin/P020.html#p020-page command serialsendmix) Try to disable syslog plugin and check if mqtt lag changes.

tabunslonopotamov commented 3 months ago

What is your RSSI ? Do you have external wifi antenna? What is exact build of ESPEasy you use? Please show controller configuration / mqtt configuration page. What/Where is the code of sp_whex plugin/command? (check this: https://espeasy.readthedocs.io/en/latest/Plugin/P020.html#p020-page command serialsendmix) Try to disable syslog plugin and check if mqtt lag changes.

rssi is low , no external antenna , poor environmental conditions. It is planned to switch to working via Ethernet; Ethernet has not been launched on this board for a long time. Recently I managed to solve it. In the specific problematic cases that I am writing about, there were no disruptions in WiFi operation. And there was a general problem with the availability of a remote mqtt server, recorded from several points on the Internet.

Build: 20230224 - Mega32 Build Origin: Self built Build Time:⋄ Feb 24 2023 04:13:04 Binary Filename: ESP_Easy_mega_20230224_custom_ESP32_4M316k_ETH Build Platform: Windows-7-6.1.7601-SP1

gate_controller_setting

sp_whex command removed from latest espeasy version. look up message

uzi18 commented 3 months ago

As you use custom builds you can define timeouts for your builds

TD-er commented 3 months ago

So the unit got disconnected from/by the MQTT broker, but the ESP had not lost its WiFi connection to the AP? This means there has been some reason for the broker to disconnect the ESPEasy node and somehow the ESP did not notice this disconnect.

Given you have some unpredictable hops between the ESP and the MQTT broker, I think you should increase (!!) the timeout set in the controller. Usually the MQTT reconnect is quite fast, assuming you don't need to reconnect to WiFi first. I did add some dynamic adjustments of the timeout based on the response time of previous published messages. So if you did set the timeout to 100 msec but the response time was longer than 1/3rd of this timeout, the new timeout would be set to 3x the average over the last few messages. (with max. of 1 sec if memory serves me right) However the initial attempt to make a connection is purely based on the settings. And given you're sending via public internet, I would not be surprised if the average ping time alone is already 50+ msec. So it might be that when you experience higher ping times, the initial timeout is too short.

Just to be sure, I will now look into the code to see whether the timeout is handled correctly.

Also could you try to make a build based on the current code and let some test node run with the same MQTT broker settings at that location? Preferrable a LittleFS build as that's using the latest ESP-IDF5.1 code. N.B. when flashing a LittleFS build over a SPIFFS build, you will loose your filesystem and thus your settings. So make sure to backup those.

N.B.2 Just as a precaution, please make sure "Use Last Connected AP from RTC" and "Restart WiFi Lost Conn" are both checked on the tools->Advanced page. This will make any WiFi reconnect more reliable and faster if it can reconnect to the same AP. This can also be done on the existing node.

When you experience low RSSI values, you might also want to consider checking "Force WiFi B/G" as this allows for a few dBm less signal strength while still maintaining a stable connection. However this highly depends on the used access point as some are notoriously bad at handling both 802.11n and 802.11g

tonhuisman commented 3 months ago

Not sure where you get support for that sp_whex command, as that's not supported by ESPEasy.

There are no problems with this command.

The problem with this command is that it was introduced in a pull request (#3980) that was never merged, but closed, The intent of that PR was implemented differently, but the command-alias wasn't ever included in ESPEasy.

This implies that you are running a self-built Custom version of ESPEasy, and most likely not using the latest source code, making it very hard to provide support. Stating the fact that you are using a custom build in the initial message would have been quite helpful.

Please update your source code to the latest available, as providing support on older releases is really hard because of the way software in general is evolving (updated Arduino framework versions f.i.) and other/many improvements in the ESPEasy code.

As an alternative you can use a standard Collection C build (not sure what plugins you need besides P087) or a MAX build (including all plugins) on an ESP32 with 16MB Flash memory, ESP32-S3 with 8MB or 16MB Flash, or an ESP32-C6 (preliminary support) with 8MB or 16MB Flash.

TD-er commented 3 months ago

Can you show the timingstats? This allows us to see where blocking code might be.

N.B. the timingstats will be cleared every time you reload the timing stats page.

TD-er commented 3 months ago

By the way, you can also try to change the MQTT_KEEPALIVE Just define it as 5 seconds in your custom build, or change it in lib/pubsubclient/src/PubSubClient.h (line 33)

tabunslonopotamov commented 3 months ago

So the unit got disconnected from/by the MQTT broker, but the ESP had not lost its WiFi connection to the AP? This means there has been some reason for the broker to disconnect the ESPEasy node and somehow the ESP did not notice this disconnect.

The broker had no reason. There was a short-term (2-3 minutes) disruption of communication via the Internet. This happens to us sometimes.

Ping delay via ethernet connection to mqtt server is 4ms . WiFi endpoint +2ms - 14ms

I have a second ESP32 available in the specified location and conditions and with the same controller settings. Works as a key radio on a pole.

version: Build:20230803 - Mega32 System Libraries:ESP32 SDK 4.4.4.20230310 Git Build:My Build: Aug 3 2023 23:39:15 Plugin Count:14 [No Debug Log] Build Origin:Self built Build Time:Aug 3 2023 23:38:28 Binary Filename:ESP_Easy_mega_20230803_custom_ESP32_4M316k_ETH Build Platform:Windows-10-10.0.19044-SP0

TD-er commented 3 months ago

I think I found something that for sure needs to be changed and you can also change it yourself and make a new build based on the same code base.

The MQTT_SOCKET_TIMEOUT is defined as 15 sec which is way too long. And when the connection gets interrupted while reading bytes, this may eventually block all other code execution for upto 15 seconds.

So as a test you can try to lower this to 1 second, while I will look into a more suitable fix.

tabunslonopotamov commented 3 months ago

The problem with this command is that it was introduced in a pull request (#3980) that was never merged, but closed, The intent of that PR was implemented differently, but the command-alias wasn't ever included in ESPEasy.

I don't remember getting the source tree any other way than just "git clone" head. Probably, one day this request was included in the main branch.

I can't use another esp32 model because I'm using a factory controller.

relay_board

TD-er commented 3 months ago

That looks like a Dingtian 8ch board right? With the latest code and a "LittleFS" build there is support for this Ethernet chip (JL1101 if I'm not mistaken) You need the LittleFS build for this as it requires ESP-IDF5.1 or newer for this Ethernet chip. (and the ESPEasy/Tasmota patch applied to the code as developed by Jason2866 and me)

Edit: Just for completeness the required settings for that Ethernet chip if it is indeed the same board I have here: image

tabunslonopotamov commented 3 months ago

That looks like a Dingtian 8ch board right? With the latest code and a "LittleFS" build there is support for this Ethernet chip (JL1101 if I'm not mistaken) You need the LittleFS build for this as it requires ESP-IDF5.1 or newer for this Ethernet chip. (and the ESPEasy/Tasmota patch applied to the code as developed by Jason2866 and me)

Yes, this was my issue request where the Ethernet issue on this board was resolved. But I have not yet installed a new board with an Ethernet connection, since maintenance work on this passage is very difficult. Residents start howling as soon as you turn it off for 5 minutes, that strangers are about to come.

tabunslonopotamov commented 3 months ago

Can you show the timingstats?

I need some time to assemble a test stand with a blinking LED. I don’t want to experiment with a productive installation, since every time I have to go do mechanical work to wedge the beam.

TD-er commented 3 months ago

With trying "another board" I meant literally any ESP32 board. No need to get it wired up, just to test if it also experiences these kinds of MQTT issues at that specific location.

By the way, I'm thinking of changing the PubSubClient::readByte function into this:

boolean PubSubClient::readByte(uint8_t * result) {
   if (_client == nullptr) {
     return false;
   }
   uint32_t previousMillis = millis();
   while(!_client->available()) {
     delay(1);  // Prevent watchdog crashes
     if((int32_t)(millis() - previousMillis) >= static_cast<int32_t>(_client->getTimeout())){
       return false;
     }
   }
   *result = _client->read();
   return true;
}

Still needs some testing, but I guess this may fix quite a lot of 'odd' MQTT related issues as the current code appears to be completely blocking for upto 15 sec. So you've only seen upto half of the worst-case delays

TD-er commented 3 months ago

Hmm the more I start digging, the more smelly it gets.... Apparently pubsubclient is using a fixed timeout on ESP32 to make the connection:

// ESP32_CONNECTION_TIMEOUT : Specific case for ESP32, we need to manually provide timeout as default (-1) leads to WDT reset (after 5 seconds).
// By default (4500 milliseconds) 4,5 seconds to avoid reaching 5s default watchdog reset time.
// This is multiplied in WiFiClient.cpp (part of arduino-esp32) by 1000 inside [WiFiClient::connect] method.
// ESP8266 Arduino framework in contrast has fixed 5000ms timeout. No need to define it manually here.
// See: https://github.com/knolleary/pubsubclient/pull/842
#ifndef ESP32_CONNECTION_TIMEOUT
#define ESP32_CONNECTION_TIMEOUT 4500
#endif

I have not looked closely at this code since 2018...

tabunslonopotamov commented 3 months ago

Hmm the more I start digging, the more smelly it gets.... Apparently pubsubclient is using a fixed timeout on ESP32 to make the connection:

well, judging by the quote, this is only if ESP32_CONNECTION_TIMEOUT was not defined anywhere before.

TD-er commented 3 months ago

I do set the timeout in the client, so it should not try to override such specifically set timeouts while making the connection. Otherwise, I get issues like (random example) "Losing communication with the mqtt broker causes a delay in processing gpio events"

TD-er commented 3 months ago

Can you test the changes I made in this PR: https://github.com/letscontrolit/ESPEasy/pull/5076 ?

uzi18 commented 3 months ago

@TD-er so it looks like our core controller was buggy. Maybe some forks of this lib were already patched for more bugs?

TD-er commented 3 months ago

I have looked at a few forks not that long ago. And since I also added some extra stuff to pubsubclient which is not in the original lib (nor any fork), I'm not that eager to swicth.

Here is one of the more active forks: https://github.com/hmueller01/pubsubclient/tree/dev-fixes

uzi18 commented 3 months ago

here not sure if we use MQTT_MAX_TRANSFER_SIZE but uint8_t could be too small: https://github.com/letscontrolit/ESPEasy/blob/8fb409cd6f180ee5d67cd32cffccf5441a88f908/lib/pubsubclient/src/PubSubClient.cpp#L576-L581

TD-er commented 3 months ago

OK, just looked a bit through the most recent commits of that one and no way we're going to use it. Take for example this commit: https://github.com/hmueller01/pubsubclient/commit/870e8d26f8b078ebbccef5f9c7bc288873723fd4

Look at this commit: https://github.com/hmueller01/pubsubclient/commit/f064788c1e0ee1dcdc76c033dd01d578ea8af18e

When using inheritance, you should absolutely not call member functions in the constructor and here they try to call the constructor of itself from other constructors.

If you want to clean-up the constructors by using default values, assign them when declaring the members in the .h file. Not like this.

TD-er commented 3 months ago

here not sure if we use MQTT_MAX_TRANSFER_SIZE but uint8_t could be too small:

https://github.com/letscontrolit/ESPEasy/blob/8fb409cd6f180ee5d67cd32cffccf5441a88f908/lib/pubsubclient/src/PubSubClient.cpp#L576-L581

We don't use it but indeed it might be a bit small. I will take a look at other uses of uint8_t or unsigned char to see if there are similar issues for other functions.

Edit: Hmm some functions have a size of uint16_t. Not sure if that's potentially too small. Also no idea if it may cause issues when needing more bytes to mark the length of some message. Anyway we can't send it as a single message anyway unless it is streamed and that's done in some functions I added a while ago.

tabunslonopotamov commented 2 months ago

Please tell me what decision was ultimately made in the main branch? Should it work differently when running over Ethernet than over WiFi? Switched to Ethernet and found the problem persisted. (although it was also not possible to reliably find out whether it disappeared via WiFi).

The Ethernet port itself and the nearest router do not turn off, my connection to the mqtt server is lost and everything freezes for a few seconds.

TD-er commented 2 months ago

Which build are you using?

tabunslonopotamov commented 2 months ago

Which build are you using?

Build: ESP_Easy_mega_20240706_custom_ESP32_4M316k_LittleFS_ETH Jul 6 2024

TD-er commented 2 months ago

Hmm that's so like "5 minutes ago"... ;)

Right now I have no idea what may be the problem here.

uzi18 commented 2 months ago

maybe we can try to connect to your mqtt server with other username/password and check if behavior is espeasy independent?