tbnobody / OpenDTU

Software for ESP32 to talk to Hoymiles/TSUN/Solenso Inverters
GNU General Public License v2.0
WiFi issue still exists #2202

Open SteffenR87 opened 1 month ago

SteffenR87 commented 1 month ago

What happened?

I was affected by the WiFi disconnect in the previous release. Two days ago I updated to the latest version, and today at 4 pm I had the same kind of disconnect. After a reboot it was back online.

To Reproduce Bug

I will wait and see whether it happens again in two days' time.

Expected Behavior

Stable WiFi connection

Install Method

Pre-Compiled binary from GitHub

What git-hash/version of OpenDTU?

Latest

Relevant log/trace output

No response

Anything else?

No response

CommanderROR9 commented 2 weeks ago

I have "fixed" the issue (for now) by turning off automatic channel selection for 2.4 and 5 GHz on my Fritzbox. Since it has only been running for a day, though, I can't say for sure how it would react to something like the router or one of the mesh APs restarting after an update. So while this is a passable workaround, I still think OpenDTU should check the WiFi connection every minute or so and reconnect if it has lost the link. It should not be too hard to implement, since literally every other piece of tech manages it...🤷‍♂️

stefan123t commented 2 weeks ago

@CommanderROR9 as you seem to be eager to test things out on a separate ESP32, you may want to take a look at the ESP-IDF roaming examples using 802.11k/r/v, as mentioned earlier?

https://github.com/tbnobody/OpenDTU/issues/2202#issuecomment-2289412118

I do not know what exactly has to be changed to make OpenDTU implement the (fast) basic service set transition and wireless network management in those roaming examples to support mesh steering as used by Fritz!Box Mesh.

jonastr commented 2 weeks ago

@stefan123t

I do not know what exactly has to be changed to make OpenDTU implement the (fast) basic service set transition and wireless network management in those roaming examples to support mesh steering as used by Fritz!Box Mesh.

I am no expert in this either -- WDYT about the following pragmatic 1st step: Check connectivity regularly, attempt to reconnect if connection is lost?

I am suggesting/asking this because I have another Wifi device with a USB Wifi dongle that does not support mesh steering either (I think none of the k/r/v standards). But it still reconnects when the AP terminates the connection -- according to AVM docs, this is what the Fritz mesh implementation does to force legacy devices to look for a better AP connection. It's not really mesh steering because these legacy devices still only pick the AP based on their local signal strength, not based on mesh load.

Anyway, this machine is running Ubuntu Linux. In other words, there appear to be different strategies, from a basic "just reconnect to the AP with the best signal" to fully following the standards for service set transition (mesh steering).

Andre0be commented 2 weeks ago

I think only an implementation of 802.11 k/r/v will help in combination with AVM Mesh. Otherwise, the options are to use a fixed channel (channel 1 is stable for me) or to not include the devices in the mesh. It would be nice if someone could tell me whether there are plans to implement k/r/v in OpenDTU.

jonastr commented 2 weeks ago

@Andre0be

I think only the implementation of 802.11 k/r/v will help in the constellation with AVM Mesh.

Could you elaborate, why do you think that is the case? Like how is an average legacy Wifi dongle different (see my example above) - or why would a simple reconnect strategy not help at all? I don't say it would be the best solution, but couldn't it at least help alleviate the problem until proper k/r/v is implemented?

tbnobody commented 2 weeks ago

I don't say it would be the best solution, but couldn't it at least help alleviate the problem until proper k/r/v is implemented.

k/r/v has to be implemented in the Arduino core. This is nothing which can be done in user space. Checking whether the connection is established is indeed already done. It's not done per interval; it's event-based. The IDF/Arduino core sends a disconnect event, and then the connection is re-established. In this case you should also see a console output. If this does not happen, it's a bug in the Arduino core.

CommanderROR9 commented 2 weeks ago

I know almost nothing about ESP32 Hardware or Software development...and I probably won't have much time to experiment with mine over the next two weeks.

My guess as to what is causing this issue, though, is that the ESP32 doesn't realise it has been disconnected, due to the nature of the disconnect. If the network were to drop out, this would likely trigger the reconnect. However, if it's only a "channel change" by the Fritz Box, this might not trip the right wires. The ESP will perhaps still see the WiFi and not realise it can no longer send or receive anything. IMHO this could easily be rectified by a "heartbeat" which would essentially ping the DNS host at regular intervals and trigger a disconnect/reconnect if the ping times out.

tbnobody commented 2 weeks ago

would essentially ping the DNS Host at regular intervals

DNS is not mandatory for a working DTU. That means it can be empty.

CommanderROR9 commented 2 weeks ago

That's interesting. I wasn't aware the network connection could work without DNS. Would that be with mDNS active then?

jonastr commented 2 weeks ago

@tbnobody I see.

Regardless of DNS, the DTU could nonetheless ping any IP address, right? Say, some host in the internal network like the router's or the MQTT server's address, possibly as an optional configuration.

CommanderROR9 commented 2 weeks ago

That would be an easy test IMHO: add an optional "ping heartbeat" setting. In my case I would just ping the router at 192.168.178.1, since that should always be online.
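
The "ping heartbeat" idea discussed above could be sketched roughly as follows. This is only an illustration of the counting logic and is not OpenDTU code: the class name `HeartbeatWatchdog` and both callbacks are hypothetical, and the actual probe (for example an ICMP ping to the router) and the actual reconnect call are left to the caller.

```cpp
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical helper, not part of OpenDTU: counts consecutive failed
// heartbeat probes and requests a reconnect once a threshold is reached.
class HeartbeatWatchdog {
public:
    // probe:       returns true if the heartbeat target answered
    // reconnect:   called when too many probes in a row have failed
    // maxFailures: number of consecutive failures that triggers a reconnect
    HeartbeatWatchdog(std::function<bool()> probe,
                      std::function<void()> reconnect,
                      uint8_t maxFailures)
        : _probe(std::move(probe))
        , _reconnect(std::move(reconnect))
        , _maxFailures(maxFailures)
    {
    }

    // Call once per interval (e.g. once a minute from a scheduler task).
    // Returns true if this call triggered a reconnect.
    bool tick()
    {
        if (_probe()) {
            _failures = 0; // target reachable, reset the counter
            return false;
        }
        if (++_failures < _maxFailures) {
            return false; // tolerate a few lost probes
        }
        _failures = 0;
        _reconnect();
        return true;
    }

    uint8_t failures() const { return _failures; }

private:
    std::function<bool()> _probe;
    std::function<void()> _reconnect;
    uint8_t _maxFailures;
    uint8_t _failures = 0;
};
```

Requiring several consecutive failures before reconnecting is a deliberate choice in this sketch, so that a single lost ping does not tear down an otherwise working link.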

stefan123t commented 2 weeks ago

I don't say it would be the best solution, but couldn't it at least help alleviate the problem until proper k/r/v is implemented.

k/r/v has to be implemented in the Arduino core. This is nothing which can be done in user space. Checking whether the connection is established is indeed already done. It's not done per interval; it's event-based. The IDF/Arduino core sends a disconnect event, and then the connection is re-established. In this case you should also see a console output. If this does not happen, it's a bug in the Arduino core.

@tbnobody did you look into the ESP-IDF example code for 802.11k/r/v mesh steering?

https://github.com/espressif/esp-idf/tree/master/examples/wifi/roaming

Maybe it is not yet implemented in the Arduino core which OpenDTU builds rely on.

But on another thought maybe the events and processes of switching the STA from one of these BSSIDs to another BSSID allow us to determine which event may be missing / needs to be changed to make it work in Arduino/OpenDTU too ?

https://github.com/espressif/esp-idf/blob/6673376297b921d438790d195014b860eaf8bb17/examples/wifi/roaming/roaming_11kvr/main/roaming_example.c#L48-L55

Here they explicitly distinguish the disconnect reason WIFI_REASON_ROAMING:

    } else if (event_base == WIFI_EVENT && event_id == WIFI_EVENT_STA_DISCONNECTED) {
        wifi_event_sta_disconnected_t *disconn = event_data;
        ESP_LOGI(TAG, "station got disconnected reason=%d", disconn->reason);
        if (disconn->reason == WIFI_REASON_ROAMING) {
            ESP_LOGI(TAG, "station roaming, do nothing");
        } else {
            esp_wifi_connect();
        }

tbnobody commented 2 weeks ago

Currently I don't care about the disconnect reason. If a disconnect is received I just try to reconnect: https://github.com/tbnobody/OpenDTU/blob/0cc55f3b8723723aea234954eaf7ab6cd13b53a4/src/NetworkSettings.cpp#L78-L86

But that leads again to my request for the console output in case of a reconnect... Do you see "WiFi disconnected" in the serial console when the issue occurs?

jonastr commented 2 weeks ago

@tbnobody I would be happy to try to provide some logs. Unfortunately, I am not sure I have the means to retrieve the serial log. Did a bit of googling, returned various results, all with some gaps in the instructions, I feel.

The best resource I could find is this one: https://docs.espressif.com/projects/esp-idf/en/latest/esp32/get-started/establish-serial-connection.html. But it's still pretty vague, not sure what to do exactly for the board that I have.

Do you have any better resources/howtos at hand?

tbnobody commented 2 weeks ago

Do you have any better resources/howtos at hand?

https://www.opendtu.solar/firmware/howto/serial_console/

schneeer commented 2 weeks ago

Hi, I am following the topic with interest. Just a little update on my setup. I will then leave the field here to the specialists.

For most users, WLAN shouldn't cause any problems; otherwise there would only be a handful of OpenDTUs around.

My DTU has been running completely error-free for almost 23 days since the update to v24.8.5.

Connection characteristics: Wi-Fi 4, 20 MHz, WPA3, 1 x 1

FRITZ!Box 7590, FRITZ!OS: 7.90-115394 BETA
1 repeater 2400 as a LAN bridge, connected to the FB
1 repeater 1750E as a LAN bridge, connected to the FB
1 repeater 1750E as a WLAN bridge, connected to the 1st 1750E

Band and mesh steering are active
WLAN auto channel is active

2.4 GHz currently on channel 1
5 GHz currently on channel 36

At the moment the DTU (OpenOTU-BW) was directly connected to the FRITZ!Box (as can be seen in the picture). I have now restarted it and it is now connected to the 2nd 1750E repeater (WLAN bridge). In the evening, however, the power to the 2nd 1750E is switched off, for personal reasons to protect the neighbors from noise. The DTU would then definitely have to reconnect to another AP. I just simulated this by turning the fuse off and on. After 2 minutes the DTU was online again on the 1st 1750E.

I'm keeping an eye on this. If any unusual behavior turns up, I will report it.

[Image: Screenshot 2024-09-02 135032]

stefan123t commented 2 weeks ago

@schneeer please consider providing us with a detailed serial log as requested:

console output in case of a reconnect... Do you see a WiFi disconnected in the serial console if the issue occurs?

Follow the link to the documentation to set up USB / serial logging: https://www.opendtu.solar/firmware/howto/serial_console/

@tbnobody would it be possible to add the disconnect reason to the logging somehow ?

I searched in arduino-esp32 and found the following issue for WiFi Roaming:

https://github.com/espressif/arduino-esp32/issues/7921

The related PR enables the following four options:

https://github.com/espressif/esp32-arduino-lib-builder/pull/160

CONFIG_ESP_WIFI_11KV_SUPPORT=y
CONFIG_ESP_WIFI_SCAN_CACHE=y
CONFIG_ESP_WIFI_MBO_SUPPORT=y
CONFIG_ESP_WIFI_11R_SUPPORT=y

schneeer commented 2 weeks ago

Yes, I can do that, but as I said, to date the DTU has run without any errors for 23 days without (Wifi) interruption. I have no problems. I only wrote my post because I read here that the FRITZ!Box Mesh could be the cause. That doesn't apply to me.

I can relatively easily deactivate every WLAN AP in the mesh to which the DTU is currently connected.

The 1st Test

The 2nd test

The 3rd test

Edit:

I have no idea why the log formatting is now broken. In the preview everything was listed neatly in order.

tbnobody commented 2 weeks ago

Yes, I can do that, but as I said, to date the DTU has run without any errors for 23 days without (Wifi) interruption. I have no problems.

Yes, the same applies to me. I also cannot reproduce this issue. 11K is enabled on the relevant SSID, 11V is disabled (as I don't have any devices which support it). And that's absolutely OK. If older devices don't support 11K, then depending on the AP they simply get kicked out if one AP thinks that another one has a better connection (RSSI). Background: A WiFi client stays at one AP as long as it can reach it, even if it finds another one with the same SSID but a different BSSID. A workaround in the past (and still today, for compatibility reasons) is to kick the client from one AP and let it connect to another one. This is exactly what @schneeer describes when he turns off one repeater manually.
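
To illustrate the background above: once a legacy client has been kicked, it typically rescans and joins whichever AP with the matching SSID it hears best. A minimal sketch of that selection, assuming a hypothetical `ScanResult` struct and `pickStrongest` function (neither is part of OpenDTU or the Arduino core), could look like this:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical scan entry for illustration only.
struct ScanResult {
    std::string ssid;
    std::string bssid;
    int8_t rssi; // signal strength in dBm, closer to 0 is stronger
};

// Returns a pointer to the strongest AP that announces the given SSID,
// or nullptr if no AP with that SSID was seen in the scan.
const ScanResult* pickStrongest(const std::vector<ScanResult>& scan,
                                const std::string& ssid)
{
    const ScanResult* best = nullptr;
    for (const auto& ap : scan) {
        if (ap.ssid != ssid) {
            continue; // different network, ignore
        }
        if (best == nullptr || ap.rssi > best->rssi) {
            best = &ap;
        }
    }
    return best;
}
```

Note that this decision is based purely on the client's local signal strength. It knows nothing about mesh load, which is the difference from real 802.11k/v steering mentioned earlier in the thread.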

tbnobody commented 2 weeks ago

@tbnobody would it be possible to add the disconnect reason to the logging somehow ?

Sure, just apply these code changes or wait for the next release:

diff --git a/include/NetworkSettings.h b/include/NetworkSettings.h
index 40ddc914..433867e9 100644
--- a/include/NetworkSettings.h
+++ b/include/NetworkSettings.h
@@ -62,7 +62,7 @@ private:
     void setStaticIp();
     void handleMDNS();
     void setupMode();
-    void NetworkEvent(const WiFiEvent_t event);
+    void NetworkEvent(const WiFiEvent_t event, WiFiEventInfo_t info);

     Task _loopTask;

@@ -85,4 +85,4 @@ private:
     bool _lastMdnsEnabled = false;
 };

-extern NetworkSettingsClass NetworkSettings;
\ No newline at end of file
+extern NetworkSettingsClass NetworkSettings;
diff --git a/src/NetworkSettings.cpp b/src/NetworkSettings.cpp
index 55ea428e..31313feb 100644
--- a/src/NetworkSettings.cpp
+++ b/src/NetworkSettings.cpp
@@ -23,20 +23,21 @@ NetworkSettingsClass::NetworkSettingsClass()
 void NetworkSettingsClass::init(Scheduler& scheduler)
 {
     using std::placeholders::_1;
+    using std::placeholders::_2;

     WiFi.setScanMethod(WIFI_ALL_CHANNEL_SCAN);
     WiFi.setSortMethod(WIFI_CONNECT_AP_BY_SIGNAL);

     WiFi.disconnect(true, true);

-    WiFi.onEvent(std::bind(&NetworkSettingsClass::NetworkEvent, this, _1));
+    WiFi.onEvent(std::bind(&NetworkSettingsClass::NetworkEvent, this, _1, _2));
     setupMode();

     scheduler.addTask(_loopTask);
     _loopTask.enable();
 }

-void NetworkSettingsClass::NetworkEvent(const WiFiEvent_t event)
+void NetworkSettingsClass::NetworkEvent(const WiFiEvent_t event, WiFiEventInfo_t info)
 {
     switch (event) {
     case ARDUINO_EVENT_ETH_START:
@@ -76,7 +77,8 @@ void NetworkSettingsClass::NetworkEvent(const WiFiEvent_t event)
         }
         break;
     case ARDUINO_EVENT_WIFI_STA_DISCONNECTED:
-        MessageOutput.println("WiFi disconnected");
+        // Reason codes can be found here: https://github.com/espressif/esp-idf/blob/5454d37d496a8c58542eb450467471404c606501/components/esp_wifi/include/esp_wifi_types_generic.h#L79-L141
+        MessageOutput.printf("WiFi disconnected: %d\r\n", info.wifi_sta_disconnected.reason);
         if (_networkMode == network_mode::WiFi) {
             MessageOutput.println("Try reconnecting");
             WiFi.disconnect(true, false);

schneeer commented 2 weeks ago

I was interested to know whether the latest updates have led to increased DTU activity and thus increased power consumption. It could have been that, for example, the power supply is too weak and therefore there are problems with the WLAN or problems with the connection to the inverter.

I couldn't find anything unusual.

When restarting, my DTU with display and NRF24 requires around 180 mA, and after a few seconds it only needs just under 100 mA. This is pretty much the same for several firmware versions.

So I can rule that out.

tc66c USB-C Power-Meter

CommanderROR9 commented 2 weeks ago

I still can't provide any meaningful insight, but turning off "auto channel" on my FritzBox 7690 seems to have completely eliminated the issue.

Edit: I installed an Update for my Router this afternoon. At first the OpenDTU was still alive after the Update, but had connected to a different AP within the mesh. When I checked again later, the DTU had completely lost connection to the Network and never reconnected.

Andre0be commented 1 week ago

I bought an ESP32-S3 because it is said to work better with WiFi. Yes, it's true. The S3 can handle AVM Mesh much better. I'm now going to switch to the S3 completely.

stefan123t commented 1 week ago

We would need some more logs from any of the Mesh users posting about a problem here:

@SteffenR87 @jonastr @DietmarSi @ComGreed @CommanderROR9 @home-cloud @cortmen and whoever else still has the problem of losing the connection with the DTU / inverter: please 🙏 send us a log so we can analyse the issue in more detail.

Currently we only have anecdotal evidence that you are all plagued by this issue and it may have something to do with Mesh Roaming / Steering. But we need logs (Oh Precious Lognuts).

Follow the link to the documentation to set up USB / serial logging: https://www.opendtu.solar/firmware/howto/serial_console/

PS: @Andre0be we have found out that neither OpenDTU nor the default Arduino WiFi libraries support 802.11r/k/v yet. That is, the ESP-IDF has some example code, but regardless of whether you are using the ESP32S3 or any other Espressif chip, neither will make use of these developments yet.

You may be right that the ESP32S3 could be a bit better at WiFi in general, e.g. compared with the slightly older ESP32 Wroom models, but not with regard to the above three standards for mesh roaming and steering.

jonastr commented 1 week ago

We would need some more logs from any of the Mesh users posting about a problem here:

@SteffenR87 @jonastr @DietmarSi @ComGreed @CommanderROR9 @home-cloud @cortmen and whoever else still has the problem of losing the connection with the DTU / inverter: please 🙏 send us a log so we can analyse the issue in more detail.

Hi @stefan123t, fully understood. While I am willing to contribute, I haven't had the time yet to set everything up. It's no one's fault, but having to get another machine ready to monitor the logs doesn't make it easier; I don't have one at hand. It may take some more days, perhaps weeks, until I find the time. :)

In the meantime, I can report that my problems seem to have vanished for now after physically relocating the DTU very close to the main Fritzbox. This was only possible because I had relocated the inverter before. In any case, I have had no interruptions for several days since then, despite mesh steering being activated. I am pretty sure I can reproduce it in the old physical DTU location, though. So again, when I find the time, I will capture the logs.

CommanderROR9 commented 1 week ago

So.... unfortunately I still can't provide any logs, but after having a stable connection for about a week I decided to turn the "Auto Channel" feature back on in my FritzBox and...after just an hour or so the OpenDTU lost connection again and never rejoined the Network.

stefan123t commented 1 week ago

@CommanderROR9 please provide the logs for this, otherwise we cannot analyse the problem. Thanks!

Follow the link to the documentation to set up USB / serial logging: https://www.opendtu.solar/firmware/howto/serial_console/

CommanderROR9 commented 1 week ago

@CommanderROR9 please provide the logs for this, otherwise we cannot analyse the problem. Thanks!

Follow the link to the documentation to set up USB / serial logging: https://www.opendtu.solar/firmware/howto/serial_console/

Next week I should have a little more time and will attempt to provide the logs.

jonastr commented 1 week ago

I am also curious about the logs! From what you observe @CommanderROR9 and from what I have observed, I see the following pattern that is not so much related to Mesh only: The DTU has trouble reconnecting in multiple scenarios:

cortmen commented 1 week ago

Hi there, I have changed my OpenDTU to an ESP32-S3 and moved it to a new location. For one week now there has been no WiFi disconnect with v24.8.5.

SebStaeubert commented 1 day ago

As a test, the next time the error occurs, I would simply leave the OpenDTU switched on and power-cycle the WLAN router to see whether the DTU then reconnects.

Good idea! I'll report back.

That does not help in my case. A connection is only re-established after restarting the OpenDTU (I waited in vain for more than an hour, hoping the connection would be restored automatically).