openshwprojects / OpenBK7231T_App

Open source firmware (Tasmota/Esphome replacement) for BK7231T, BK7231N, BL2028N, T34, XR809, W800/W801, W600/W601 and BL602
https://openbekeniot.github.io/webapp/devicesList.html

Socket restarts for no apparent reason #949

Open CLAM01 opened 10 months ago

CLAM01 commented 10 months ago

My smart sockets restart every now and then for no apparent reason. How can I find out why the devices keep restarting? The log is always reset after a reboot.


CLAM01 commented 10 months ago

I have found the ping watchdog. The watchdog was enabled by default, but the IP address to ping was not my router. I think I lost my WLAN connection, and 0 seconds after trying to ping the wrong IP address the system restarted?

CLAM01 commented 10 months ago

With the watchdog timer activated the whole thing is even worse. All devices now go crazy, regardless of whether a wdt is switched on or not.

CLAM01 commented 10 months ago

I am checking whether the mesh in my home is causing the problems. I have reduced the WLAN power of my Fritzbox 7530 and will watch over the next few days. Is it possible that OpenBK reboots when the Fritzbox sends a disconnect message?

CLAM01 commented 10 months ago

I've now tested it. It can't be my mesh. A wide variety of devices restart on different access points almost in parallel. If I unplug a repeater on which devices are registered, they switch to the next best repeater without rebooting.

What can I do to find the error? Can I have the log of a device written externally via MQTT or similar?

CLAM01 commented 10 months ago

I think it's a heap problem...

CLAM01 commented 10 months ago

Here is a debug log from my LSC smart mood light.

ExtraDebug:HTTP:Is MSEARCH - responding
Debug:HTTP:DRV_SSDP_Send_Advert_To: sent message
Error:MAIN:Low heap warning!
Info:MAIN:Time 1797, idle 171397/s, free 23328, MQTT 1(1), bWifi 1, secondsWithNoPing 1714, socks 7/38
xxxxxxxxxxxxxxxxxxxxxxxx]
Info:MQTT:MQTT_RegisterCallback called for bT MoodLight_Telefon_Flur/ subT MoodLight_Telefon_Flur/+/set
Info:MQTT:MQTT_RegisterCallback called for bT bekens/ subT bekens/+/set
Info:MQTT:MQTT_RegisterCallback called for bT cmnd/MoodLight_Telefon_Flur/ subT cmnd/MoodLight_Telefon_Flur/+
Info:MQTT:MQTT_RegisterCallback called for bT cmnd/bekens/ subT cmnd/bekens/+
Info:MQTT:MQTT_RegisterCallback called for bT MoodLight_Telefon_Flur/ subT MoodLight_Telefon_Flur/+/get
Info:MQTT:MQTT_RegisterCallback called for bT tele/MoodLight_Telefon_Flur/ subT tele/MoodLight_Telefon_Flur/+
Info:MQTT:MQTT_RegisterCallback called for bT stat/MoodLight_Telefon_Flur/ subT stat/MoodLight_Telefon_Flur/+
Info:HTTP:DRV_SSDP_Init - no wifi, so await connection
Info:MAIN:Started SSDP.
Info:MAIN:Started Wemo.
Error:CMD:LFS_ReadFile: lfs is absent
Info:CMD:CMD_StartScript: failed to get file autoexec.bat
Info:MAIN:Main_Init_After_Delay done
Info:MAIN:Time 1, idle 272956/s, free 78336, MQTT 0(0), bWifi 0, secondsWithNoPing -1, socks 2/38
Info:MAIN:Time 2, idle 185348/s, free 78336, MQTT 0(0), bWifi 0, secondsWithNoPing -1, socks 2/38
Info:MAIN:Time 3, idle 187020/s, free 78336, MQTT 0(0), bWifi 0, secondsWithNoPing -1, socks 2/38
Info:MAIN:Time 4, idle 186892/s, free 78336, MQTT 0(0), bWifi 0, secondsWithNoPing -1, socks 2/38
Info:MAIN:Time 5, idle 185398/s, free 78336, MQTT 0(0), bWifi 0, secondsWithNoPing -1, socks 2/38
Info:MAIN:ssid:mySSID key:myKey
Info:MAIN:Time 6, idle 178546/s, free 71560, MQTT 0(0), bWifi 0, secondsWithNoPing -1, socks 2/38
Info:MAIN:Boot complete time reached (5 seconds)
Info:CFG:####### Set Boot Complete #######
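The log above shows the uptime counter jumping from 1797 back to 1 right after a low heap warning, which is how a restart shows up in a captured log. A minimal Python sketch for spotting such resets in a saved log follows. It assumes the status line keeps the format seen above; `find_reboots` and `STATUS_RE` are hypothetical helper names, not part of OpenBeken.

```python
import re

# Matches the periodic status line; only uptime and free heap are captured.
STATUS_RE = re.compile(r"Info:MAIN:Time (\d+), idle \d+/s, free (\d+)")


def find_reboots(log_text):
    """Return (uptime_s, free_heap) seen just before each uptime reset."""
    reboots = []
    previous = None
    for match in STATUS_RE.finditer(log_text):
        current = (int(match.group(1)), int(match.group(2)))
        # An uptime that goes backwards means the device restarted in between.
        if previous is not None and current[0] < previous[0]:
            reboots.append(previous)
        previous = current
    return reboots


# Two status lines taken from the log above.
sample = (
    "Error:MAIN:Low heap warning! "
    "Info:MAIN:Time 1797, idle 171397/s, free 23328, MQTT 1(1), bWifi 1, "
    "secondsWithNoPing 1714, socks 7/38 "
    "Info:MAIN:Time 1, idle 272956/s, free 78336, MQTT 0(0), bWifi 0, "
    "secondsWithNoPing -1, socks 2/38"
)
print(find_reboots(sample))  # [(1797, 23328)]
```

For this sample the last reading before the restart is 23328 bytes free at 1797 seconds of uptime, which matches the low heap warning.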

openshwprojects commented 10 months ago

Hello, which versions of OBK? How many devices do you have? We should try to narrow the issue.

PS: As a mitigation, you can use "remember last state" for sockets, at least until we fix it

CLAM01 commented 10 months ago

Hi, I have 9 LSC Mood Lights with BK7231N and 4 LSC Smart Plugs with BK7231T. All devices are affected. What do you need to debug?

The versions: 1.17.287 (T), 1.17.186 (T) (N), 1.17.185 (T), 1.17.200 (N), 1.17.247 (N), 1.17.225 (N), 1.17.201 (N), 1.17.195 (T), 1.17.196 (N), 1.17.265 (N), 1.17.261 (T)

Various versions in production.

openshwprojects commented 10 months ago

And all your devices have slowly decreasing heap size?

CLAM01 commented 10 months ago

Yes, all the devices I looked at today started running out of memory at some point.

Some have been quiet for 18 hours now, some are just restarting, some have been quiet for 5 hours.

I can't say what it could be at the moment because it's not cyclical, just sporadic.

Currently all devices are working without reboots or heap warnings.

Edit: I'm currently observing a device where the heap slowly went down and now goes back up again.

Could it possibly be the garbage collector?

CLAM01 commented 10 months ago

I see that whenever the free heap drops, the socks count goes up.

Unfortunately, I don't know what these socks are or what the other values mean.

Info:MAIN:Time 31224, idle 249599/s, free 78832, MQTT 1(44), bWifi 1, secondsWithNoPing -1, socks 3/38
Info:MAIN:Time 31225, idle 248134/s, free 69128, MQTT 1(44), bWifi 1, secondsWithNoPing -1, socks 4/38

And here is the log from before:

Error:MAIN:Low heap warning!
Info:MAIN:Time 1797, idle 171397/s, free 23328, MQTT 1(1), bWifi 1, secondsWithNoPing 1714, socks 7/38
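The observation above can be turned into a rough number. The following Python sketch estimates how much free heap disappears per additional open socket, using only values from the posted status lines. It assumes the first number in `socks a/b` is the count of currently open sockets; `heap_per_socket` is a hypothetical helper name.

```python
import re

# Captures free heap and the first number of the "socks a/b" field.
STATUS_RE = re.compile(r"free (\d+), .*?socks (\d+)/\d+")


def heap_per_socket(before, after):
    """Average drop in free heap per additionally opened socket, or None."""
    free_a, socks_a = map(int, STATUS_RE.search(before).groups())
    free_b, socks_b = map(int, STATUS_RE.search(after).groups())
    opened = socks_b - socks_a
    if opened <= 0:
        return None  # no new sockets between the two readings
    return (free_a - free_b) / opened


# Consecutive readings, one second apart (socks 3 -> 4).
step = heap_per_socket(
    "Info:MAIN:Time 31224, idle 249599/s, free 78832, MQTT 1(44), bWifi 1, secondsWithNoPing -1, socks 3/38",
    "Info:MAIN:Time 31225, idle 248134/s, free 69128, MQTT 1(44), bWifi 1, secondsWithNoPing -1, socks 4/38",
)
# Boot-time reading versus the low heap warning on the mood light (socks 2 -> 7).
crash = heap_per_socket(
    "Info:MAIN:Time 5, idle 185398/s, free 78336, MQTT 0(0), bWifi 0, secondsWithNoPing -1, socks 2/38",
    "Info:MAIN:Time 1797, idle 171397/s, free 23328, MQTT 1(1), bWifi 1, secondsWithNoPing 1714, socks 7/38",
)
print(step)   # 9704.0
print(crash)  # 11001.6
```

Both estimates land around 10 to 11 KB per socket. The second pair compares a reading taken before WiFi and MQTT were up with one taken much later, so part of that difference is probably not socket-related and the figure is only a rough cross-check.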

openshwprojects commented 10 months ago

I have seen this issue once. It was somehow related to the RF partition. I think that it had been fixed by clearing RF partition at the expense of worse WiFi range. You can make a backup and try.

I don't know what is causing that, but basically the sockets are left open forever and HTTP buffers are taking space.

I could try to work on that but I am unable to reproduce right now.

openshwprojects commented 10 months ago

Could you try with another router so I can at least know whether it's a per-device fault or per-router?

openshwprojects commented 10 months ago

Or do you maybe have some kind of bot, or scanner, that's opening ports on LAN?

CLAM01 commented 10 months ago

How do I delete the RF partition? How do I back up the device / RF partition?

I have a mesh with 1 master router and 3 slave routers. The devices are connected to different master/slave routers (WiFi).

image

I can observe whether devices only run into heap problems on a specific mesh device.

Why would I need a bot or scanner to open ports on the LAN?

CLAM01 commented 10 months ago

> I can observe whether devices only run into heap problems on a specific mesh device.

No, the repeaters in my mesh are irrelevant. The same device reboots after running out of heap whether it is on my FritzBox (mesh master) or on the FritzRepeater 600 (mesh slave).

openshwprojects commented 10 months ago

So it's a problem specific to one of the routers? This is as I expected, but it is also not good. I am not sure if RF clear trick can help here.

CLAM01 commented 10 months ago

No, the device restarts because of the heap, regardless of whether it is connected via WLAN to my mesh master (Fritzbox) or to one of the access points (mesh slaves).

Today I can only observe one device that constantly runs into the heap after 5-10 minutes.

I will continue to monitor.

Do you have any ideas on how I could get to the bottom of this further?

CLAM01 commented 10 months ago

Can you explain to me what kind of sockets are used? Are these all HTTP sockets for SSDP?

I watched all day today, but no device failed today.

The number of open sockets went up to 7-8, but they were then closed again?

Would a shortened TTL (TimeToLive), which could be configurable if necessary, help here or a maximum number of parallel open sockets?

I'm a bit at a loss as to what it could be. I turned my network upside down, analyzed which devices were connected (PC etc.), but I can't find any connection.

Does it make sense to do a trace? You can create a packet capture on the Fritzbox via fritz.box/#cap.

openshwprojects commented 10 months ago

I have encountered this issue once before, I remember it was caused by something in internal RF partition, but I am not sure.

It seemed like the HTTP request handler thread was stuck at some TCP operation, so the buffer allocated for HTTP was never freed and memory usage kept increasing.

I don't know why it doesn't have a working timeout. Maybe an oversight in the LWIP library or something.

I was not able to reproduce it later so I left it as is....

The problem happens here: https://github.com/openshwprojects/OpenBK7231T_App/blob/main/src/httpserver/http_tcp_server.c

This is where memory is allocated:

image

I am not able to reproduce the problem right now so it's hard for me to check where exactly it gets stuck...
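The theory described above (a handler blocked in a TCP operation that never returns, so its buffer is never released) is usually countered with a receive deadline plus unconditional cleanup. The Python sketch below only illustrates that pattern on a desktop; it is not the firmware's C code, and `handle_client` is a hypothetical name. lwIP's sockets API offers a comparable receive-timeout option (`SO_RCVTIMEO`, if enabled in the build), but whether OpenBeken uses it is not stated in this thread.

```python
import socket


def handle_client(conn, timeout_s=5.0, bufsize=1024):
    """Read one request with a deadline and always release the connection."""
    conn.settimeout(timeout_s)
    try:
        data = conn.recv(bufsize)
        return data or None
    except socket.timeout:
        # The peer opened a connection but never sent anything.
        return None
    finally:
        # Runs on success, timeout, and error alike, so the socket
        # (and any buffer tied to it) cannot be left open forever.
        conn.close()


# Demo: a connected pair where the peer stays silent.
server_side, client_side = socket.socketpair()
result = handle_client(server_side, timeout_s=0.2)
client_side.close()
print(result)  # None
```

Without the timeout, `recv` in this demo would block indefinitely, which is the desktop analogue of the stuck handler suspected here.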

CLAM01 commented 10 months ago

Do you open one HTTP socket for each SSDP UDP search?

openshwprojects commented 10 months ago

I am not sure what you are asking about. SSDP works over UDP, as far as I remember. It does not go through our HTTP. Hmm...

Furthermore, all HTTP packet receive events should just send the reply, close the socket, and free memory.

I think there might be some problem with underlying LWIP SDK... maybe indeed that keepalive time setting is wrong.

CLAM01 commented 10 months ago

Yes, SSDP works over UDP port 1900. I have analysed some Wireshark traces, and sometimes I see a burst of SSDP requests to the SSDP multicast address 239.255.255.250 at the same time.

image

All TCP connections are OK with FIN ACK -> ACK. But this is only the TCP and HTTP communication between the device and my PC for the log in the web UI.

All MQTT publishing traffic is OK (Publish, Received, Release, Complete, ACK).

image

I am trying to reproduce the problem, but it's not easy. Currently I have disabled the Wemo driver on this Beken device.
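One way to try reproducing the bursts seen in the capture is to generate SSDP discovery requests by hand. The Python sketch below builds a standard M-SEARCH datagram for the multicast address and port mentioned above. `build_msearch` is a hypothetical helper, and whether such a burst actually triggers the heap drop is untested here.

```python
import socket

# SSDP multicast group and port, as seen in the capture above.
SSDP_ADDR = ("239.255.255.250", 1900)


def build_msearch(search_target="ssdp:all", mx=1):
    """Build an SSDP M-SEARCH request as bytes."""
    lines = [
        "M-SEARCH * HTTP/1.1",
        f"HOST: {SSDP_ADDR[0]}:{SSDP_ADDR[1]}",
        'MAN: "ssdp:discover"',
        f"MX: {mx}",             # seconds a device may wait before replying
        f"ST: {search_target}",  # search target; "ssdp:all" asks everything
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def send_burst(count):
    """Send `count` M-SEARCH datagrams back to back (not called here)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        message = build_msearch()
        for _ in range(count):
            sock.sendto(message, SSDP_ADDR)


print(build_msearch().decode("ascii"))
```

Calling `send_burst(20)` from a PC on the same LAN while watching the `free` and `socks` values in the device log would show whether discovery bursts alone are enough to exhaust the heap.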

openshwprojects commented 10 months ago

Ok disable SSDP and report back but I think it's the HTTP fault strictly because, as I said, to the best of my knowledge SSDP is using separate UDP socket, and the buffers are allocated in HTTP code. Unless... some kind of final stage of SSDP takes place over HTTP.

The separate issue is WHY this socket stays on forever. I will look into LWIP settings.

CLAM01 commented 10 months ago

OK, I have disabled SSDP. I will watch and report back.

CLAM01 commented 10 months ago

OK, SSDP (the only active driver) is disabled and the count is stable at 2 open socks.

CLAM01 commented 10 months ago

Next strange thing: I have disabled all drivers on the device via the web app, but Alexa can still interact with this device.

openshwprojects commented 10 months ago

Have you restarted?

CLAM01 commented 10 months ago

No, the UI said the drivers are disabled. The socks count dropped after disabling the drivers, without a restart, and it now stays constant.

openshwprojects commented 10 months ago

You need to restart. It's not even a bug, it's just that some cleanup functions are not implemented. I can add them for you if you think they could be useful.

CLAM01 commented 10 months ago

No, it's OK, I think there are many more useful things to implement. It's a great job you do! Thanks.

CLAM01 commented 10 months ago

What version of lwIP did you integrate? I see it's 2.1.3.

2.2.0 is the latest stable release. Is there hope that the problem is fixed in 2.2.0?

image

CLAM01 commented 10 months ago

Am I correct that a similar, if not the same, issue was addressed in lwIP 2.1.3? Problems with downgrading the closed sockets.

image

divadiow commented 8 months ago

did you ever get to the bottom of this? I've only done two UK Tuya smart sockets to OpenBeken so far and they both restart at random intervals. One has power monitoring and the other does not. Both are on 1.17.366

CLAM01 commented 8 months ago

I have tried to find the problem but found no cause. I have seen that disabling SSDP mitigates the socket problem, but this is not a solution: you need SSDP to use Wemo.

CLAM01 commented 8 months ago

Today: 10 restarts on 1 socket. I have shut down the devices and will restart them on a release with a fix. It's not good for the devices that are connected to these sockets.

divadiow commented 8 months ago

odd there isn't wider reporting of the issue though. there must be LOADS of people that have OpenBKd their smart plugs. I don't see any threads on Elektroda for this problem.

openshwprojects commented 8 months ago

I'm sorry to hear that you still experience this problem. Well, @divadiow @CLAM01, the issue is simple, look:

image

This is my T smart plug, which has been running for 17 days straight with SSDP, NTP, BL0937, etc. without restarts...

image

I run multiple OBK devices and I don't seem to have that problem. As far as I remember, we had an LWIP update by @valeklubomir one year ago and it supposedly fixed most of the instabilities.

Maybe we indeed need next LWIP update, but many people have tested so far and it was stable so it's strange that issue manifests for you suddenly..

Do you have C experience @CLAM01 ? Maybe we should really try the new LWIP and see if it helps on your side

divadiow commented 8 months ago

I could also setup a couple of plugs known to reboot on a segregated independent ssid to see how they behave

CLAM01 commented 8 months ago

Today 80 restarts in total across all devices. I don't have experience in C or any other programming language. The firmware I use is less than 1 year old. I can't locate the problem. I use smart lights and plugs from the LSC Action store with only SSDP and Wemo enabled. The restart period today is 10 minutes. The smart plugs and smart lights are integrated in Alexa.

Some days there are no restarts, and some days, like today, the devices restart constantly. I don't know how I can search for the root of the problem.

By now the first device connected to one of my smart plugs is broken: too many restarts and voltage peaks when the relay disconnects.

CLAM01 commented 8 months ago

Yes, you can integrate the new lwIP in a release and I will update all my devices and see if the problem is still there.

openshwprojects commented 8 months ago

Was it always like this, or did it start to happen some time ago?

Sun, 7 Jan 2024 at 17:34, Christoph Lammers @.***> wrote:

Yes, you can integrate the new lwip in a release and I will update all my devices an see if the problem still alive.


CLAM01 commented 8 months ago

I think it's always been that way. But I only noticed it when I connected devices to the smart plugs, such as televisions or lamps. I have also installed different firmware versions. All of them show the same behavior; see the list above. I found out that it stops when I turn off SSDP, but then I can no longer use the Alexa integration. A scan of the network traffic showed no abnormalities. I can't think of anything else I can test. Today all my devices are going crazy: 13 devices restart constantly. I am grateful for every tip and would be happy to help solve the problem.

CLAM01 commented 8 months ago

Since all the lamps and smart sockets are crazy these days, I deactivated SSDP on one of them again. Now it should remain stable. I only have to wait 5 minutes.

divadiow commented 8 months ago

@openshwprojects shouldn't OBK have the latest lwip anyway or is it not that simple? Are there breaking changes that require other changes within OBK?

CLAM01 commented 8 months ago

I have now done another test. I restarted all Amazon devices: 4x Echo Dot and 1x Echo Studio. It looks as if the smart devices have now become quieter, but not completely. Restarts are fewer but not gone. Could it possibly be a problem with the Wemo driver?

I have watched the last 15 minutes. No more 10 restarts on one device, only 1 or 2. OK, still too many for 15 minutes, but I think there is a connection between Wemo and Alexa.

@divadiow do you also have Alexa devices in your network and use the SSDP and Wemo drivers?

CLAM01 commented 8 months ago

Good morning all. After rebooting all my Alexa devices, no more restarts have occurred. I think Alexa and Wemo are the problem? All devices have been online for 9h.

divadiow commented 8 months ago

> I have now done another test. I restarted all Amazon devices: 4x Echo Dot and 1x Echo Studio. It looks as if the smart devices have now become quieter, but not completely. Restarts are fewer but not gone. Could it possibly be a problem with the Wemo driver?
>
> I have watched the last 15 minutes. No more 10 restarts on one device, only 1 or 2. OK, still too many for 15 minutes, but I think there is a connection between Wemo and Alexa.
>
> @divadiow do you also have Alexa devices in your network and use the SSDP and Wemo drivers?

yes. I have 3 Alexa devices and have WEMO and SSDP drivers running on my smart plugs.

I also noticed this morning that my newly flashed LED controller kept going off. It seems to be rebooting too, even at lowest dim setting in case I was overloading it. https://www.elektroda.com/rtvforum/topic4026945.html

I still haven't investigated this with any segregation away from Alexa devices or the rest of the network.

CLAM01 commented 7 months ago

After rebooting all my Alexa devices, all smart devices are currently stable. I think Alexa sends too many requests at once to the smart devices, and the result is a heap size error on the smart devices.

CLAM01 commented 7 months ago

@divadiow, did you test whether restarting your Alexa devices mitigates the rebooting problem?