sblantipodi / glow_worm_luciferin

Bias Lighting and Ambient Light firmware, designed for Firefly Luciferin.
GNU General Public License v3.0

Random connection lost during usage #18

Closed patrick-blom closed 3 years ago

patrick-blom commented 3 years ago

Firefly Luciferin version

1.12.9

Glow Worm Luciferin version

4.9.9

Describe the bug

Setup

This bug occurs in a dual monitor setup which uses two ESP8266 with the full firmware version. The monitors are named left display and right display. Both monitors are connected through MQTT and use different topics for each monitor. If I start up Firefly the order is as follows:

The default color for the LEDs is configured to blue for both devices, and the OS is Debian Buster 10.9.

The Bug

Randomly, mostly after at least 15-25 minutes, the right monitor loses the connection to Firefly. The first indicator is that the LEDs switch to the default color (blue in my case) and then turn off. After a couple of seconds the esp disappears from the device tab and shows up again. Then an out-of-sync message pops up and advises me to decrease the FPS. Even if I decrease the FPS, the device does not start working again until I turn off the power and restart Firefly.

[Screenshot from 2021-06-02 16-47-58]

[Screenshot from 2021-06-02 16-48-20]

During this whole bug the other esp on the left monitor works as expected.

Logs

To provide you as much information as possible, I activated the debug mode in Firefly and reproduced the bug.

FireflyLuciferin.log

Side information

My first thought was that my MQTT server could be the problem. So I tested it with both instances on the same topic, but the bug still occurs. Even restarting the server does not solve the problem.

Personal guess

To me it looks like the esp sometimes turns off wifi and then reconnects.

If you need any further information feel free to ping me!

-- Cheers Patrick

sblantipodi commented 3 years ago

@patrick-blom I'm not used to such "quality issues". 😄 thank you for such a detailed issue.

I'll investigate the problem. What MQTT server are you using? Does it run on a Raspberry Pi 3 or 4? Can you send me the logs of the MQTT server from when the problem happens?

patrick-blom commented 3 years ago

@sblantipodi sorry 🙈 I'll try to describe it worse next time 😄

The MQTT server is Mosquitto, version 1.6.9-1. It runs on something like a Raspberry Pi 3: a Pine A64+, which is just another single-board computer, running Ubuntu. It has been very reliable over the last years.

I was able to grab the log file from the MQTT server during the problem. For better understanding, here is some side information:

All devices use static IPs

mosquitto.log

sblantipodi commented 3 years ago

mmm... to try to understand the cause of the problem, I suggest lowering the framerate on both instances; 15 FPS should be ok.

then simply create a custom.conf file in your /share/mosquitto folder (that is the folder if you are using Home Assistant), or wherever your installation keeps its configuration.

put this into the custom.conf file:

set_tcp_nodelay true

please restart mosquitto and try again. thanks.
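For context: Mosquitto's `set_tcp_nodelay` option sets TCP_NODELAY on client sockets, which disables Nagle's algorithm so that small packets are sent immediately instead of being coalesced. The sketch below shows the same socket option from Python, purely as an illustration. It is not Mosquitto's code, and the helper name `make_nodelay_socket` is hypothetical.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled (TCP_NODELAY)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Send small writes right away instead of waiting to batch them.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


if __name__ == "__main__":
    s = make_nodelay_socket()
    # A non-zero value means the option is active on this socket.
    print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)
    s.close()
```

The trade-off is lower latency per message at the cost of more, smaller packets on the network.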

patrick-blom commented 3 years ago

I reduced the Framerate to 20FPS and set the option in the config. I will test it today and report if the problem still occurs.

I just noticed that there are also some Java logs in my home directory. I hope that's useful information.

hs_err_pid2983.log hs_err_pid17083.log hs_err_pid20291.log hs_err_pid20737.log hs_err_pid26168.log hs_err_pid30706.log

Cheers Patrick

sblantipodi commented 3 years ago

ok thanks @patrick-blom, are there any errors in the MQTT server log?

patrick-blom commented 3 years ago

I tested the new config for about 8 hours, but sadly it doesn't solve the problem. There are still error messages in the log file and the problem still occurs :-(

The following errors are recorded in the mosquitto log:

1622094343: Socket error on client FireflyLuciferinLinux_-2027816021, disconnecting.
1622094344: Socket error on client FireflyLuciferinLinux_1780893546, disconnecting.
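The leading numbers in these log lines look like Unix epoch seconds, which is Mosquitto's default timestamp format. A minimal sketch for converting them to readable UTC times, assuming that default format; the function name `parse_mosquitto_line` is hypothetical:

```python
import re
from datetime import datetime, timezone

# "<epoch seconds>: <message>", as in the log excerpt above.
LOG_LINE = re.compile(r"^(\d+): (.*)$")


def parse_mosquitto_line(line: str):
    """Return (UTC datetime, message), or None if the line does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    timestamp = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
    return timestamp, match.group(2)


if __name__ == "__main__":
    line = ("1622094343: Socket error on client "
            "FireflyLuciferinLinux_-2027816021, disconnecting.")
    when, message = parse_mosquitto_line(line)
    print(when.isoformat(), message)
```

Readable timestamps make it easier to line up broker errors with the moments the LEDs dropped out.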

patrick-blom commented 3 years ago

Short update. The problem is still present, but not as often as before, and the way to recover from it has changed. Before, I had to restart the whole Luciferin application; now the device disappears from the device tab and respawns after a couple of seconds. Then I can restart the capturing and it works again.

Sometimes it takes longer to respawn, and then the FPS message appears.

Cheers Patrick

sblantipodi commented 3 years ago

@patrick-blom the thing that I don't understand is why it happens on the GW_ESP8266_Samsung and not on the GW_ESP8266_Acer.

is there something different between those ESPs? do they both have a good wifi signal?

reading your mosquitto log, it seems that your MQTT server isn't able to deliver all the in-flight messages (stored in memory) and saves them to the db for later delivery. it seems that this breaks the socket and causes the error.

the ESP is probably crashing, but I still need to understand why one ESP crashes and not the other. are you sure that the second ESP is connected to a good and reliable power supply and has a good wifi signal?

As a temporary workaround you can try using a USB cable to send the video signal from firefly to glowworm. This can be done by disabling the mqtt stream in the settings; you then need to select the right serial port in the appropriate settings tab.

still investigating the issue...

patrick-blom commented 3 years ago

@sblantipodi That's also something I'm confused about.

Power handling should be no problem because it's connected to a 10A supply.

I'll investigate the wifi setup and try the usb option you mentioned.

Thx for your support!

patrick-blom commented 3 years ago

@sblantipodi Ok, now it's getting strange 🤔

What I've done:

Now the second Firefly instance (right monitor) sometimes dies, restarts, loses both devices and then only finds the Samsung ESP 🙈 So it might not be Glowworm, maybe it's Firefly, I don't know. Currently I'm totally confused 😂

[Screenshot from 2021-06-09 08-23-18]

I'll investigate further.

Next steps:

sblantipodi commented 3 years ago

@patrick-blom I'll try to explain how the devices tab works.

That tab shows all the Glow Worm Luciferin devices that aren't capturing the screen. When the screen capture starts, the GW firmware stops sending its information to all the firefly instances and sends it only to its own firefly instance. (this is needed to improve performance)

it's not a problem if you don't see all the glow worm devices in all the firefly instances; the only device that MUST be present all the time in the devices tab is the glow worm device in use by that firefly instance. if you stop the screen capture, all devices should appear in all the firefly instances within a few seconds.

if you don't want to use a USB cable, just don't try it. it is only a temporary workaround, and we will find a real solution to the problem 😃

patrick-blom commented 3 years ago

@sblantipodi Ah ok, got it. I'll have a detailed look at the device tab during the day. Maybe I can find some pattern that helps to find the root cause of the problem.

I think I'm able to replace the esp in the upcoming days; if that fails I'll try the usb variant. Yes, I don't like breaking my cable management, but I want to find the root cause even more 😄 So if that means some extra work for me, well, that's the way to go 🤷 no one said it would be easy 😄

sblantipodi commented 3 years ago

Thank you @patrick-blom, I really appreciate your work here :)

patrick-blom commented 3 years ago

@sblantipodi I replaced the esp, but sadly the problem is the same.

Next up is the usb variant. One question regarding that topic: because I'm on an MQTT setup, my esp is currently powered through the 5v board pin, not through usb. Do I have to cut that connection in the usb setup, or is it safe for the esp to get 5v from both the usb port and the board pin?

In my case, of course, that would be 5v on the pin from the psu and 5v through the usb port.

I guess it could end up in magic smoke 😄

sblantipodi commented 3 years ago

@patrick-blom I cut the 5V in my usb setup. in theory it should be safe to leave it connected, but I preferred to cut it anyway :)

before cutting the wires, please do one last test.

please raise the framerate on both ESP8266 to 1 million. the ESP will reboot and will take some time to reconnect. then lower the framerate to 10FPS on both Firefly instances.

Can I ask what router do you have? Do you have 802.11ax / Wi-Fi 6 mode enabled on the 2.4GHz?

I noticed some corruption in the wifi communication between my ESP and the router while the "WiFi6 mode" is enabled on the router on 2.4GHz.

sblantipodi commented 3 years ago

another thing to try: if you can, disconnect every other MQTT device connected to your Mosquitto server, apart from the luciferin ones obviously.

patrick-blom commented 3 years ago

@sblantipodi ok, I'll check it. But how do I set the framerate on the esp 😃?

Through a MQTT command I guess, right?

I have this kind of router: https://avm.de/produkte/fritzbox/fritzbox-7590/. I don't know if AVM sells it in other countries, but in Germany it's decent stuff that everyone buys. Anyway, my router does not support wifi 6, so I don't think that's the problem.

Sadly, I'm not able to remove all the devices from that server because I live in a smart home packed with sensors, led stuff, etc. But I think I can set up another MQTT server with the same version. That would be a clean environment for more investigation.

It will take some time, but I can handle it.

I'll keep you up2date 😅

sblantipodi commented 3 years ago

@patrick-blom you can set the baud rate of the ESP via Firefly Luciferin software. Just open settings -> Mode Tab -> Select 1 million baudrate, save and close.

@patrick-blom the ESP will reboot in about 1 minute.

I know FritzBox, it's a quality router, it should not be the problem. Trying another MQTT server would be very useful.

sblantipodi commented 3 years ago

@patrick-blom If the esp hangs because of a problem with the WiFi/MQTT connection, both the esp and firefly are reset automatically and the screen capture then restarts on its own. It takes a while, but the logs say that it is rebooting. When the hang happens, have you tried waiting to see whether the screen capture restarts automatically? It could take up to 90 seconds.
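The recovery behaviour described above resembles a heartbeat watchdog: if nothing is heard from the device for a fixed time, a reset is triggered. The following is only a rough sketch of that idea, not Firefly's or Glow Worm's actual implementation. The class name `HeartbeatWatchdog` is hypothetical, and the 90-second default is taken from the figure mentioned in the comment.

```python
import time


class HeartbeatWatchdog:
    """Reports expiry when no heartbeat arrives within timeout_s seconds."""

    def __init__(self, timeout_s: float = 90.0, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock  # injectable so the logic can be tested without waiting
        self._last_seen = clock()

    def beat(self) -> None:
        """Record that the device was just heard from."""
        self._last_seen = self._clock()

    def expired(self) -> bool:
        """True if the device has been silent for longer than the timeout."""
        return self._clock() - self._last_seen > self._timeout_s


if __name__ == "__main__":
    watchdog = HeartbeatWatchdog()
    watchdog.beat()
    print(watchdog.expired())
```

A caller would invoke `beat()` on every message from the device and poll `expired()` to decide when to restart the capture.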

patrick-blom commented 3 years ago

@sblantipodi I tested 1000000 baud during the weekend. During that time I did not notice a single outage. I'll do a longer test today. I'll also have a look at the 90-second reboot 😊

Thx a lot!

patrick-blom commented 3 years ago

@sblantipodi Hold your beer, it seems to be fixed 😅 I have tested the 1.000.000 baud rate at 10 FPS for about 8h now and I had not a single outage. The next step is to increase the FPS; we will see if the fix still works.

Do you know why increasing the baud rate fixes the problem?

Cheers Patrick

patrick-blom commented 3 years ago

Today I tried 15FPS, still no problem at all. Whoop Whoop =D

sblantipodi commented 3 years ago

@patrick-blom I'm sorry for the late reply. 15FPS is still a bit low, increase it if you want :D

increasing the baudrate helps the ESP stretch its legs. don't ask me why, but it's something I have always experienced on ESP devices. raising it too much can cause flickering on the LEDs, but 1 million is ok on CH340 devices like your D1 Mini. I think the root cause may be related to how the ESP manages interrupts/watchdogs; the baudrate changes this behaviour.

I experienced similar socket issues previously and the problem was related to two major factors:

1) the ESP is not able to keep up with firefly: firefly sends messages too fast and the ESP crashes. 30FPS is the recommended framerate because the ESP is fully able to handle it in "normal conditions".

2) weak wifi connection or crowded network. if the 2.4GHz network is too crowded, for example if there are a lot of IoT devices, the network doesn't work very well and some packets can be corrupted on the ESP side. this situation is handled in the firmware, and firefly automatically re-establishes the connection in about 90 seconds.
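The first factor is essentially a sender outpacing a slow receiver, which is why capping the framerate helps. As a rough illustration only (this is not how Firefly implements its FPS setting; `FrameLimiter` and its methods are hypothetical names), a sender-side cap could look like this:

```python
import time


class FrameLimiter:
    """Allows at most max_fps frames per second to be sent."""

    def __init__(self, max_fps: float, clock=time.monotonic):
        if max_fps <= 0:
            raise ValueError("max_fps must be positive")
        self._min_interval = 1.0 / max_fps
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_sent = None

    def try_send(self) -> bool:
        """Return True if a frame may be sent now, False if it should be dropped."""
        now = self._clock()
        if self._last_sent is None or now - self._last_sent >= self._min_interval:
            self._last_sent = now
            return True
        return False


if __name__ == "__main__":
    limiter = FrameLimiter(max_fps=30)
    print(limiter.try_send())  # the first frame is always allowed
```

Dropping frames that arrive too early, rather than queueing them, keeps the receiver from accumulating a backlog it can never work off.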

setting the channel width of your 2.4GHz network to 20MHz on your router helps a lot. 40MHz is not good on crowded networks, 80MHz is pretty pointless.

patrick-blom commented 3 years ago

@sblantipodi No problem :)

I guess the second problem could fit my case very well, because I'm already on 20MHz but my average count of active wifi devices is about 40 xD. But these are spread over the 2.4 and 5 GHz bands.

But anyway, thx for the insights and your support! I'm very happy that it actually works again =)

Feel free to close the issue =)

Thx a lot Davide! Thx a lot!

sblantipodi commented 3 years ago

thanks to you for the detailed report! :) feel free to reopen it in case there is some other weirdness...