letscontrolit / ESPEasy

Easy MultiSensor device based on ESP8266/ESP32
http://www.espeasy.com

Wifi issues -never ending story- go back to non event based wifi? #1302

Closed TD-er closed 5 years ago

TD-er commented 6 years ago

As a lot of you have noticed the last few weeks, there have been lots of issues with the wifi. This all started when I changed the way wifi operates to be event based.

Some of these errors are core version related, and the update to core 2.4.0 introduced lots of other issues. And then there is the problem with corrupted settings, which also occurred in this period. That wasn't related to the event based wifi connect, but it made me look for a lot of other issues that were not really issues at all, just corrupted settings.

So at the moment the wifi state machine I wrote is overly complex due to the many fixes that were no fixes, because things were not broken. And still there are other real issues, either caused by core 2.4.0 or still open wifi issues.

So now we have to choose:

  1. Go back to sloooowwww but stable wifi (still some issues then with MQTT when connection is lost)
  2. Invest some more time to get event based wifi just right + try to get core 2.4.1 working.
  3. Invest some more time to get event based wifi just right, but still go back to core 2.3.0
  4. Some intermediate solution to do async wifi with core 2.3.0

Core 2.3.0 does seem to give a lot less issues and leaves more free memory. So I guess that's my preferred base. This means that for event based wifi, there is still some issue with respect to loading the setup page when initial config is needed.

Anyway, this has to stop now and get stable again. There are currently way too many issues at hand that are quite hard to see as separate issues.

Any other suggestion?

Budman1758 commented 6 years ago

I really can't speak to this from a programming level, but from what I have been seeing, with the exception of the static IP address thing, the wifi seems to work fine when setting up a "brand new" unit. I have not seen any connection issues with "fresh" installs with the latest firmwares. Web pages load fast and the entire thing seems fast and responsive. It's when you try to upgrade that most of these issues seem to happen. Seems like there is a corruption issue when upgrading to a newer firmware.

I also notice that a lot of user compiled firmwares seem to be having wifi issues. Just from reading through all these issue posts I get that impression. I could be completely wrong about that though. I am not trying to say that as a fact, but just a possibility.

I can't speak to MQTT because I don't use it.

Just my 2 cents worth.....

Grovkillen commented 6 years ago

If you are leaning towards option 3 I support you fully. I'd hate to see us drop the improvements your event based WiFi has given us. Core 2_4_x might be easier to revert/go to up stream?

DittelHome commented 6 years ago

From the perspective of a user: I would go ahead and use the new Core 2.4.1 as soon as possible. The users can always use older versions.

DittelHome commented 6 years ago

Don't forget, core 2.4.x fixes some problems: PWM flicker is history (#1156 is fixed with core 2.4.0), and Serial with large packets is also fixed... At some point we have to make the transition to the new core. Returning to 2.3.0 only postpones the problem. In the end we have to do the work anyway. My ESPs are definitely better with 2.4.0.

Grovkillen commented 6 years ago

As I see it, core 2_4_x will happen, but not necessarily right now. We made a bad decision when we went ahead with the core update and the event based wifi approach at the same time. We should have done them one after another. When we then, at the same time, had an update in the global settings, the problem got exceptionally hard to pinpoint. I strongly support the idea of going back to 2_3_0 while fixing wifi stability and settings corruption.

After that we can hopefully release the v2.1.0 and then focus on getting core 2_4_x stable for v2.2.0

melwinek commented 6 years ago

After clearing the settings and uploading the version from 22.04, everything is working so far. At least for now :) Only free memory is not enough, even in NORMAL. We'll see how it goes.

giig1967g commented 6 years ago

I have to agree with @Budman1758 and @melwinek : I also found that starting from a clean unit there are no problems at all with Wifi, static IP and settings. The main issue is the fact that to upgrade I now need to manually clean all the units, reflash them and rebuild their configuration.

Grovkillen commented 6 years ago

I guess we should not forget that officially we're still in the process of going from stable R120 to stable 2.1.0, and settings will not be converted between these two releases, so you need to start from scratch anyway. What we did with the update to core 2_4_x was to make a "break point" yet again. If we can live with that then it's not a problem. I agree that a clean install is really stable (at least on NORMAL, which I test most frequently). And NORMAL is the only part which will actually be in the release; test and dev are only in the development nightly release anyway.

giig1967g commented 6 years ago

What I mean is: if the current developed firmware works and is stable on a clean setup, then it means that there is nothing wrong with it. I wouldn't go back to 2.3 or to old wifi.

Grovkillen commented 6 years ago

Yes I hear you and I kinda agree. The only thing is that we create another break point which I guess is okay since it's still beta.

ghtester commented 6 years ago

Although it is a step back that I do not like, I'm afraid that it would be really better to go back to core 2_3_0 for now as I think some strange issues may happen due to lack of free memory on 2_4_0.

Budman1758 commented 6 years ago

@giig1967g I agree with you there. I do believe there are some corruption issues going on though. That might be what's screwing up the wifi, rather than the wifi itself having a lot of inherent problems.

TD-er commented 6 years ago

There are still options to get memory usage to an acceptable point. I think I can get about 3 - 4 kB more memory in like 1 evening of programming (I need to change all plugin files though). And MQTT import is also something which is really a pain and should be resolved soon. And the Switch plugin has too much functionality in itself, which should be split.

I will think about it today, what we should do, so please add more suggestions/arguments :)

melwinek commented 6 years ago

@TD-er You're right about SWITCH. Most people use only ON / OFF for a switch / relay. And in this plugin there is a servo, a dimmer and probably something else. Those could be separate.

TD-er commented 6 years ago

It is also handling stuff very specific to MQTT and/or Domoticz. That should not be part of the plugin.

melwinek commented 6 years ago

@TD-er In many cases it would help me to compile the firmware myself after removing unnecessary plugins. Often I need only SWITCH, FHEM Controller and DHT. But after these adventures with settings I'm afraid to compile it myself. Especially after your post: https://github.com/letscontrolit/ESPEasy/issues/1292

M0ebiu5 commented 6 years ago

Did you take a look, how wifi is realized in other projects (eg tasmota)?

About memory: I told you 😄 I think there are way too many rarely used features in the core. The decision whether a core feature request gets implemented should be way more strict; right now it's a little like Christmas for everyone... Maybe a vote with a certain threshold would help.

If possible, the core should first be cleaned of those rarely used features (or they should be turned into plugins) and then optimized. Also, one could think about additional interfaces for plugins, to allow moving more functionality out of the core.

TD-er commented 6 years ago

@M0ebiu5 Agree. What should happen is that new features will be developed on a separate branch, then collect a few of them and merge those to a release candidate branch and test those. Then release and merge the used features to the master branch (or dev branch, or whatever you name it).

And one thing I learned is to ask twice about what is observed, what should be observed and what version is used. That will make things a lot clearer and lead to fewer mistakes. Part of that must be done in the code itself, to leave some kind of footprint that lets us see (and log) what software is used.
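
Such a footprint could look roughly like the sketch below. The `BuildInfo` struct, its field names and the `describeBuild` function are hypothetical and not part of ESPEasy; the idea is only to gather build tag, core version, lwIP variant and build flavor in one place, so that a single line can be logged at boot and quoted in every bug report.

```cpp
#include <cassert>
#include <string>

// Hypothetical record of what was compiled into a binary (not an ESPEasy type).
struct BuildInfo {
    std::string buildTag;    // e.g. "mega-20180425"
    std::string coreVersion; // e.g. "2.3.0"
    std::string lwipVariant; // e.g. "lwIP 1.4"
    std::string flavor;      // e.g. "normal", "test" or "dev"
};

// Format the footprint as one log line that can be pasted into an issue verbatim.
std::string describeBuild(const BuildInfo& info) {
    // A footprint without a build tag cannot identify the binary.
    assert(!info.buildTag.empty());
    return "build=" + info.buildTag +
           " core=" + info.coreVersion +
           " lwip=" + info.lwipVariant +
           " flavor=" + info.flavor;
}
```

In practice the values would be filled in at compile time by the build system, so a self-compiled image and a nightly build can be told apart from the log alone.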

Also plugins should be just plugins to interface a sensor to some output values. Maybe plugins that generate output (like displays) should not be used the same as the input ones. So we get something like:

But such a redesign will take quite some effort.

M0ebiu5 commented 6 years ago

@TD-er you are right, but i would make the changes in small steps - cause most parts are working stable and big changes could put this stability at risk.

New interfaces to the core are one possible way. They will not influence the current behavior and only new or heavily changed plugins will use them. It will take more time to transform to a clean architecture, but with a lower risk and the effort will also be spread over time.

TD-er commented 6 years ago

I agree that these changes should be done at ease. It is more a view on the redesign for the future.

melwinek commented 6 years ago

However, the node running the 22.04 build has lost its connection. Resetting the router does not help. An ESP reset will help, but I'm far away. So, the best version on my nodes is mega-20180410. Maybe because it's on core 2.3? Maybe a good solution would be to go back to 2.3 for some time after all?

TD-er commented 6 years ago

Nope, last night I saw the problem (in the code and happening at my own units). My nodes did not reconnect when they got a 'beacon timeout' error, which is quite a common reason to disconnect. It is a logic error in the code, but it was already past 1:30 am and I didn't want to fix it at that moment. It would certainly have been past nightly build time to fix it, so that didn't matter anymore ;)
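
The kind of logic error described here can be illustrated with a host-side sketch. It uses a hypothetical `WifiReconnector` class and is not the actual ESPEasy code; the reason codes 200 (beacon timeout) and 201 (no AP found) are the values the ESP8266 SDK is believed to use and should be treated as assumptions. The point is that the disconnect handler schedules a retry for every reason instead of only for a list of known ones, so a reason like beacon timeout cannot leave the node offline.

```cpp
#include <cassert>
#include <cstdint>

// Assumed reason codes (believed to match the ESP8266 SDK; verify against your core).
constexpr uint8_t REASON_BEACON_TIMEOUT = 200;
constexpr uint8_t REASON_NO_AP_FOUND    = 201;

enum class WifiState { Disconnected, Connecting, Connected };

// Hypothetical event-driven reconnect helper; not the ESPEasy implementation.
class WifiReconnector {
public:
    // Called from the "disconnected" event. It only records state, so it stays short.
    void onDisconnected(uint8_t reason) {
        state_ = WifiState::Disconnected;
        lastReason_ = reason;
        reconnectPending_ = true; // every reason triggers a retry, beacon timeout included
    }

    void onConnected() {
        state_ = WifiState::Connected;
        reconnectPending_ = false;
        attempts_ = 0;
    }

    // A failed attempt only makes sense while connecting; the retry stays pending.
    void onConnectFailed(uint8_t reason) {
        assert(state_ == WifiState::Connecting);
        onDisconnected(reason);
    }

    // Called from the main loop; returns true when a connect attempt should start now.
    bool poll(uint32_t nowMs) {
        if (!reconnectPending_ || state_ != WifiState::Disconnected) return false;
        if (attempts_ > 0 && nowMs - lastAttemptMs_ < backoffMs()) return false;
        lastAttemptMs_ = nowMs;
        ++attempts_;
        state_ = WifiState::Connecting;
        return true;
    }

    // Wait 1 s, 2 s, 4 s ... capped at 32 s between attempts.
    uint32_t backoffMs() const {
        if (attempts_ == 0) return 0;
        const uint32_t shift = attempts_ < 6 ? attempts_ - 1 : 5;
        return 1000u << shift;
    }

    WifiState state() const { return state_; }
    uint8_t lastReason() const { return lastReason_; }

private:
    WifiState state_ = WifiState::Disconnected;
    bool      reconnectPending_ = false;
    uint8_t   lastReason_ = 0;
    uint32_t  attempts_ = 0;
    uint32_t  lastAttemptMs_ = 0;
};
```

On a real node, `poll()` returning true would be the place to start the actual connect call, and the reason code would only be used for logging.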

s0170071 commented 6 years ago

related: #1064

micropet commented 6 years ago

I just flashed 6 devices with the current version ESP_Easy_mega-20180425_test_ESP8266_4096.bin.

I think with this version we have reached an absolute low point. All devices could not be reached in the network after a few hours.

TD-er commented 6 years ago

That's why I will -one way or the other- return to a working wifi version.

Maybe we should also remove today's build, just to prevent others from loading it.

melwinek commented 6 years ago

I would suggest to build now a new version 04.25 on core 2.3.0 and replace the current one :)

TD-er commented 6 years ago

I do not have control over the build server, and 2 versions with the same build number is never a good idea.

I know @Grovkillen can remove today's build.

Grovkillen commented 6 years ago

You think I should remove it @TD-er ? A new one will be built tomorrow.

TD-er commented 6 years ago

Apparently it is even worse compared to yesterday's build. So yeah, remove it.

Grovkillen commented 6 years ago

Done

TD-er commented 6 years ago

And platformio.ini is also changed. So whatever happens, tomorrow's build will not be as bad as today's.

s0170071 commented 6 years ago

I sense panic around here. Don't worry, there is still hope. First, let's have a 🍺 or 🍻 Now, thank you @TD-er for storming ahead so restlessly despite your worries, thank you @Grovkillen for supporting him. Thanks to all others as well. And thanks to everyone for discussing new ideas so openly.

That being said, let me say this:

  1. According to my experience there is nothing wrong with newer core versions except for 2.4.1 which has a wificlient memory leak (and a workaround).
  2. Older versions of the master branch up to the point when there was a decision to abandon 2.0 worked quite stable with those core versions. And fast.
  3. We really should (emphasis on really!) focus on stability. No new features for a while (unless they improve stability), less ESP32, less memory hunting, less speeding up, less everything but stability. Let's pretend we plan to fly to the moon. For real. That thing needs to operate. It needs to tolerate single bit failures, restarts, power fluctuations and temperature stress. I mean fail tolerant programming. Done it. It can be fun if you wrap your mind around it.
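
As one small illustration of point 3 applied to the corrupted settings: store a checksum next to the settings and fall back to defaults when it does not match. This is only a sketch under assumptions. `DemoSettings`, `StoredSettings` and the helper functions are hypothetical names and do not reflect the ESPEasy settings layout. The checksum is a plain bitwise CRC-32, which detects any single flipped bit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical settings layout, chosen so that the struct has no padding bytes.
struct DemoSettings {
    uint32_t version;
    uint8_t  ip[4];
    char     ssid[32];
};
static_assert(sizeof(DemoSettings) == 40, "layout must not contain padding");

// What would be written to flash: the settings plus a checksum over them.
struct StoredSettings {
    DemoSettings settings;
    uint32_t     crc;
};

// Bitwise CRC-32 (reflected polynomial 0xEDB88320), slow but tiny.
uint32_t crc32Ieee(const uint8_t* data, std::size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

uint32_t settingsCrc(const DemoSettings& s) {
    return crc32Ieee(reinterpret_cast<const uint8_t*>(&s), sizeof(s));
}

// Call before saving.
void seal(StoredSettings& stored) { stored.crc = settingsCrc(stored.settings); }

// Call after loading.
bool isIntact(const StoredSettings& stored) {
    return stored.crc == settingsCrc(stored.settings);
}

DemoSettings defaultSettings() {
    DemoSettings s;
    std::memset(&s, 0, sizeof(s));
    s.version = 1;
    return s;
}

// Never run with settings that fail the check; use known-good defaults instead.
DemoSettings loadOrDefault(const StoredSettings& stored) {
    return isIntact(stored) ? stored.settings : defaultSettings();
}
```

A check like this would not repair anything, but it would at least turn "strange wifi behaviour" into a clear "settings are corrupt" log message.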

What's next? If I were a core dev, I would opt for that JSON based config. ASAP. It seems to be the current root of evil, just like the memory intensive web server was a while ago.

DittelHome commented 6 years ago

"I sense panic around here. Don't worry, there is still hope. First, let's have a 🍺 or 🍻 Now, thank you @TD-er for storming ahead so restlessly despite your worries, thank you @Grovkillen for supporting him. Thanks to all others as well. And thanks to everyone for discussing new ideas so openly."

100 % agree !!!

I don't know if this error is already known: after one or two days, it seems that the webserver doesn't work anymore. MQTT publishing is still working. I am using the normal version from 04.22.

uzi18 commented 6 years ago

@TD-er personally i will vote for something new like 2.4.1 to try.

micropet commented 6 years ago

1+

susisstrolch commented 6 years ago

1+

TD-er commented 6 years ago

"I sense panic around here. Don't worry, there is still hope."

It is not panic, it is pure frustration ;) Thing is that I really test stuff here and within minutes (at least it does feel like that) builds fail even worse than before. I am used to programming against black boxes, and also to reverse engineering those black boxes. But this feels like the feedback I think I see in the logs is completely different from reality. Now it is clear there were several issues the last few weeks, due to bugs in core libraries, corrupt settings and a few changes I made that appeared to be related to some bugs in AP firmware.

And my personal opinion about software is that it should be rock solid stable and speed comes second. Over the last weeks the speed increase was OK, but the stability was worsening by the day, no matter what I tried.

So now the time has come to make a firm halt and focus on stability first. You could call that 'panic', but actually it is some kind of step back to really focus on what's going on. I now know a lot more about wifi than a month ago, so I should be able to make a well designed package. But that takes time and I really want to get to some point of stability and get some moment of ease in my head to make it work like it should. And then there is still a lot of room to make stuff even faster, because I've seen it connect even faster :) But that's for the next version.

About the rest of the main issues:

clumsy-stefan commented 6 years ago

For me (and probably just for me) moving to 2.4.1 or even GIT Core did not improve things, the opposite was the case. I tried about 20 different combinations of core version, mega commits and lwIP versions. Going back to 2.3.0, and especially lwIP 1.4, was the only way to get it running stable. But again, just my view of this in my specific environment...

And yes, big thanks @TD-er and @Grovkillen for the great work they do and time they invest for the community!

Grovkillen commented 6 years ago

Thank you all, @TD-er summarizes the road forward pretty well.

About the rest of the main issues:

  • memory usage
  • JSON import/export of settings
  • MQTT import redesign
  • some plugins like P001-switch should be changed.
  • What's left.

And we will revert back to 2.3.0 for tomorrow's release and test that out for a while.

susisstrolch commented 6 years ago

Most of my self-built images are based on core rev. 491c9b8b (2.4.1 + x). The only thing I see are random reboots with my Sonoff 4CH device. Unfortunately it's part of my pond control, so there is no chance to connect a serial interface for better monitoring. Syslog is pretty unusable, because the relevant info is spit out before WIFI is up and running.

It's pretty usable, as long as you use the lwIP 'v2 Higher Bandwidth' library. Otherwise you'll see problems with MTU fragmentation with packets > 512 bytes (out of order, with improper window information).

The working ESPEasy revs (in my repo) are

commit 3576619181926b3adff5a1a133390eb71e808ae9
Merge: 9038bd2 d083a58
Author: Susis Strolch
Date: Fri Apr 13 17:07:30 2018 +0200

Merge remote-tracking branch 'upstream/mega' into mega

* upstream/mega:
  automaticly updated release notes for mega-20180413
  [wifi] Event based wifi, fix set AP and crash on start

and

commit daf39a064d3633fe1eccfa33576fafbccd7611a7
Merge: 2a96218 806a275
Author: Susis Strolch
Date: Mon Apr 9 09:15:52 2018 +0200

Merge remote-tracking branch 'upstream/mega' into mega

* upstream/mega:
  automaticly updated release notes for mega-20180409
  Both reset/factoryreset option
  Factory Reset (not enabled yet)

Any ESPEasy after Fri Apr 13 shows horrible (that is, non-working) results, even when erasing the whole flash before flashing the binary (via Arduino IDE).

So, I'd suggest to go with 2.4.1 (or later) and polish ESPEasy (WIFI and config). Core itself seems to be ok so far.

Oxyandy commented 6 years ago

lol @Friday the 13th, an unlucky day indeed.. So what is "polish ESPEasy (WIFI and config)." ? A different branch..? Poland aka Polish or polish as in buff & shine, ha

Oxyandy commented 6 years ago

@susisstrolch How do I catch errors like this? "problems with MTU fragmentation with packets > 512 bytes (out of order, with improper window information)."

uzi18 commented 6 years ago

Just send a prepared request to the ESPEasy webserver, with an additional header of at least 512 characters.
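
A minimal sketch of how such a prepared request could be assembled. The header name `X-Padding`, the function name `buildPaddedRequest` and whatever host and path you pass in are placeholders chosen for illustration; any header the server ignores would do.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Build a GET request whose extra header value is `padLen` characters long.
// "X-Padding" is an arbitrary, hypothetical header name that the server should ignore.
std::string buildPaddedRequest(const std::string& host,
                               const std::string& path,
                               std::size_t padLen) {
    // The path must be absolute, as required in an HTTP/1.1 request line.
    assert(!host.empty() && !path.empty() && path[0] == '/');
    std::string req;
    req += "GET " + path + " HTTP/1.1\r\n";
    req += "Host: " + host + "\r\n";
    req += "X-Padding: " + std::string(padLen, 'x') + "\r\n";
    req += "Connection: close\r\n";
    req += "\r\n"; // empty line ends the header block
    return req;
}
```

Sending the resulting bytes over a plain TCP connection to the node while a packet capture runs on the other side should show whether the reply arrives in order.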

susisstrolch commented 6 years ago

@Oxyandy: running tcpdump on my FHEM server and analysing with Wireshark, I found that the last 512 bytes of a ~700 byte JSON response were sent first, followed by the HTTP header. And those two packets were simply missing the TCP window information. Can send more details on request...

polish as in buff & shine

micropet commented 6 years ago

For me, the version of 22.04.2018 with Core 2.4.1 runs quite well.

[screenshot: sysinfo]

TD-er commented 6 years ago

Could you also check my work of yesterday, but then built on 2.4.1? https://github.com/TD-er/ESPEasy/tree/bugfix/wifi_stability

In 2.3.0 I still had issues with static IP. Did not yet test the AP mode with setup page.

micropet commented 6 years ago

I just flashed a Wemos with your version ([wifi] Attempt to make event based wifi simpler).

Version runs.....

What should I do now?

micropet commented 6 years ago

Shit, from my last test (22.04.2018) 4 of 8 devices hung up after about 7 hours.

TD-er commented 6 years ago

I guess no log? :( Did the nodes crash (hang) or just not reconnect? Do they reply to ping, so that only the webserver is disabled or too busy (MQTT reconnect takes a lot of resources)?

micropet commented 6 years ago

Meanwhile 5 devices hang. I do not have a log. The web server is not accessible. Ping does not work either.

They are just dead.