nettigo / namf

Nettigo Air Monitor Firmware
GNU General Public License v3.0

Issues on ESP8266 with 16MB flash #77

Open danielskowronski opened 5 months ago

danielskowronski commented 5 months ago

Summary

When you use an ESP8266 board that has 16MB flash instead of 4MB, the system is barely usable; most notably, the config page fails to load about 99% of the time.

Tests

Scenario:

  1. Erase entire flash of fresh ESP8266 board that has 16MB memory chip
  2. Flash with any version of NAMF (I believe the same may be the case for original firmware). I was doing most of my tests on NAMF-2020-46rc5-en.bin
  3. Connect to the board in AP mode with any computer that's close to the board. In my environment, NAMF was reporting a stable -31 dBm signal to client device.
  4. Perform time curl -v http://192.168.4.1/config -o /dev/null several times

Expected outcome (true on same board but with 4MB flash):

Actual outcome:

Other observations:

Background on flash itself

ESP8266 usually has 4MB SPI flash. However, some boards marketed as "WeMos D1 Mini Pro" that are preferred for their external antenna are sold with 16MB SPI flash.

That flash chip can be of varying quality, but all my boards use Zbit ZB25VQ. Additionally, I'm quite certain that a faulty chip can be ruled out, as flashing and dumping speeds were consistently the same on all boards.

Moreover, I tested all combinations of below flashing parameters (in esptool.py):

While all of those affect actual runtime (NAMF reports it) and there are some minor performance differences, none of them solves the issue. The testing sample may be too small, but the best results were with the 16MB/QOUT/40MHz combination: when the config page did load, it took 1.5-3 seconds, and the success rate was somewhere between 50% and 75%.

SPIFFS is using flash to store configuration, and it seems like during config page rendering it's read many times.

The chip I have on the 16MB board seems to be relatively rare, but some people have reported similar issues on completely different platforms using ESP8266. For reference, it's 25VQ128DSJG.

My suspicions

After some digging, I think the culprit is SPIFFS on 16MB flash in general.

NAMF uses the 4m2m layout by default, which means 4MB total flash, 2MB for SPIFFS, leaving 2MB for the compiled program. I tried to rule out some other aspects of that profile that are implemented in Arduino, like block size etc. The motivation is that SPI flash chips actually have an internal cache, and the cache miss ratio could be high due to different sizes of data. To test that, I tried to build an image of NAMF with 16m2m and 16m1m. This yielded little to no improvement.
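To make the naming scheme concrete, here is a minimal Python sketch that splits a layout name into its parts, following the description above (first number is total flash, second is the SPIFFS size, both in MB). The `FlashLayout` type and `parse_layout` function are hypothetical helpers written for this illustration, not part of NAMF or the Arduino tooling, and the sketch only handles whole-megabyte names.

```python
import re
from typing import NamedTuple


class FlashLayout(NamedTuple):
    total_mb: int      # size of the whole flash chip
    fs_mb: int         # part reserved for the SPIFFS filesystem
    remaining_mb: int  # what is left outside the filesystem


def parse_layout(name: str) -> FlashLayout:
    """Split a layout name such as '4m2m' into its sizes in MB."""
    match = re.fullmatch(r"(\d+)m(\d+)m", name)
    if match is None:
        raise ValueError(f"unrecognised layout name: {name!r}")
    total_mb, fs_mb = int(match.group(1)), int(match.group(2))
    if fs_mb >= total_mb:
        raise ValueError("filesystem must be smaller than the flash chip")
    return FlashLayout(total_mb, fs_mb, total_mb - fs_mb)


# The three layouts mentioned in this issue.
for name in ("4m2m", "16m2m", "16m1m"):
    print(name, parse_layout(name))
```

Note that the "remaining" figure is plain subtraction; how the toolchain actually divides that space is not covered here.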

In other words, I tried both:

Unfortunately, there's no easy way to create a custom profile like 16m14m, as those are compiled into development tools and not parametrized.

https://github.com/esp8266/Arduino/issues/7095#issuecomment-589885905 seems to confirm that SPIFFS is expected to struggle on 16MB chip, and it doesn't matter how much of memory is actually allocated.

I also tried changing the code of the config page render process to send chunks more often and to include free memory stats in the HTML response, but it looks like that's not the core of the issue. However, after randomly poking around the code, it looks like memory is not explicitly freed - it's reserved on https://github.com/nettigo/namf/issues/77 and probably should be freed here: https://github.com/nettigo/namf/issues/77

Proposed solution

At the moment, we can at least put a huge warning in the docs that 16MB flash is problematic.

Ideally, SPIFFS should be replaced with LittleFS, as SPIFFS is deprecated and expected to be dropped from the ESP8266 Arduino SDK completely. On the other hand, it looks like LittleFS has its own issues, and it may not solve the 16MB variant issues alone.

Web UI rendering rewrite

I was trying to figure out what's the core of the problem, but this spaghetti code originating from the original project is a nightmare. A few months ago, when I was considering porting some feature from the original project (SSL for InfluxDB) I was scared by the number of places I'd need to change to add a simple option.

I hope to find time to work on my fork that would replace HTML and JSON rendering that's currently a series of functions with a more modern approach: static HTML and JS files hosted from SPIFFS and only dynamic content implemented by serving JSON (it can be easily rendered from structs).
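As a rough illustration of the "dynamic content as JSON rendered from structs" idea, here is a minimal Python sketch. It is not NAMF code (the firmware is C++), and the `SensorStatus` type, its field names, and the sample values are all hypothetical; it only shows how a struct-like object can map directly to a JSON body that a static page could fetch and render on the client.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SensorStatus:
    # All field names are hypothetical, chosen only for illustration.
    wifi_ssid: str
    signal_dbm: int
    free_heap_bytes: int


def render_status(status: SensorStatus) -> str:
    """Serialise the struct-like object to the JSON body of a response."""
    return json.dumps(asdict(status), sort_keys=True)


# Illustrative values only; a static page would request this via XHR.
print(render_status(SensorStatus("example-ap", -31, 20480)))
```

The point of the design is that the device only produces a small, flat payload, while all HTML layout lives in static files and is assembled by the browser.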

This Web UI already uses plenty of JavaScript and XHR to load content (e.g. WiFi list), so I believe it wouldn't be a big deal (i.e. there are no hardcore users with NoScript). Moreover, all endpoints used in external integrations could be left as-is, so it'd stay compatible. Furthermore, it's a weird decision to perform heavy tasks like rendering templates and actual HTML tags on ESP8266, when client devices have several orders of magnitude more power and memory to render those on the frontend.

My fork is https://github.com/danielskowronski/namf/tree/rewrite_webui

netmaniac commented 5 months ago

Thank you for the effort you have put into this. For now, only a few words on the topic; I'll try to write a bit more in a few days.

Regarding issue 7095 that you mentioned: I believe they are talking about a 16M filesystem, not the chip size. AFAIK SPIFFS ignores all unused flash, so in terms of data structures the filesystem is 2M. I have also observed the behavior you report, and it is one of the reasons we don't sell Wemos D1 mini Pro boards. Apparently boards from some manufacturers are OK and work well, and some have this problem. So I'm not sure if this is a software or hardware problem, and I can't rule out either cause. Does your Wemos have the antenna/IPX selection jumper placed at a 45° angle to the ceramic antenna?

Regarding changing the design of how webconfig works: I did a test with such an approach, but since the current ESP8266WebServer is single-threaded and blocking, performance (as seen by the user) was dramatically poor. I also tested replacing it with ESPAsyncWebServer; the user experience was much better, but I had some other problems with this server. Unfortunately, I cannot find my notes from this project, so I cannot say exactly what the problem was. I believe that async_web is this code.

danielskowronski commented 5 months ago

No, my "pro" board has an antenna resistor at 90° (it's that weird rare variant).

After running into some issues with the standard 4MB Wemos D1 Mini (the one sold as part of NAM kit), I'm not so sure if it's SPIFFS that causes all the issues (but probably still makes it harder to debug).

My NAM 0.3.3 box was working for weeks, but today it disconnected from Wi-Fi and refused to reconnect in a stable manner (when it connected, every endpoint was tragically slow). I actually investigated the issue live for the first time, but it wasn't the first occurrence (although it always fixed itself).

After desperate attempts at re-flashing, I even faced full firmware crashes, but they were probably caused by corrupted config.json stored on SPIFFS. I thought it could be related to https://github.com/esp8266/Arduino/issues/6007 or https://github.com/esp8266/Arduino/issues/5493, but my ancient Wemos from NAM 0.2.1 exhibits the same issue with wake-ups; however, it never actually failed on me with Luftdaten/NAMF. Validated flash by dumping and it's working fine. I even tried lowering Wi-Fi TX power to try keeping flash stable.

I've seen similar issues with SPIFFS + multipart HTTP + ESP8266 in multiple places, but nothing really stood out. I even tried building an image with even more debug and random changes to webserverPartialSend trying to correlate content sent with content received. That, plus going one layer deeper, was the way! Since curl was vague in telling me what actually broke in the HTTP response, I looked at network traffic in Wireshark. And I've seen a sea of red and black lines for spurious retransmits and other out-of-order TCP packets.

After some tweaking, I reset TX power to max (20.5, even if the code accepts any number) and changed Wi-Fi mode to 802.11b. I never had luck with N when connecting to my Wi-Fi router, so a long time ago I assumed it had to do with some compatibility issue and stuck to G. However, with B mode, the connection is the most stable and retransmits are very rare. My current theory is that the ESP8266 running NAMF must somehow be so overloaded that it struggles to run more modern variants.

The most bizarre twist: that 16MB "pro" board started to work with 802.11b - it's relatively slow, but renders config page nearly 100% of the time.

With 802.11b and 20.5dBm TX power on rc5, the config page renders on those boards in:

That suggests off-brand SPI flash is significantly slower, at least with SPIFFS.

@netmaniac I've seen some work done on the ESP32 variant - how is it going? Ideally, we'd need a new main board to support it, but I think that using wire spaghetti to convert some of the available ESP32 dev boards would be worth trying, to fix performance issues in the vast ocean of non-genuine D1 Mini variants. ESP32-CAM seems the easiest board to adapt due to its low price, similar size and built-in external antenna connector (we could just ignore the dangling camera and take advantage of microSD for log storage).

danielskowronski commented 5 months ago

I am beginning to wonder if off-brand SPI is just an indicator of poor quality of ESP8266 dev board, rather than root cause.


After some tweaks to my Asus Wi-Fi router (I lost track of what's causing most of the issues, but most likely Airtime Fairness has to be turned off) and sticking to WIFI_PHY_MODE_11B, I've run some tests, and it starts to look more like a network issue. I was pointed in that direction by discrepancies between the /config and /config.json pages - their sizes differ, and the chunked response generates many TCP packets, which likely causes an issue similar to a memleak, but on the TCP stack.

I developed a "test suite" that runs curl against /config 1000 times and records the curl exit code (0 for OK, 56 for malformed chunked response) plus execution time. That way, results are somewhat comparable, and if any memory leak is present, it should show up over a series of non-stop requests. Those are executed from a machine connected by cable to the router, against firmware running on the 16MB "pro" board that started this issue, in forced B mode. Everything is compiled using this new platformio target: https://github.com/danielskowronski/namf/commit/6510b4e950b5a4b027117a88cd2c323466ce40be#diff-4446afd728a4f34cbcddc306a9cb6be845d1a61c216076a295683bcc9c106724
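The original test script is not included in this thread, so the following is only a minimal Python sketch of the approach described above: call curl repeatedly, record exit code and wall-clock time, then summarise. The URL, the `--max-time` limit, and the `run_once`, `summarize` and `main` names are assumptions made for illustration.

```python
import os
import subprocess
import time
from collections import Counter

# Assumption: the AP-mode address from the first scenario; replace with
# the address of the board on your network.
URL = "http://192.168.4.1/config"


def run_once(url, runner=subprocess.run):
    """Run curl once and return (exit_code, elapsed_seconds)."""
    started = time.monotonic()
    # --max-time is an assumed safety limit, not part of the original suite.
    completed = runner(
        ["curl", "-s", "-o", os.devnull, "--max-time", "30", url],
        check=False,
    )
    return completed.returncode, time.monotonic() - started


def summarize(results):
    """Count exit codes and average the duration of successful requests."""
    codes = Counter(code for code, _ in results)
    ok_times = [seconds for code, seconds in results if code == 0]
    return {
        "total": len(results),
        "exit_codes": dict(codes),
        "success_rate": codes[0] / len(results) if results else 0.0,
        "mean_ok_seconds": sum(ok_times) / len(ok_times) if ok_times else None,
    }


def main(count=1000):
    """Issue `count` back-to-back requests and print the summary."""
    print(summarize([run_once(URL) for _ in range(count)]))
```

Calling `main()` starts the full run. Passing the `runner` as a parameter is a design choice that lets the timing and bookkeeping be checked without a board attached.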

One thing that should be ruled out is my Wi-Fi setup; ideally, to get reproducible results, the tests should be re-run on some Raspberry Pi.

First thing I checked was server.client().setNoDelay(1); set before each chunk in webserverPartialSend (see https://github.com/nettigo/namf/blob/beta/src/webserver.cpp#L11):

And this seems to make sense after reading https://arduino-esp8266.readthedocs.io/en/latest/esp8266wifi/client-class.html#setnodelay and the linked Wikipedia article about delayed ACK, especially when using slow 802.11b/g and having the ESP8266 outdoors (plus slow flash probably doesn't help).
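For readers less familiar with the option behind setNoDelay: on desktop systems the equivalent switch is the TCP_NODELAY socket option, where a value of 1 disables Nagle's algorithm and 0 leaves it enabled. The Python sketch below only toggles and reads back that option on an unconnected socket; it does not talk to the board, and `make_socket` and `nagle_enabled` are hypothetical helper names.

```python
import socket


def make_socket(no_delay: bool) -> socket.socket:
    """Create an unconnected TCP socket with Nagle's algorithm on or off."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 disables Nagle's algorithm (small writes go out at once);
    # TCP_NODELAY = 0 leaves it enabled (small writes may be coalesced).
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1 if no_delay else 0)
    return sock


def nagle_enabled(sock: socket.socket) -> bool:
    """Report whether Nagle's algorithm is active on the given socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) == 0


for flag in (False, True):
    with make_socket(flag) as sock:
        print(f"no_delay={flag} -> Nagle enabled: {nagle_enabled(sock)}")
```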

This is an example of transmission with setNoDelay(1), which disables Nagle's algorithm - it's working against delayed ACK, as chunked HTML is further split into smaller TCP packets:

image

And this is without setNoDelay(1), i.e. with Nagle's algorithm left enabled - splitting only happens to fit into Ethernet frames:

image

Again, it seems that if anything goes wrong with a single TCP packet, poor ESP8266 has to retransmit it, deal with other packets that are now out of order and ultimately fails to render something that makes sense in a reasonable time, thus forcing the client to just terminate connection.

I'll post full test results for WiFi mode x Nagle x board (4MB & 16MB from 2024, plus 4MB from 2019 that's a genuine Wemos with 99% certainty). From partial tests with the 16MB board switched from B to G, though, I already started to see the original behaviour with extremely long response times.

I'll also try to add parameters for Nagle and some timeouts, especially for the Wi-Fi connection, as with N it often fails to connect within the time limit :/

danielskowronski commented 5 months ago

Mystery of N-mode not working is solved - AP mode is known to work only in B/G, STA should work, but Asus (and some other vendors) are stubborn in following 802.11n specs and silently ignore connections from certain SDK versions that do not advertise WMM (it manifests as established link-layer connection, but no connectivity - most notably no DHCP). Best described on https://github.com/esp8266/Arduino/issues/7965

Moreover, the linked issue also mentions issues with poor quality crystal oscillator that also may be influencing connectivity, especially with TCP. I constantly observe random disconnects on all new (aka non-Wemos) boards when used outside. If that's confirmed, then I think there's no future for supporting ESP8266 if we can't get good quality boards (and we can't get Wemos ones, since they no longer make them).

For me, that was the last nail in the AsusWRT coffin, so I'll soon have Mikrotik setup to play around and determine the true root-cause of those connectivity issues on the NAM end. If nothing else, I'll finally get rid of the incoherent mess that's Wi-Fi configuration on AsusWRT...

But for now, I added PIO_FRAMEWORK_ARDUINO_ESPRESSIF_SDK22x_190313 and it's still not connecting as an N client (I suspect Asus made further "adjustments" to their firmware). In G mode there are huge packet losses, so only B mode works, but that one struggles outdoors...