platformio / platform-espressif8266

Espressif 8266: development platform for PlatformIO
https://registry.platformio.org/platforms/platformio/espressif8266
Apache License 2.0

wifi issues with default compiler optimization flags #288

Open mamama1 opened 1 year ago

mamama1 commented 1 year ago

Hi

We have stumbled across WiFi issues that we were able to mitigate by adding a delay(1) directly in the main loop (i.e. effectively allowing background tasks such as WiFi handling to run) or by adding serial output messages at random places.
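The ESP8266 Arduino core handles WiFi cooperatively. Its documentation states that the WiFi and TCP/IP libraries get a chance to process pending events each time loop() completes or when delay() is called. The host-side C++ sketch below is only a toy model of that idea, not the core's actual scheduler. CooperativeModel, yieldNow, loopWithoutYield and loopWithYield are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical model, NOT the ESP8266 core's real scheduler: "system" work
// (standing in for WiFi/TCP handling) only runs when user code yields or
// when the loop body returns.
struct CooperativeModel {
    std::uint32_t systemRuns = 0;  // how many times the background task got a turn

    // Stand-in for the SDK's pending WiFi/TCP processing.
    void backgroundTask() { ++systemRuns; }

    // Stand-in for yield()/delay(): hands the CPU to the background task.
    void yieldNow() { backgroundTask(); }

    // Runs the user's loop body once. The background task gets a turn only
    // afterwards, mirroring "loop() returned, the system runs".
    void runOnce(void (*loopBody)(CooperativeModel&)) {
        loopBody(*this);
        backgroundTask();
    }
};

// Loop body that does all of its work without yielding.
void loopWithoutYield(CooperativeModel&) {
    volatile std::uint32_t sink = 0;
    for (std::uint32_t i = 0; i < 1000; ++i) sink = sink + i;  // background never runs in here
}

// Same work, but yielding every 100 iterations (the delay(1) workaround).
void loopWithYield(CooperativeModel& m) {
    volatile std::uint32_t sink = 0;
    for (std::uint32_t i = 0; i < 1000; ++i) {
        sink = sink + i;
        if (i % 100 == 99) m.yieldNow();  // give the background task a turn
    }
}
```

For the same amount of work, runOnce(loopWithoutYield) lets the background task run once, while runOnce(loopWithYield) lets it run 11 times. This is why an extra delay(1) can change behavior even when the sketch's own logic is unchanged.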

We found that one simple for loop (iterating from 0 to 10) seems to contribute to the issue. However, 10 iterations shouldn't block the ESP8266 for long, especially since we're only doing a few very simple checks there (mostly comparing bool variables). Adding a serial output in a place that isn't even executed 99.9% of the time also mitigates the WiFi issue.

So we came to the conclusion that some compiler optimization must be interfering, and we tried -O2 instead of the default -Os compiler flag. With that, WiFi works flawlessly, without adding any delays or serial outputs to our code.
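Rather than switching the whole project to -O2, one possible way to bisect the problem is to change the optimization level of individual functions. GCC (the compiler used by the ESP8266 toolchain) supports an optimize function attribute and an equivalent #pragma GCC optimize for whole translation units. The GCC manual says the attribute is meant for debugging and is not suitable for production code, so treat this as a diagnostic aid only. OPTIMIZE_O2 and suspectFunction are hypothetical names. On compilers without the attribute, the macro expands to nothing.

```cpp
#include <cstdint>

// Hypothetical macro: request -O2 for one function while the rest of the
// build keeps -Os. GCC-only; the GCC manual describes this attribute as a
// debugging aid, not something to ship.
#if defined(__GNUC__) && !defined(__clang__)
#define OPTIMIZE_O2 __attribute__((optimize("O2")))
#else
#define OPTIMIZE_O2
#endif

// Hypothetical suspect function. If compiling just this function at -O2
// makes the problem disappear, that narrows down where to look.
OPTIMIZE_O2 std::uint32_t suspectFunction(const std::uint8_t* data, std::uint8_t length) {
    std::uint32_t sum = 0;
    for (std::uint8_t n = 0; n < length; ++n) sum += data[n];
    return sum;
}
```

A change in behavior would still not prove a compiler bug. Undefined behavior elsewhere in the program (for example, an out-of-bounds write) can also be exposed or hidden by a different optimization level.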

Since our code is huge and complex, I am not able to post a minimal sketch that reproduces the issue. Any small change to the code can completely change the behavior. But as an example, I can show which completely illogical changes make WiFi processing work again:

    for (uint8_t n = 0; n < RFM69_TX_QUEUE_LENGTH; n++)
    {
        // delay(1);
        // Only work on packets where NewPacket = true and if they have retries left.
        // Do not work on Packets where ACKReceived = true, since those have already 
        // been sent AND ACKed by the peer. They are only waiting to be cleared by user code.
        if (this->TXQueue[n].NewPacket == false || this->TXQueue[n].TXRetries == 0 || this->TXQueue[n].ACKReceived == true)
            continue;

        // LOG("%u", n);
        // ... additional stuff happens here ...
    }

This for loop processes packets waiting in a TX queue. Most of the time the queue is empty, so every iteration hits the continue right after the first if statement. With the default -Os compiler flag and the code above, WiFi does not work (it never connects). If I uncomment the LOG output AFTER the if statement (which is usually skipped by the continue, so the LOG doesn't even execute), WiFi suddenly connects and works again. If I leave the code as it is and build with -O2 instead of -Os, WiFi also works immediately.
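To make the cost argument concrete, here is a standalone sketch of the loop above with the "additional stuff" replaced by a counter. TXPacket and countPacketsToSend are hypothetical, and RFM69_TX_QUEUE_LENGTH is assumed to be 10 based on the "0 to 10" description. Only the field names come from the original snippet. With an empty queue every slot takes the continue branch, so the loop amounts to about ten trivial comparisons.

```cpp
#include <cstdint>

// Assumed queue length; the issue only says the loop runs "from 0 to 10".
constexpr std::uint8_t RFM69_TX_QUEUE_LENGTH = 10;

// Hypothetical packet slot using the field names from the issue's snippet.
struct TXPacket {
    bool NewPacket = false;
    std::uint8_t TXRetries = 0;
    bool ACKReceived = false;
};

// Returns how many slots would reach the "additional stuff" part of the loop.
// With an empty queue every slot hits `continue`, so the loop is only
// RFM69_TX_QUEUE_LENGTH cheap comparisons.
std::uint8_t countPacketsToSend(const TXPacket (&queue)[RFM69_TX_QUEUE_LENGTH]) {
    std::uint8_t toSend = 0;
    for (std::uint8_t n = 0; n < RFM69_TX_QUEUE_LENGTH; n++) {
        // Same skip condition as the original: not new, out of retries, or already ACKed.
        if (!queue[n].NewPacket || queue[n].TXRetries == 0 || queue[n].ACKReceived)
            continue;
        ++toSend;  // in the real code, the packet would be processed here
    }
    return toSend;
}
```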

LOG is a macro which calls Serial.printf(), nothing special. Changing code in a completely unrelated place led us to the conclusion that compiler optimizations must be messing with us and to me it looks like this is indeed the case.

Without a deeper analysis of our code: can it be said in general that -Os can be problematic? Is our code more likely the issue? What should we be looking for? Is there a way to find out what exactly is preventing WiFi from connecting? Should we just use -O2 and be happy?
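One way to start on the last question is to measure how long the sketch goes without returning from loop() or yielding. The ESP8266 Arduino core documentation suggests adding a delay() call to loops that take a long time (it mentions more than 50 ms) to keep the WiFi stack running smoothly. LoopGapMonitor is a hypothetical helper. On the device, one would call tick(micros()) at the top of loop(). Here the timestamp is a parameter and hostMicros() uses std::chrono, so the logic can also be tested on a PC.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper that records the longest gap between consecutive
// tick() calls. On the ESP8266 one would pass micros(); the time source is
// a parameter so the logic can be tested on a host.
class LoopGapMonitor {
public:
    explicit LoopGapMonitor(std::uint32_t warnAboveUs) : warnAboveUs_(warnAboveUs) {}

    // Call once per loop() iteration with the current time in microseconds.
    // Returns true when the gap since the previous call exceeds the threshold.
    bool tick(std::uint32_t nowUs) {
        bool tooLong = false;
        if (hasPrevious_) {
            std::uint32_t gap = nowUs - previousUs_;  // unsigned math survives micros() wraparound
            if (gap > maxGapUs_) maxGapUs_ = gap;
            tooLong = gap > warnAboveUs_;
        }
        previousUs_ = nowUs;
        hasPrevious_ = true;
        return tooLong;
    }

    std::uint32_t maxGapUs() const { return maxGapUs_; }

private:
    std::uint32_t warnAboveUs_;
    std::uint32_t previousUs_ = 0;
    std::uint32_t maxGapUs_ = 0;
    bool hasPrevious_ = false;
};

// Host-side stand-in for micros(), truncated to 32 bits like the original.
std::uint32_t hostMicros() {
    using namespace std::chrono;
    return static_cast<std::uint32_t>(
        duration_cast<microseconds>(steady_clock::now().time_since_epoch()).count());
}
```

Constructing the monitor with 50000 (an assumption taken from the 50 ms guideline) and logging maxGapUs() whenever tick() returns true would show whether some code path keeps the WiFi stack waiting. If no long gaps show up, a timing problem is less likely and memory corruption becomes a stronger suspect.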

Could someone please advise whether this is likely an issue with the compiler optimizations (and not our fault), or whether we should dig deeper into our code, and give us some pointers?

Thanks! PS: I opened a GitHub issue instead of a forum post because this might really be related to the chosen compiler optimization flags.

TD-er commented 11 months ago

Hmm, too bad this post didn't get any reply, as I'm really curious whether others see similar behavior.

I can at least confirm that the WiFi stability on ESP8266 may appear to be completely unpredictable from build to build and even a small completely unrelated change somewhere in the code seems to 'fix' or 'break' WiFi code as you also described.

I always thought there might be some bug related to (string) buffer size being one byte too small or not being 0-terminated somewhere which depends on the order of linking object files. But changing optimization flags may also be a good explanation.
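A minimal sketch of the bug class described here, using standard C library calls. copyNameUnsafe and copyNameSafe are hypothetical helpers. A missing terminator makes the program read whatever bytes happen to follow the buffer in memory. This could explain why link order or optimization level makes the symptoms appear or disappear.

```cpp
#include <cstdio>
#include <cstring>

// Risky: strncpy() does NOT write a '\0' when src fills the whole buffer,
// so later string functions read past the end (undefined behavior).
void copyNameUnsafe(char* dst, std::size_t dstSize, const char* src) {
    std::strncpy(dst, src, dstSize);
}

// Safer: snprintf() always terminates the output when dstSize > 0 and
// returns the length it wanted to write, so truncation can be detected.
bool copyNameSafe(char* dst, std::size_t dstSize, const char* src) {
    int needed = std::snprintf(dst, dstSize, "%s", src);
    return needed >= 0 && static_cast<std::size_t>(needed) < dstSize;  // false if truncated
}
```

With a 4-byte buffer, copyNameUnsafe(raw, 4, "abcd") fills every byte with a character and leaves no terminator. copyNameSafe reports the truncation instead and still produces a valid string.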

valeros commented 11 months ago

Not sure how to help here as we use the same optimization flags as the Arduino core for ESP8266. Without a minimal project that works with the Arduino IDE and doesn't with PlatformIO, we cannot affirm that this behavior has something to do with PlatformIO's build process.

valeros commented 11 months ago

@mcspr correctly suggested that an outdated local toolchain package might be the culprit. The new platform v4.2.1 contains a stricter minimal toolchain version requirement so that PlatformIO is forced to use the latest packages with bugfixes.