me-no-dev / ESPAsyncWebServer

Async Web Server for ESP8266 and ESP32
3.77k stars 1.22k forks source link

Bug: Using server.serveStatic causes WDT crashes almost every 2nd time for files >150K #984

Closed ullix closed 2 years ago

ullix commented 3 years ago

I have static, binary log files which I download via WiFi. Works well most of the time, but then there are perpetual crashes. It seems the success depends on the size of the files. Up to 64k mostly ok, 64k ... 128k ok for half the trials, beyond that size almost never ok.

I am using the LittleFS file system. The files are build up of 32 bytes binary records, and may eventually grow >2MB. I use this command for downloading as static file:

// file system definition:   fs::LITTLEFSFS  * myFS       = &LITTLEFS;
server.serveStatic("/log.cam",      *myFS, "/log.cam");

I also tried this. Works just as well, but no improvement:

server.on("/log.cam", HTTP_GET, [] (AsyncWebServerRequest *request) {request->send(*myFS, "/log.cam", application/octet-stream");});

The download is triggered from a website, either as a straight link to download the binary file, or as Javascript code inside a function:

Link:   "<a href='/log.cam' >CAM</a>"
...
JS:   const fetchdata     = await fetch('/log.cam', {cache: "no-store"});

Your ReadMe seems to suggest that chunked responses are NOT needed for static files? So, I did not use that here. Am I forgetting any settings, or do I also need to use chunked response when the static file is beyond a certain size? Which size?

ullix commented 3 years ago

Upon further investigations I conclude that this is a bug. For me a very serious one as it drastically reduces download speed by a factor of more than 10!

As first test I also coded a chunked response, but as expected, it showed the same failures as reported above. However, this chunk code now allowed to define the chunk size, and this made a huge difference!

I created a 300k (307200 B) file and downloaded that in various configurations. In all tests it always crashed with a wdt message when the download touched 100000; I observed only the two values of 100030 and 101459. Message was:

E (144357) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
E (144357) task_wdt:  - async_tcp (CPU 1)
E (144357) task_wdt: Tasks currently running:
E (144357) task_wdt: CPU 0: IDLE0
E (144357) task_wdt: CPU 1: loopTask
E (144357) task_wdt: Aborting.
abort() was called at PC 0x401058a4 on core 0

I used various ESP32-WROOM-32E (all latest revision 3) with Flash of 4MB, 8MB, and 16MB. All gave the same result.

Using the code included below I tested for the impact of chunk size, and got a big surprise:

I tested the file systems FFat and LittleFS. Outcome was identical.

I noticed that the download was very slow, slower by about an order of magnitude!

So I used a file of 90k, which could be downloaded under all conditions, and measured download times:

Using 1028 Bytes makes it 15 times slower, but is the only option for big files, when you need speed most!

And one last surprise: removing the print options in chunk-code made it even slower, not faster!

Here is the relevant code:

// File system
// either:
    fs::F_Fat       * myFS       = &FFat;
// or:
    fs::LITTLEFSFS  * myFS       = &LITTLEFS;

// ESPAsyncWebServer setup
    server.on("/log.cam",       HTTP_GET, [] (AsyncWebServerRequest *request) { staticDataType = CAM;
                                                                                request->sendChunked(text_plain, onStaticDownLoad);});

// chunk code
size_t onStaticDownLoad(uint8_t *buffer, size_t maxLen, size_t index){

    static  char *   filename;                 // file to be sent as chunks
    static  File     chunkfile;                // file handle of filename
    static  size_t   dnlLen     = 1028;        // limit to buffer size
            uint32_t countbytes = 0;           // bytes to be returned

    if (index == 0)  {
        if (staticDataType == CAM)  filename    = LOGCAM;
        else                        filename    = LOGCPS;
        samm->printf("onStaticDownLoad: START filename: %s index: %6i maxLen: %5i\n", filename, index, maxLen);
        chunkfile       = myFS->open(filename);
    }

    chunkfile.seek(index);
    countbytes = chunkfile.read(buffer, min(dnlLen, maxLen));

    if (countbytes == 0) {
        chunkfile.close();
    }
    samm->printf("onStaticDownLoad: f: %s index: %6u maxLen: %5u countb: %5u \n"
                    , filename
                    , index
                    , maxLen
                    , countbytes
                );

    return countbytes;
}
ullix commented 3 years ago

This is still a major problem for me, so I investigated further. To simplify I only used the webserver with command

server.serveStatic("/bindataST",      *myFS, "/data.cam");

and only on ESP32-WROOM-32E (revision3, the latest models) with 4MB Flash with the LittleFS file system. I created binary data files of various sizes. (While saved in binary mode, all bytes were printable ASCII, so the file content was legible to humans).

The data shown below were compiled with Arduino 2.0 beta 7, and ESP core 1.0.6, but I repeated some tests with Arduino 1.8.13 and ESP core 1.0.4 and got same results.

I automated this by creating an auto-refreshing website containg JavaScript, by which a serveStatic function of the webserver was called, and logged to the Serial Terminal.

Up to a file size of 150 Kibibytes all downloads were always successful. In an overnight run with a 90KB file all 15833 downloads were successful, not a single failure.

However, with only a few bytes more, 160 Kibibytes, the number of Watch-Dog-Timer induced crashes increased progressively. All were of the same type:

E (69589) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
E (69589) task_wdt:  - async_tcp (CPU 1)
E (69589) task_wdt: Tasks currently running:
E (69589) task_wdt: CPU 0: IDLE0
E (69589) task_wdt: CPU 1: loopTask
E (69589) task_wdt: Aborting.
abort() was called at PC 0x4014246c on core 0

The successful downloads decreased to as low as 58%, i.e. almost every 2nd download attempt resulted in a crash and reboot! Somewhat strangely, beyond a file size of ~300K the success rate increased a little bit, and was 70% at 1.4MB, the biggest file possible in this setting.

Auswahl_015

image

As I had mentioned on previous posts, one can mitigate this WDT problem by using chunked downloads. However, this slows down the transfer drastically by 40...50 times! Here are two examples done subsequently:

This shows the web page result for serveStatic (picking one when it worked!):

Binary Data in Human-Readable Format
#Total: 1433600 bytes, 1400 kB, 44800 records, dur: 2834 ms, speed: 506 kB/s
1 0123456789ABCDEFGHIJKLMNOPQRSTUV
2 0123456789ABCDEFGHIJKLMNOPQRSTUV

A speed of 506 kB/s is very nice; top-speed observed was even a fantastic ~800kB/s!

With a chunked download you need to limit the maxLen of the chunk to 1028 bytes, or you face again the WDT crash issue. With that limit this is what you get:

Binary Data in Human-Readable Format
#Total: 1433600 bytes, 1400 kB, 44800 records, dur: 124706 ms, speed: 11 kB/s
1 0123456789ABCDEFGHIJKLMNOPQRSTUV
2 0123456789ABCDEFGHIJKLMNOPQRSTUV

Instead of 3 sec it now takes over 2 min, a speed of only 11 kB/s!

I am building a device which logs data, and I will use even the bigger 8MB and 16MB ESP variants and files filling the Flash. This webserver cannot handle this!

BlueAndi commented 3 years ago

You download the files in the context of the web server (Async TCP task). Did you try to move it to the loop context?

ullix commented 3 years ago

@BlueAndi How can I do this? I think it is not possible, but please, correct me.

My webserver initialization gets this code:

AsyncWebServer      server(80);
void initWebServer() {
    ...
    server.serveStatic("/bindataST",      *myFS, "/data.cam");
    server.begin(); 
}

Then from a website I call <ESP-IP>/bindataST to download file "data.cam" from my file system LittleFs. That's it. Which part can I relocate?

BlueAndi commented 3 years ago

You can't do this with serveStatic(), but you can handle the request by your own. Please note, below is just a rough principle I could imagine, which would be worthful to check. To be task safe, you can later use a freeRTOS queue, instead of a deferredRequest variable.


static AsyncWebServerRequest* deferredRequest = nullptr;

static void downloadPage(AsyncWebServerRequest* request)
{
    deferredRequest = request;
}

...
void setup()
{
    ...
    server.on("/bindataST/data.cam", HTTP_GET, downloadPage);
    server.begin();
    ...
}

...
void loop()
{
    ....
    if (nullptr != deferredRequest)
    {
        deferredRequest->send(LITTLEFS, "/bindataST/data.cam", String(), true);
        deferredRequest = nullptr;
    }
    ...
}
zekageri commented 3 years ago

Wow. This is gold

ullix commented 3 years ago

That sure is an idea! I hadn't expected it is possible. But it works.

However, only to a degree.

I put this in my code and tested file sizes up to 1400k on my ESP32-WROOM-E with 4MB Flash. The most sensitive file size range as shown in my graph above https://github.com/me-no-dev/ESPAsyncWebServer/issues/984#issuecomment-845206395 seemed to be around 300k. And again, with 200 ... 300 downloads on each file size, I saw WDT crashes only at 300k and 400k, but it did crash.

Then I pulled out my ESP32-WROOM-E with 16MB Flash and created a 12MB(!) file, and repeatedly downloaded over night. The downloads were a very decent 650kB/s, but that still takes near 20sec. But this time almost 20% of all download attempts crashed with a triggered Watch-Dog-Timer following the command deferredRequest->send(*myFS, "/data.cam", String(), true);:

Loop: deferredRequest handling
E (30449) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
E (30449) task_wdt:  - async_tcp (CPU 0/1)
E (30449) task_wdt: Tasks currently running:
E (30449) task_wdt: CPU 0: IDLE0
E (30449) task_wdt: CPU 1: loopTask
E (30449) task_wdt: Aborting.
abort() was called at PC 0x40140ae4 on core 0

What I find strange is the async_tcp (CPU 0/1) message. What does it mean? Is the WDT triggered from both CPU cores at exactly the same time? Out of a total of 293 crashes, 292 were of this (CPU 0/1) type, and only 1 of (CPU 1) type, none of (CPU 0) type.

So, while greatly improved, this is not the final solution. What else can I do?

zekageri commented 3 years ago

This seems the common async server crash problem to me.

Can you replace the ESPAsyncWebServer lib with this forked version by Depau and try again?

I'm currently testing this fork and it seems very stable

ullix commented 3 years ago

I'll try. Looking at both ReadMEs, the Depau version seems to be taken verbatim from me-no-dev. Where is the difference in the code? Which part is changed?

One other thing bothers me: Both parties offer their code is explicitly for ESPs. I learned that neither PROGMEM nor the F() macro make any sense on these CPUs (https://www.esp32.com/viewtopic.php?f=19&t=20595 ). Yet, both offerings expressly advertise for PROGMEM and for F(). Are they not aware of this? I don't believe that. So, why this nonsense?

zekageri commented 3 years ago

Here is the PR by Depau He fixed some linked list problems and things like that.

About the PROGMEM and F() things i don't know. Probably compatibility solutions between esp8266 and esp32 and things like that. There are so many esps around, you never know what was Me_no_dev's intention with these.

zekageri commented 3 years ago

I'll try. Looking at both ReadMEs, the Depau version seems to be taken verbatim from me-no-dev.

Yes. It is a fork from me-no-dev's lib. The readme and almost everything is the same except the linked list and some other fixes.

Pablo2048 commented 3 years ago

One other thing bothers me: Both parties offer their code is explicitly for ESPs. I learned that neither PROGMEM nor the F() macro make any sense on these CPUs (https://www.esp32.com/viewtopic.php?f=19&t=20595 ). Yet, both offerings expressly advertise for PROGMEM and for F(). Are they not aware of this? I don't believe that. So, why this nonsense?

The reason for this is compatibility - ESP8266 requires using PROGMEM/F() if you don't want to waste the RAM. Definitely NOT a nonsense...

ullix commented 3 years ago

Getting the Depau lib is not very promising, so far:

I downloaded the ZIP which gave me a file of this name: ESPAsyncWebServer-partial-header.zip. The 'partial' is irritating, but I tried several times and Firefox tells me the download had correctly completed.

The tried to install in Ar1.8.13 and got this message: Auswahl_020

I tried Ar2.0beta7 and got this: Auswahl_021

Then I tried $ gh repo clone Depau/ESPAsyncWebServer and got this: ERROR: ld.so: object 'libgtk3-nocsd.so.0' from LD_PRELOAD cannot be preloaded (failed to map segment from shared object): ignored.

Now, what?

BlueAndi commented 3 years ago

What I find strange is the async_tcp (CPU 0/1) message. What does it mean? Is the WDT triggered from both CPU cores at exactly the same time? Out of a total of 293 crashes, 292 were of this (CPU 0/1) type, and only 1 of (CPU 1) type, none of (CPU 0) type.

So, while greatly improved, this is not the final solution. What else can I do?

The async_tcp (CPU 0/1) sounds for me like, the loopTask is calling the send() function, which try to interact with the LwIP stack. But the LwIP stack is already waiting for the async_tcp queue, but the queue might be full. This causes the deadlock, just a rough guess.

To debug it down, I propose the following actions, each to be executed independently:

One other thing bothers me: Both parties offer their code is explicitly for ESPs. I learned that neither PROGMEM nor the F() macro make any sense on these CPUs (https://www.esp32.com/viewtopic.php?f=19&t=20595 ). Yet, both offerings expressly advertise for PROGMEM and for F(). Are they not aware of this? I don't believe that. So, why this nonsense?

Just a note to PROGMEM, it depends on the MCU architecture (harvard, von neumann) and to say the truth, there might be exceptions depended on the used adresses.

zekageri commented 3 years ago

Getting the Depau lib is not very promising, so far:

I downloaded the ZIP which gave me a file of this name: ESPAsyncWebServer-partial-header.zip. The 'partial' is irritating, but I tried several times and Firefox tells me the download had correctly completed.

The tried to install in Ar1.8.13 and got this message: Auswahl_020

I tried Ar2.0beta7 and got this: Auswahl_021

Then I tried $ gh repo clone Depau/ESPAsyncWebServer and got this: ERROR: ld.so: object 'libgtk3-nocsd.so.0' from LD_PRELOAD cannot be preloaded (failed to map segment from shared object): ignored.

Now, what?

You have to unzip the lib, and i suggest to you to use PlatformIO with VSCODE.

ullix commented 3 years ago

You have to unzip the lib.

I don't believe so. Both Ar1.8 and 2.0 have an explicit option to install ZIP libs. Like here Ar2.0

Auswahl_023

ullix commented 3 years ago

@BlueAndi I tried your 2nd option first and put this

#define CONFIG_ASYNC_TCP_RUNNING_CORE 0
#define CONFIG_ASYNC_TCP_USE_WDT 0

at the top of the code. Compiled and ran, made a 300k file and downloaded. Out of 10 attempts to download, only 1 succeeded, and 9 crashed. All for WDT, and all with - async_tcp (CPU 0/1). That is the worst-ever result.

I can only use IDEs Ar1.8 or Ar2.0. I have Platformio installed, but so far have troubles to run it (some installation problem).

What can be done? I can offer my test sketch, if you want to try it.

BlueAndi commented 3 years ago

Hmm ... I don't believe that the task watchdog was deactivated. I use platformio inside VSCode, as it offers more options. Provide your test sketch and I will try during the week.

ullix commented 3 years ago

Auswahl_038 I am still seeing these annoying crashes, though progress has been made. I am showing an encouraging update, and the code at the end.

Here is the current stand as a graphics in the exact scale as above (https://github.com/me-no-dev/ESPAsyncWebServer/issues/984#issuecomment-845206395).

Much, much better, and basically all due to BlueAndi's suggestion for deferring the download execution into the Loop. But still some 1% crashes, which strangely seem to be largely around the 300k.

Here a picture on the download speed:

image

The smaller files may be impacted by the download overhead, but from 500k onwards 700kB/s are reached and held. I have seen better - up to 800kB/s - with this chip, but 700 would be quite alright.

These data were obtained with the code, I am attaching. Please, see the READ.ME for usage instructions. I ran it under Arduino 1.8.13, Arduino2.0beta7, and the uecide IDE. Let me know of any issues.

TESTPUB_async_server.zip

EDIT: I forgot: in the IDE select "Core Debug Level = VERBOSE" !

zekageri commented 3 years ago

My async server prject running 4 days straight for now ( with continuous up and downloads ). Can you try replace me_no_dev's ESPAsyncWebServer lib with this forked one from Depau?

My below link was not the full fixed fork

gnalbandian commented 3 years ago

Folks, how are you.

Many of the problems with this library reside in the undelying use of AsyncTCP or ESPAsyncTCP library aswell. There many forks that have already got this working steadily.

The combination I' currently using is:

Lorol's ESPAyncWebServer fork: Lorol's

Adam5Wu ESPAsyncTCP fork Adams

Not sure if yours specific problems are solved by this combination, but it worked for me. No more crashes.

Also @philbowles have made his own version of this libraries and seems to work fine. Please check: https://github.com/philbowles/PangolinMQTT https://github.com/philbowles/ESPAsyncTCP-master

ullix commented 3 years ago

Can you try replace me_no_dev's ESPAsyncWebServer lib with this forked one from Depau?

@zekageri : This time it worked well to download the ZIP and install.

I ran it with the code I had published above. Unfortunately, it crashed as happily as the original me-no-dev code! At best, every second download crashed, but on average it was closer to 2 out of 3 crashing!

So this lib is not an improvement.

Am I correct that your 4-day-running project did not use my code, and that the downloaded files were smaller?

The sweet spot for bad behavior seems to be a 300k file. I make a binary file of records with 32 bytes of printable ASCII chars (makes it easier to read a "binary"), 9600 records, 307200 bytes.

I semi-automate the downloads by using HTML to auto-refresh the web page every 10 sec, and then count reboots and successful downloads from the Serial Monitor log.

I observed in some trials that having Core Debug Level to anything but 'None' made it worse; though that also was inconsistent. Anyway, I suggest to set Core Debug Level=None, and also comment out the esp_log_level_set("*", ESP_LOG_VERBOSE). It remains confusing.

ullix commented 3 years ago

Lorol's ESPAyncWebServer fork: Lorol's

@gnalbandian I tried this Lorol version of the webserver, but nope, same bad performance as with the others. A successful download about once in three trials!

Have you guys really tried to download 300k files? See my files comments here https://github.com/me-no-dev/ESPAsyncWebServer/issues/984#issuecomment-852263756 and in the source code, attached here: https://github.com/me-no-dev/ESPAsyncWebServer/issues/984#issuecomment-849659026

Am I the only one with highly reproducible problems in downloading big files?

gnalbandian commented 3 years ago

Have you replaced ESPAsyncTCP library aswell?

BlueAndi commented 3 years ago

Have you replaced ESPAsyncTCP library aswell?

For esp32 AsyncTCP must be replaced, ESPAsyncTCP is only for esp8266.

Funny ... I have my own fork of both AsyncTcp and ESPAsyncWebServer. :-D

BlueAndi commented 3 years ago

@ullix Just to be sure, but you don't have the TESTPUB_async_server.ino and etc. as a vscode + platformio project available?

BlueAndi commented 3 years ago

And again to be sure, can you ensure that the downloadPage() is not called during a already running download?

Currently it is not ensured:

static void downloadPage(AsyncWebServerRequest* request){

    Serial.println("2 webserver: downloadPage");
    deferredRequest = request;
}
ullix commented 3 years ago

Have you replaced ESPAsyncTCP library aswell?

@gnalbandian No, I am using the ESP32, and as was pointed out, I need the AsyncTCP for the ESP32. Do you have this as well?

have the TESTPUB_async_server.ino and etc. as a vscode + platformio

@BlueAndi Unfortunately I have not. I have installed pio, but so far failed to run it. :-( If you manage anything, I would appreciate if you provide a copy.

ensure that the downloadPage() is not called during a already running download?

it is presently ensured because the download takes only <1 sec, while the HTTP refresh is set for 10sec. Thus a >9 sec break before the next download.

Though, I may be mistaken, but I think the ESP32 Async server should manage also that? But presently there is no overlap.

The crashes are all of this pattern:

09:59:26.730 -> 3 Loop: deferredRequest handling
09:59:31.995 -> E (58919) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
09:59:31.995 -> E (58919) task_wdt:  - async_tcp (CPU 0/1)
09:59:31.995 -> E (58919) task_wdt: Tasks currently running:
09:59:31.995 -> E (58919) task_wdt: CPU 0: IDLE0
09:59:31.995 -> E (58919) task_wdt: CPU 1: loopTask
09:59:31.995 -> E (58919) task_wdt: Aborting.
09:59:31.995 -> abort() was called at PC 0x40140f68 on core 0

in most of them there is this (CPU 0/1); very few have (CPU 1). I think I have never seen (CPU 0).

I am attaching my 300k file. Unzip to use, it is highly compressible due to the simple content. I hope that some of you give it a try.

TEST300k.cam.zip

ullix commented 3 years ago

In increasing desperation I wanted to make sure that the choice of file system does not play an unexpected role. So in 3 subsequent runs I tested FFat, LittleFS, and SPIFFS. (Easy to do in my testcode, attached here: https://github.com/me-no-dev/ESPAsyncWebServer/issues/984#issuecomment-849659026 just change the FS in lines 28-30, and make sure to have matching partitions). In all runs I tried to download a 300k file.

    // FS: FFat:      File: 300.0 kB, Success:  25.0%
    // FS: LittleFS   File: 300.0 kB, Success:  26.8%
    // FS: SPIFFS     File: 300.0 kB, Success:  19.3%

This time only 1 in 4 or even 1 in 5 succeeds, all other attempts crash with WDT. FFat and LittleFS are basically the same. SPIFFS is a bit further behind, but this is not an option anyway, because I found SPIFFS to be a 'Flash Memory Catastrophe' as shown here: https://forum.arduino.cc/t/esp32-mit-spiffs-eine-speicher-katastrophe/621573

I am not seeing any further option as to change the Watch Dog Timer duration. Can anyone tell me how to do this in the Arduino framework? If I have to do it by hard-coding somewhere in the ESP* files, that'd be ok too.

BlueAndi commented 3 years ago

I used my esp32 project and downloaded your TEST300k.cam file 400 times without any reset under following conditions:

Hmm ...

ullix commented 3 years ago

Hmm ...

Hmmm. So true! This really freaks me out. Could you please say which OS you are using, and what the precise type of your ESP is?

Is there a chance I could get your code for testing?

I have searched for the file where all the Watch Dog Timers are defined, and it seems to be ~/.arduino15/packages/esp32/hardware/esp32/1.0.6/tools/sdk/include/config/sdkconfig.h

I looked up all #defines which had 'WDT' in them, and changed and then tested all values in 3x steps, like x1, x3, x10, x30. There was some minor response on some, but nothing ever near 100%. What I think being the most likely candidate is:

#define CONFIG_TASK_WDT 1

I guess this is in units of seconds? I tried 3, 10, 30, with success rates of 59%, 81%, 78%, 55%. It does not look like the 'WDT crashes' do even have anything to do with that WDT setting. Is that possible?

BlueAndi commented 3 years ago

I used the TTGO T-Display esp32 module for test and my current "unstable" branch of Pixelix. For the test I removed some images from /data/images folder, placed the test file to /data and added the following code snippet to the /src/Web/Pages::Init() function.

    (void)srv.on("/TEST300k.cam", HTTP_GET, 
        [](AsyncWebServerRequest* request)
        {
            request->send(SPIFFS, "/TEST300k.cam", String(), true);
        })
        .setAuthentication(webLoginUser.c_str(), webLoginPassword.c_str());

The task WD you are searching for is in the AsyncTcp: https://github.com/me-no-dev/AsyncTCP/blob/ca8ac5f919d02bea07b474531981ddbfd64de97c/src/AsyncTCP.cpp#L188-L207

Typically you can not just change the sdkconfig.h, because the code behind is already compiled and linked as a library. IMHO the Arduino IDE uses the library, equal to platformio.

I will create a simple platformio example, for easier testing and come back again.

BlueAndi commented 3 years ago

@ullix Can you try this one? https://github.com/BlueAndi/WebServerTest

ullix commented 3 years ago

Typically you can not just change the sdkconfig.h, because the code behind is already compiled and linked as a library. IMHO the Arduino IDE uses the library, equal to platformio.

To be sure of correct understanding: I made the various WDT changes in the file sdkconfig.h itself, not in any of my *.ino files! Unfortunately, Arduino has no 'recompile everything' command, but from what I see, Arduino does compile everything when some of those files gets changed. Correct me if this is wrong, but the fact that I did see responses to my changes, though not enough, suggests that my changed got compiled.

The task WD you are searching for is in the AsyncTcp: https://github.com/me-no-dev/AsyncTCP/blob/ca8ac5f919d02bea07b474531981ddbfd64de97c/src/AsyncTCP.cpp#L188-L207

This CONFIG_ASYNC_TCP_USE_WDT is only the flag to determine whether a WDT should be used or not. A WDT duration cannot be found in this lib, and I also do not see which WDT will be used. I guess it is one setting insdkconfig.h, but which one?

In AsyncTCP.h I found this segment related to WDT and core:

//If core is not defined, then we are running in Arduino or PIO
#ifndef CONFIG_ASYNC_TCP_RUNNING_CORE
#define CONFIG_ASYNC_TCP_RUNNING_CORE -1 //any available core
#define CONFIG_ASYNC_TCP_USE_WDT 1 //if enabled, adds between 33us and 200us per event
#endif

So I tried - all changes done within file AsyncTCP.h - all 4 possible combinations of core=0, core=1, WDT=0, WDT=1 with this outcome:

Looks like the problems originate even before AsnycTCP?

BlueAndi commented 3 years ago

To be sure of correct understanding: I made the various WDT changes in the file sdkconfig.h itself, not in any of my *.ino files! Unfortunately, Arduino has no 'recompile everything' command, but from what I see, Arduino does compile everything when some of those files gets changed. Correct me if this is wrong, but the fact that I did see responses to my changes, though not enough, suggests that my changed got compiled.

If a change in the sdkconfig.h causes changes in header files, yes. If it causes changes in the source files, no. :-) It may cause strange behaviour in case it will relate to changes in headers, combined with linked libs. Platformio e.g. uses the linked libraries from https://github.com/espressif/arduino-esp32. I am not sure about how the Arduino IDE deals with it.

This CONFIG_ASYNC_TCP_USE_WDT is only the flag to determine whether a WDT should be used or not. A WDT duration cannot be found in this lib, and I also do not see which WDT will be used. I guess it is one setting insdkconfig.h, but which one?

https://docs.espressif.com/projects/esp-idf/en/v3.3.5/api-reference/kconfig.html#config-task-wdt-timeout-s Note, may depend on your underlying IDF version.

If WDT=0 (i.e. switched off) then up to a handful of downloads are possible, but then the CPU is completely blocked no matter what core is used; does not even respond to input via serial monitor; does not recover even after > 10min.

I assume that its getting blocked, because queue handling blocks infinite. https://github.com/me-no-dev/AsyncTCP/blob/ca8ac5f919d02bea07b474531981ddbfd64de97c/src/AsyncTCP.cpp#L106-L116

ullix commented 3 years ago

@ullix Can you try this one? https://github.com/BlueAndi/WebServerTest

I did try, and the results are as bad as before.

I used 2 devices, both included as 'ESP32 Dev Module':

  1. a 'ESP32-WROOM-32E' (the very latest ESP version) with 16MB, of which only 4 MB were partitioned for SPIFFS.
  2. a 'ESP-WROOM-32', the oldest one I have in my box, with 4MB, all partitioned for SPIFFS.

I did manually trigger the downloads on Chromium. The browser shows success in its bottom line like this:

Auswahl_053 Out of my first 6 downloads 3 failed with "network error", which always goes hand-in-hand with a WDT triggering a crash. The Serial Monitor shows the typical result:

10:56:59.411 -> [I][main.ino:271] loop(): 2 - /testData/TEST300k.cam requested (uc2).
10:56:59.411 -> E (256068) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
10:57:04.807 -> E (256068) task_wdt:  - async_tcp (CPU 0/1)
10:57:04.807 -> E (256068) task_wdt: Tasks currently running:
10:57:04.807 -> E (256068) task_wdt: CPU 0: IDLE0
10:57:04.807 -> E (256068) task_wdt: CPU 1: loopTask
10:57:04.829 -> E (256068) task_wdt: Aborting.
10:57:04.829 -> abort() was called at PC 0x40140d5c on core 0

Your uc2 is download in loop context, while uc1 is download in webserver context. I tried this too, it is neither better nor worse.

Instead of downloading my data file and saving on the desktop, one can also download and display the file in the browser window. As my "binary" file consists of only printable-ASCII, it does display well on the screen.

Yet, no improvement. Roughly half the downloads are successful, the other half results in a WDT crash when approx half of the file has loaded.

Any ideas?

BlueAndi commented 3 years ago

Strange ... I used now my DevKit-v1 board: image

image

image

Can you erase the whole flash and program it again?

image

ullix commented 3 years ago

This is what I can offer:

Auswahl_054

Why are all speeds == 0?

I can't use PIO, what Flash erasing procedure do you have on mind? All I can do is partitioning and formatting.

BlueAndi commented 3 years ago

IMHO you can erase the flash with Arduino IDE too. See tools menu -> Erase flash -> All As long as we don't use the same setup, the investigation is quite hard.

ullix commented 3 years ago

Nope. Neither Arduino 1.8.13 nor 2.0 beta7 have such a menu option for ESP32. I believe it is something Board-specific. However, few things can be simpler that erasing the Flash on an ESP32:

To erase the entire flash chip (all data replaced with 0xFF bytes):
      esptool.py erase_flash

It finds everything else on its own. I did that, and look at the outcome:

ullix@crucialmx200:~/Code/Arduino/main$ ./test.sh uc1 10
using file:  TEST300k.cam
1 Successful: 0,792s 0kb/s
2 Successful: 0,747s 0kb/s
3 Successful: 0,734s 0kb/s
4 Successful: 0,751s 0kb/s
5 Successful: 0,748s 0kb/s
6 Successful: 0,767s 0kb/s
7 Successful: 0,760s 0kb/s
8 Failed.
9 Successful: 0,749s 0kb/s
10 Successful: 0,722s 0kb/s
Successful downloads: 9 out of 10
Failed downloads    : 1 out of 10.

So, something bad in the Flash. Who would have thought; given no error message of anything. I'll look into details tomorrow.

BlueAndi commented 3 years ago

I know that there is sometimes bad data stored, which is used by some of the underlying arduino/idf components. The wifi is such a component. Original as I started my project Pixelix, I investigated two weeks to catch a problem in wifi ap mode. The solution was easy (if you know it), just erasing the flash and disable storing wifi configuration in nvs.

ullix commented 3 years ago

Good thing I celebrated right away, because one day later there was nothing left to celebrate :-((.

Testing the download of files 50k and 90 always resulted in 100% success.

Actually, not quite true: sometimes the test code stalled forever. But after inserting a 1sec sleep between subsequent calls, all later runs went fine. Change to your test code:

    fi;
    sleep 1s
done

However, the 300k file, whether via uc1 or uc2 (i.e. by regular static call or by deferred call) had a success rate of something in the low 80%. No difference between the two. An overnight run of 1000 calls had a success of 83.3%; one in five downloads failed and resulted in a WDT triggered reboot.

I then repeated the erasure of the flash via esptool.py erase_flash and reran the tests. No improvement. Both uc1 and uc2 came out at the low 80+%.

So, a hiccup in the Flash is not the cause for these problems.

ullix commented 3 years ago

I have retested this with file systems SPIFFS, LittleFS, and FFat, and the outcome is basically the same in each one. Beyond a file size of 130k the frequency of WDT triggered crashes increases up to a file size of ~300k. Using deferred processing vs. static command mitigates the problems sometimes, but not always, and never heals it.

On Arduinoe IDE 1.8.13 I have now installed this exception decoder https://github.com/me-no-dev/EspExceptionDecoder Among the many crashes I see only 2 different Backtraces:

Backtrace: 0x400886e8:0x3ffbf890 0x40088965:0x3ffbf8b0 0x4013e2a8:0x3ffbf8d0 0x40086e79:0x3ffbf8f0 0x401503bb:0x3ffbc220 0x4013fba3:0x3ffbc240 0x4008b125:0x3ffbc260 0x40089976:0x3ffbc280
Decoding stack results
0x400886e8: invoke_abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c line 156
0x40088965: abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c line 171
0x4013e2a8: task_wdt_isr at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/task_wdt.c line 174
0x401503bb: esp_pm_impl_waiti at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/pm_esp32.c line 492
0x4013fba3: esp_vApplicationIdleHook at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/freertos_hooks.c line 63
0x4008b125: prvIdleTask at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/tasks.c line 3382
0x40089976: vPortTaskWrapper at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/port.c line 143

Backtrace: 0x400886e8:0x3ffbf900 0x40088965:0x3ffbf920 0x4013e368:0x3ffbf940 0x40086e79:0x3ffbf960 0x4015047b:0x3ffbc220 0x4013fc63:0x3ffbc240 0x4008b125:0x3ffbc260 0x40089976:0x3ffbc280
Decoding stack results
0x400886e8: invoke_abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c line 156
0x40088965: abort at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/panic.c line 171
0x4013e368: esp_task_wdt_init at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/task_wdt.c line 206
0x4015047b: nvs::HashList::HashList() at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/nvs_flash/src/intrusive_list.h line 33
0x4013fc63: esp_deregister_freertos_idle_hook_for_cpu at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp32/freertos_hooks.c line 113
0x4008b125: prvIdleTask at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/tasks.c line 3382
0x40089976: vPortTaskWrapper at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/port.c line 143

I really don't know what conclusions to draw from it. Can somebody help?

Is there any obvious thing to look at and investigate further?

zekageri commented 3 years ago

I have managed to drastically improve the performance of serving the files directly from LITTLEFS with deferred requests with queues. Btw i dont use serveStatic().

Here is an example how i did it.


// define the queue handler.
QueueHandle_t downloadQueue;

static const inline void Init_FileServing_Task(int Stack_Depth,int Core,int Priority){
    xTaskCreatePinnedToCore(fileServing_Task,"fileServing_Task",Stack_Depth,NULL,Priority,&fileServing_Task_Handle,Core);
}
void setup(){
  // First create a task just for serving the files.
  Init_FileServing_Task(6000,1,2);
  // Register the events before server begin
  downloadEvent();
}

void loop(){
  // Bunch of other function call in here...
  /* And i have like 6 or 7 task running at once. 
      One task out of the 6-7 is running on core 0 and it is a
      heavy continous modbus communication task with hardware UART  */
}

static const inline void downloadEvent(){
    server.on("/DownFile", HTTP_GET, [](AsyncWebServerRequest *request){
       // Put the request in the queue immidiately
      if( xQueueSend( downloadQueue , ( void * ) &request, ( TickType_t ) 10 ) != pdPASS ){
          // if for some reason (queue full etc..) we can't put the request to the queue, we send 500 response.
         request->send(500); // ( internal server error )
      }
    });
}

static const inline void deferredDownload_Handle(){
 // in the handle situation we proceed if we got anything in the queue.
 if( uxQueueMessagesWaiting( downloadQueue ) ){
    AsyncWebServerRequest* deferredDownload;
    if( xQueueReceive( downloadQueue, &( deferredDownload ), ( TickType_t ) 10 ) ){
      AsyncWebParameter* p = deferredDownload->getParam(0);
      String File = p->value();
      if( ( LITTLEFS.exists(File) && adminModeIsOn ) || ( LITTLEFS.exists(File) && isFileAllowed(File) ) ){
        deferredDownload->send(LITTLEFS, File, "text/plain", true);
      }else{
        AsyncWebServerResponse *response = deferredDownload->beginResponse(204, "text/plain", "No such file.");
        response->addHeader("Access-Control-Allow-Origin","*");
        deferredDownload->send(response);
      }
    }
  }
}

void fileServing_Task( void * parameter ){
    // create a queue with 10 item spaces. It is because the TCP socket can hold only 10 tcp requests at a time.
    downloadQueue = xQueueCreate( 10, sizeof( AsyncWebServerRequest* ) );
   // ever is defined as --> ( ;; )
    for ever {
        if(!Firmware_Is_Updating){
            deferredDownload_Handle();
        }
        vTaskDelay(1);
    }
}

This way the files served correctly. I wrote a js test sketch where i can define the file, the count of the downloads and the delay between downloads and i can select if the download request should start one after the other or to wait for the previous download response. And it seems very very stable. I did some tests with this, but not too much to be able to form a conclusion of it yet. I have no time for it now but i will test it further. For now, i got 1000 downloads and it was all success if the delay between downloads at least 30ms. The downloaded files was 10kb to 150kb. I haven't tested with bigger files yet, but i will because i will reach a point where i must serve 200 or even 300kb files so it is a must for me too.

ullix commented 3 years ago

@zekageri I wish you good luck, but so far I am not impressed.

With a file size of 130k I get 100% success consistently, no matter which of the two approaches (deferred or static) I take. With 150k I still get between 95% and 99% success. I wouldn't be surprised if there were a little scatter between some ESP32 devices (though I have not really seen anything in my collection of devices).

The "sweet spot" for problems are file sizes around 300k. I use the 300k file as attached here: https://github.com/me-no-dev/ESPAsyncWebServer/issues/984#issuecomment-852836070 as the bad guy for testing.

Success rate tends to become a little bit better for even bigger files. Why? No idea.

ullix commented 3 years ago

A strangely similar problem reported on the Espressif forum about 2 years ago, with no clear solution, as far as I an see:

ESPAsyncWebServer Fails When Serving Large Files https://www.esp32.com/viewtopic.php?f=19&t=9740

lbernstone commented 3 years ago

If you know you have files that regularly will take longer than 5 seconds to deliver, why not just increase the wdt timeout? It is not easy to see exactly where to apply this for serveStatic, but could certainly be done on your delayed response.

#include <esp_task_wdt.h>
...
esp_task_wdt_init(10,0);
// deliver the file
esp_task_wdt_init(5,1);
zekageri commented 3 years ago

Its not about the wdt. The server freezes. If you turn off the wdt completely the result is the same

zekageri commented 3 years ago

Btw iam using an esp32 wrover 8mb bare chip with psram

ullix commented 3 years ago

@lbernstone I have already pursued the idea of lengthening or even switching-off the WDT, as I showed here: https://github.com/me-no-dev/ESPAsyncWebServer/issues/984#issuecomment-855256610 and https://github.com/me-no-dev/ESPAsyncWebServer/issues/984#issuecomment-856509778

The result was a marginal improvement, if any at all!

However, your code suggestion is very clear and simple so I tried it in the uc2 (deferred handling in the loop) case. The results for downloading the 300k file (the baddest guy) for a good dozen times were all over the place, with successes in 60 ... 100(!) out of 100 tries. Whenever it failed, the ESP rebooted with the usual message:

11:26:43.118 -> E (21774) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
11:26:48.382 -> E (21774) task_wdt:  - async_tcp (CPU 0/1)
11:26:48.415 -> E (21774) task_wdt: Tasks currently running:
11:26:48.448 -> E (21774) task_wdt: CPU 0: IDLE0
11:26:48.472 -> E (21774) task_wdt: CPU 1: loopTask

Given that the panic flag in esp_task_wdt_init(10,0); was set to zero, I am surprised that it even crashed with WDT anyway? Shouldn't that have been switched off by this command? (https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-reference/system/wdts.html#_CPPv417esp_task_wdt_init8uint32_tb)

As a WDT of 10 sec did not seem enough, I increased it to 30 sec. Yet, no improvement! Increasing to 50sec, 100sec: almost every download attempt was a failure!

I have already wondered whether the length of processing triggering the WDT is the prime cause for the crash, and the WDT may be some secondary effect to quite a different problem. Maybe "the server freezes" as @zekageri suggests - well, what exactly do you mean by that? Could you put some more meat on the bones of that statement?