nopnop2002 / esp-idf-ftpClient

ftp client for esp-idf
50 stars 8 forks source link

ftpClientConnect sometimes hangs #7

Closed yehj0107 closed 3 months ago

yehj0107 commented 2 years ago

I am using this FTP client library together with ESP32 module with IDF version 4.4.2. I put 10 devices to test, and 3 of them the stuck in the connectFtpClient function after operating one day. Is there any hint or method to further debug?

nopnop2002 commented 2 years ago
device 1 connect to router.
device 2 connect to router.
.........
device 16 connect to router.
device 17 connect to router. ---> device 1 is disconnected from router.
idea--list commented 2 years ago

I am facing similar issue. When running the example bundled to this repo everything is fine even for several hours. I inserted the functionality into my own code as carefully as i could, but somehow in that case the execution always hangs after about 50-60 minutes of running. What i am trying to do is to send sensor data every minute (FTP transaction takes 3-5 seconds each time, so not heavy duty IMHO). Increasing stack sizes improves a bit (runs then about 3 hours long) but does not seem to really solve the issue.

Turned DEBUGGING level to 2 and the ESP32 almost always hangs right after it has logged in to the FTP account (rarely it hangs during the data transmission). First i thought it occurs only when powered from 18650 cell, but it also occurs when powered via USB. Tried 2-3 different board types. Have a hard time figuring this out.

idea--list commented 2 years ago

Have some further info and observations to share:

In ESP-IDF i turned on brownout detection and set to 2.8V (highest option). When program execution hangs, the board is not being restarted by the brownout detector. So the cause is most probably not a brownout event.

I have several tasks in my program:

I can run 360 cycles if each cyle is 30 seconds long (resulting files are 30KB) without the program logic hanging. However if i try to do 720 cycles, the logic never gets to the end, it always just hangs after about 3 hours. Free heap size varies between 48-57 KB right after logging in to FTP, lowest heap size ever is reported as 15.5 KB (lowest heap occurs many cycles before hanging occurs). I face hanging also if i try to record 180 x 60 seconds (resulting 60KB files), despite i do not face any issues when trying 360 cycles of 30 seconds, which result the same duration...

When the program hangs, the panic handler is not invoked, the board does not restart, nothing is printed in the terminal... literally just nothing happens in any tasks described above. Even the FTP conncection does not timeout to let the logic proceed further .... for my biggest surprise however i find that the SNTP task (created by sntp_init() API in the background inside app_main when we connected to WiFi) is still running once per hour (as set for SNTP in menuconfig).

So to sum it up i have no idea why this hanging occurs, but the cause is not catched by the panic handler, it is not a brownout, stack smashing detection also does not report anything. Just the CPU stops running any user defined tasks... while SNTP time update is still being run with the interval set for that.

This really puzzles me and can not debug it further as i only began with ESP32 chips 3 months ago.

@nopnop2002 do you have any idea what might be causing such behaviour? Would it be possible to cancel any ftp functions when nothing happens for a while after a successful ftp login or when there is no progress during an ftp transaction for a while? I guess these would allow to overcome this issue (but i might be wrong)

nopnop2002 commented 2 years ago

@idea--list

can you share your code?

idea--list commented 2 years ago

@nopnop2002 thanks for the reply! Meanwhile i made some code review of my own code and found a possible infinite loop in case the sensor would stop triggering interrupt signals and thus the counter for remaining samples would not decrement anymore. Now i am testing to decrement the counter with a const variable. Will report back in 1 day if this solves the issue, in which case the bug is in my code.

idea--list commented 2 years ago

@nopnop2002 The freezing also occurs if the counter of remaining samples is detached from the sensor and is decremented by a const value every X milliseconds. When the issue occurs, it is also impossible to trigger the interrupt handler by manipulating the pin level to which the ISR is attached.

I would not post my code publicly, do you have an email address?

nopnop2002 commented 2 years ago

@idea--list

One way to keep your code private is to make it a private repository.

There is a function called "GitHub Security Advisories for repositories", but I don't know how to use it.

https://docs.github.com/en/code-security/repository-security-advisories/about-github-security-advisories-for-repositories

idea--list commented 2 years ago

@nopnop2002 The facts nothing useful is printed on the console and it takes at least 3-4 hours until the issue occurs make debugging my 2000 lines harder than i would wish, but i can confirm the issue also occurs if i do not use this repo in my code. So the error is either in wifi handling of esp-idf or somewhere in my user code, not in this repo. Sorry if my report took much of your time.

nopnop2002 commented 2 years ago

@idea--list

If you try to connect to an FTP server while Wi-Fi is disconnected, you may have to wait until the TCP Socket timeout.

TCP Socket timeout is approximately 3 minutes.

At first glance, it looks like the app is hanging.

idea--list commented 2 years ago

Turns out ESP32 GPIO interrupt handling is buggy. This explains everything i observed. Will have to reimplement my code.

nopnop2002 commented 2 years ago

One solution is to use Queues instead of interrupts.

This is the code I use often.

typedef struct {
    uint16_t command;
    TaskHandle_t taskHandle;
} CMD_t;

void buttonA(void *pvParameters)
{
    ESP_LOGI(pcTaskGetName(0), "Start");
    CMD_t cmdBuf;
    cmdBuf.command = CMD_PUSH_A;
    cmdBuf.taskHandle = xTaskGetCurrentTaskHandle();

    // set the GPIO as a input
    gpio_reset_pin(GPIO_INPUT_A);
    gpio_set_direction(GPIO_INPUT_A, GPIO_MODE_DEF_INPUT);

    while(1) {
        int level = gpio_get_level(GPIO_INPUT_A);
        if (level == 0) {
            ESP_LOGI(pcTaskGetName(0), "Push Button");
            while(1) {
                level = gpio_get_level(GPIO_INPUT_A);
                if (level == 1) break;
                vTaskDelay(1);
            }
            xQueueSend(xQueueCmd, &cmdBuf, 0);
        }
        vTaskDelay(1);
    }
}
idea--list commented 2 years ago

Thanks for the suggestion @nopnop2002 , reimplemented my code. Basically i converted edge type interrupts to level type interrupts using a Queue. This solution should be safe based on the silicon errata documentation of the ESP32 chip. Have run several 6 hours long tests: if i do not include this repo in my project my code runs 6 hours long without any issues (tested several times) even if i use WiFi connection and SNTP. This was not the case before making the change, so there is some improvement. To be absolutely sure i will still need to run multi-day tests.

However as soon as i am trying to use this repo to transfer previously completed files from the SD card via FTP, while i am keep recording sensor data to another file then i still face the same symptom: the program hangs after 1-3 hours without any errors/warnings. This might be a livelock issue also described in the errata, but i am not sure. Will need to test what happens if i do not record data while transferring over FTP, but that will need some more days...anyway it seems now there is no bug in your repo, but maybe there are some HW issues with the ESP32 in general that need a workaround.

nopnop2002 commented 2 years ago

I remember that there was some problem when executing SD card IO and WiFi communication at the same time, but I can't remember.

output_task (stack size: 16384, priority: 4, CPU: 0) - this creates files into which sensor data is being logged, after a given number of samples are recorded and saved to the file, the file is closed and the next file being created

Which SD card are you using?

SD-SPI SD-MMC

idea--list commented 2 years ago

I am using microSD cards (32 GB SanDisk) interfacing it over SD-SPI implementation using ESP32's SPI3 bus (pins 5, 18, 19, 23). I know SD cards have their own maintenance working in the background which result HUGE write latencies once a while in a completely unpredictable pattern, which gets worse as there are more files stored on the card (which is the case during longer tests). However i am using binary file to reduce this effect and also the sensor instructs the MCU to grab the data via an interrupt and sensor data is stored into a ringbuffer first (which never runs out of empty slots). Saving those data to the SD card occurs in a separate task, this way write latencies of the SD should not stop acquisition of sensor data.

I also reduced stack sizes of my tasks, they never run out of memory (and even if they would, that would trigger a restart).

I know there is an issue with ADC2 pins while using WiFi, but my sensor uses GPIO33 for signaling the interrupt which is an ADC1 pin, so that bug should not play any role here. Will do more tests, reduce my current code to the minimal amount needed to reproduce this and then i will try to upload it into a private repo. All this will need several days.

nopnop2002 commented 2 years ago

Here are the SD IO benchmarks.

https://github.com/nopnop2002/esp-idf-sd-bench

Obviously, SD-MMC is known to be faster.

idea--list commented 2 years ago

Thanks for the tip, will also test using SDIO implementation. Will take a while to do all the tests i have on my plan.

nopnop2002 commented 2 years ago

I am using microSD cards (32 GB SanDisk) interfacing it over SD-SPI implementation using ESP32's SPI3 bus (pins 5, 18, 19, 23).

sensor_task (stack size: 16384, priority: 5, CPU: 1) - retrieves sensor readings

Does your sensor use SPI?

idea--list commented 2 years ago

Yes, i use the same SPI bus for the sensor and for the SD card. The SPI bus is shared by 3 tasks and 2 devices. Meanwhile i tested what happens if i cut each cycles into 2 parts: an active phase (seconds during which i store sensor data to SD) and an inactive phase (during which no data is being saved to the SD, and the FTP transfer is done here). The sensor is active and delivering data over SPI in both phases. During 6 hour tests (360 x 30+30 seconds) this works without the issue. So until there are only 2 SPI processes the ESP32 can handle them. Introducing a third SPI process for short periods seems to work only for 1-3 hours but after that the system always hangs in a strange state.

Next i will try to acquire/release the bus during sensor readouts, hope that fixes the issue without breaking the file operations.

nopnop2002 commented 2 years ago

i use the same SPI bus for the sensor and for the SD card.

This may not be related to this issue, but I've had this problem before.

https://github.com/espressif/esp-idf/issues/6510

There is a way to separate the SPI bus with the sensor and the SD card.

idea--list commented 2 years ago

Acquiring/releasing the SPI bus for the duration of sensor readings has some effect: right after flashing modified code, i was facing the issue just after 12 minutes. Could not observe the issue this soon before. Anyway made a system reset to give it another try. Doing so code was running fine for 300 minutes. The issue always occurred in the first 180 minutes until now...

So bus acquisition on that section seems to have some influence but for sure does not fix the issue, however byte sizes of the files did not change, so file operations are probably not affected by bus acquisition.

As the code runs for 6 hours if there were either only fwrite (saving data to file on SD) or only fread (FTP transfer) calls even without bus acquisition for the sensor readings, despite continous sensor readings on the SPI bus, i am beginning to get the impression that fwrite and fread operations may interfere in case they execute for longer periods. Even if they operate on different files (fwrite stores data to file A, while fread reads data from file B as part of the xfer() function of the FTP transfer).

I am not an expert regarding file operations on low level, but my understanding is that filestream buffer collects data until it is full. In that moment it will try to write data to the file immediately. Based on SD latency it can vary how long it takes to finish file write operations. If this is true, i may not lock the SPI bus during file reading in the FTP transfer as that would prevent filestream buffer to save incoming sensor data immediately when it is needed. However inside xfer() function the FTP logic uses fread in a while loop to read in data from the SD card. Now i think i should modify that part to prevent freads while there is being data written to the SD card. What is your opinion?

nopnop2002 commented 2 years ago

As far as I know SD-SPI is not thread safe. If you have two tasks reading and writing at the same time, you will get an error like this.

E (872011) sdmmc_cmd: sdmmc_read_sectors_dma: sdmmc_send_cmd returned 0x107
E (872021) diskio_sdmmc: sdmmc_read_blocks failed (263)
E (872021) sdmmc_cmd: sdmmc_write_sectors_dma: sdmmc_send_cmd returned 0x105
E (872031) diskio_sdmmc: sdmmc_write_blocks failed (261)
E (872031) READ: Failed to read file
E (872041) WRITE: Failed to write file

It seems to depend on the size of the SD card. https://github.com/espressif/esp-idf/issues/9289


I'm not sure about your entire application and the size of a single file, but if you don't need a local file after putting the file on an FTP server, I think Flash's file system is fine.

idea--list commented 2 years ago

Took a closer look what exactly xfer() does. I have the impression the while loop for FTP write operations always sends data to ftp if fread returns a higher value than 0, but while doing so it does not check for ferror or feof states. So theoretically an error during fread can remain unnoticed. Will try to implement some checks there and see if that catches anything when hanging occurs.

Right now i am still running a long test with WiFi and FTP completely disabled to confirm converting from edge type interrupts to level interrupts really solved a part of the issue. If so then the electronical signals and their timing should be in normal range at least while WiFi is not used (i have no oscilloscope to actually explore the signals). Stay tuned...

idea--list commented 2 years ago

Finally i could narrow down the issue. It seems that the fix suggested in the errata for the edge-type GPIO interrupt issue needs that at least 1 core can instantly react to the ISR which resumes to a high priority task to handle the situation. For me it seems that sometimes the sensor interrupt signal arrives just in a moment when some WiFi interrupt occupies core 0, while core 1 is busy doing some FTP stuff and also context switching is slower on core 1 than usual. I can not confirm this as i do not have an oscilloscope. Other 2 possible causes might be the live lock issue mentioned in the errata and maybe some priority inversion. Unfortunately none of these is easy to debug for me.

I am almost done reducing the code to the needed minimum, will upload it tomorrow.

idea--list commented 2 years ago

I'm not sure about your entire application and the size of a single file, but if you don't need a local file after putting the file on an FTP server, I think Flash's file system is fine.

The sensor is sampling at 512Hz and yields 60KB data every minute, ca. 100MB a day. The idea is that the user can set the duration of the files. SD card was chosen to prevent flash memory in the ESP32 from wearing out in just some weeks (plan to move also the ringbuffer to RTC memory once i get rid of the hanging issue, that part is normal ram, which does not wear out as fast as flash does)

idea--list commented 2 years ago

And interestingly i face the issue also if i transfer a 50 byte dummy file instead of the real sensor data. So the size of the payload to be transferred does not seem to matter.

idea--list commented 2 years ago

@nopnop2002 Just created a private repo and invited you as a collaborator. Have never done this before, hope it will work as expected. That way you can have a better insight if you wish so.

It seems i managed to solve the issue, at least it finished a 6 hour long run without any issue3 times. A longer run is on the go... You will find some instructions at the top of main.cpp.

Long story short: the workaround suggested in the errata needs a core that can react instantly whenever needed. WiFi is running on core 0 by default, FTP seems to be heavy on core 1 and context switching is occasionally not fast enough and in such moments the issue manifests again despite the proposed fix has been applied. It seems by pinning FTP to core 0 it can coexist with WiFi logic, while the workaround should be pinned to core 1 and further care should be taken to avoid priority inversions as even those would reintroduce the issue.

Took quite a while to debug but learned a lot and maybe other people can also benefit from reading these messages. Thanks for your help!