stm32duino / STM32Ethernet

Arduino library to support Ethernet for STM32 based board

Ethernet unstable, kind of deadlocking the controller; Nucleo F429ZI #3

Closed by pulli7 6 years ago

pulli7 commented 7 years ago

I'm experiencing some issues while using the F429ZI as an HTTP server (a quite simple application, where a few controller pins are set according to HTTP requests from clients).

It generally does what it is meant to do, but at some point clients cannot reach the server any more. If the client is a web browser, opening a new tab solves this temporarily, but sooner or later the controller gets kind of locked up, meaning it is completely unreachable, and even code in the main loop does not seem to get executed any more (tested by attaching a switch to one pin, which turns another GPIO on/off).

There does not seem to be a specific number of requests or clients leading to failure. Sometimes two or three requests from a single client already lead to a complete lock-up, sometimes it takes a few thousand from different clients, but at some point it always fails. One strange thing is that requests from a web browser seem to lead to failure much faster than requests from other applications like curl, especially when multiple clients are involved. Maybe some timing issue?

The behavior can be reproduced with the server example from the library. Just starting the server and then having, for example, five or six curl clients throw requests at it every few seconds makes it fail quite fast most of the time, especially when also sending requests from a web browser manually. I used this simple batch script for testing:

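rem Send 10000 requests in sequence, roughly one per second
for /L %%N IN (1, 1, 10000) DO (
    echo Number %%N
    t:\curl.exe 10.66.22.124
    rem Pinging localhost twice acts as a delay of about one second
    ping -n 2 127.0.0.1 > NUL
)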

I hope there is someone who can also reproduce this and is familiar with the Ethernet implementation, because I have absolutely no clue how to track this down...

ghost commented 7 years ago

I confirm the bug, but it is hard to reproduce.

If you have time to start tracking it down, you can enable the debug trace of LwIP.
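
A minimal sketch of what this could look like in lwipopts.h, assuming the standard lwIP debug option names (LWIP_DEBUG, LWIP_DBG_MIN_LEVEL, LWIP_DBG_TYPES_ON and the per-module switches). Which modules to enable is an illustrative choice, and the trace only becomes visible if the port routes LWIP_PLATFORM_DIAG to an output such as a serial port:

```c
/* lwipopts.h (fragment): illustrative selection of lwIP debug switches. */
#define LWIP_DEBUG          1                   /* master switch for LWIP_DEBUGF output */
#define LWIP_DBG_MIN_LEVEL  LWIP_DBG_LEVEL_ALL  /* do not filter by severity */
#define LWIP_DBG_TYPES_ON   LWIP_DBG_ON         /* enable all message types */

/* Per-module switches; anything not listed keeps its default (off). */
#define MEMP_DEBUG          LWIP_DBG_ON         /* memory pool allocation */
#define PBUF_DEBUG          LWIP_DBG_ON         /* packet buffer alloc/free */
#define TCP_DEBUG           LWIP_DBG_ON         /* TCP state machine */
```

Debug output over a serial port slows the stack down noticeably, so timing-dependent behavior may change while tracing is on.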

My time is limited currently, but I will take a closer look as soon as possible.

pulli7 commented 6 years ago

Took a look at the LwIP debug output in the meantime. It looks like some kind of memory leak: MEMP complains about "memp_malloc: out of memory in pool PBUF_POOL" as soon as the controller becomes unresponsive. PBUF also prints a message at that point. I am not quite sure about its meaning, but it looks like something like a buffer overflow or a problem clearing a buffer. I have to admit that I have no knowledge of the internals of TCP communication at all, so I could be completely wrong here. The PBUF message:

pbuf_free(0x8edfe)
pbuf_free: 0x8edfe has ref 65534, ending here.

Log files of debug output: TCP_Log.txt TCP_MEMP_log.txt PBUF_Log.txt
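
One way to read the ref 65534 value, assuming the pbuf reference count is a 16-bit unsigned field: 65534 is 0xFFFE, which is what such a counter shows after being decremented below zero. The sketch below uses a hypothetical fake_pbuf struct, not the lwIP API, to show how releasing a buffer more often than it was referenced produces this number:

```c
#include <stdint.h>

/* Hypothetical stand-in for a packet buffer; only the counter matters here. */
struct fake_pbuf {
    uint16_t ref; /* assumed 16-bit reference count */
};

/* Drop one reference without any underflow check and return the new count.
 * Wrapping below zero is well defined for unsigned types. */
static uint16_t fake_release(struct fake_pbuf *p)
{
    p->ref = (uint16_t)(p->ref - 1u);
    return p->ref;
}

/* Release a buffer that starts with `initial` references `times` times. */
static uint16_t release_n(uint16_t initial, unsigned times)
{
    struct fake_pbuf p = { initial };
    while (times-- > 0u) {
        fake_release(&p);
    }
    return p.ref;
}
```

Under these assumptions, release_n(1, 3) yields 65534: the first release legitimately frees the buffer and the two extra releases wrap the counter. That would point at a buffer being freed more than once, or at its memory being reused, rather than at a plain leak.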

I also noticed that the bug occurs when using a single TCP connection that is connected once and then held open, so it does not seem to be related to the process of connecting/disconnecting clients.

ghost commented 6 years ago

Thank you @pulli7 for your useful log files.

I reproduced this issue too, with the same message before the crash: "memp_malloc: out of memory in pool PBUF_POOL". What I can propose is to increase the memory allocated to LwIP in lwipopts.h, under the sections /* ---------- Memory options ---------- */ and /* ---------- Pbuf options ---------- */. You can "play" with MEM_SIZE, MEMP_NUM_PBUF and PBUF_POOL_SIZE. I can't help you more because it depends on your application.
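
As a rough sketch of the kind of change meant here, assuming the option names above plus PBUF_POOL_BUFSIZE from the same file. The numbers are illustrative, not the library defaults, and the helper pbuf_pool_payload_bytes is hypothetical. It only estimates the payload RAM of the pbuf pool and ignores per-buffer headers and alignment:

```c
#include <stddef.h>

/* ---------- Memory options ---------- */
#define MEM_SIZE            (16 * 1024) /* lwIP heap in bytes (illustrative) */
#define MEMP_NUM_PBUF       16          /* pbuf structs without payload (illustrative) */

/* ---------- Pbuf options ---------- */
#define PBUF_POOL_SIZE      16          /* number of buffers in PBUF_POOL (illustrative) */
#define PBUF_POOL_BUFSIZE   1524        /* bytes per pool buffer (illustrative) */

/* Hypothetical helper: lower bound of the RAM taken by the pbuf pool payloads. */
static size_t pbuf_pool_payload_bytes(void)
{
    return (size_t)PBUF_POOL_SIZE * (size_t)PBUF_POOL_BUFSIZE;
}
```

With these values the pool payloads alone take about 24 kB, so every increase has to fit into the RAM left over by the sketch.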

Let me know if you have fixed this issue.

pulli7 commented 6 years ago

Just to be clear: I first discovered the bug in my own application, but everything I describe here was done with the WebServer example that comes with the library! The only modifications were changing the IP address and, for the logs, removing the serial prints. It is definitely a general issue, not limited to my own sketch.

I did some testing with the memory options. They have some impact on the behavior, but I was not able to solve the bug. The average time until a crash occurs increases significantly when tuning the values up (PBUF_POOL_SIZE seems to have the biggest impact). But even with extreme settings, like making all three values four times bigger than the defaults, I still often get cases where the crash occurs after fewer than 100 requests...

ghost commented 6 years ago

I upgraded the LwIP library to version 2.0.3. Can you test with the new version?

Furthermore, just for information, a link about the memory configuration of the LwIP stack.

pulli7 commented 6 years ago

Thank you for your effort to solve this issue. I will not be able to get my hands on the board during the next two weeks, as I am in another place right now. But I will definitely do some more detailed testing with the new version, and report back here, as soon as I return.

pulli7 commented 6 years ago

Tested LwIP 2.0.3 with the Webserver sketch from the library, sadly the behaviour is the same as before...

I then enabled stats_display(), as described in the information about LwIP's memory configuration, to take a closer look at the buffers. There seems to be a problem with MEM TCP_PCB: it fills up completely very quickly. Maybe a problem with closing connections? Just guessing here...
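
For anyone repeating this measurement, a minimal sketch of the lwipopts.h switches involved, assuming the standard lwIP statistics options. The display function is declared in lwip/stats.h and prints through LWIP_PLATFORM_DIAG; where it gets called from (for example once per second from loop()) is left to the sketch:

```c
/* lwipopts.h (fragment): enable lwIP statistics and their text output. */
#define LWIP_STATS          1   /* collect statistics at all */
#define LWIP_STATS_DISPLAY  1   /* compile in the stats display function */
#define MEM_STATS           1   /* heap usage, printed as "MEM HEAP" */
#define MEMP_STATS          1   /* pool usage, e.g. "MEM TCP_PCB" */
```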

When I continue to send requests after MEM TCP_PCB is full, the heap also fills up completely. Its usage even exceeds the configured maximum, likely corrupting the RAM until something critical is hit and the controller crashes.

Logfile: (only one single Google Chrome tab used as client) LogStat_display.txt

I tried giving LwIP more heap and increased the MEMP_NUM_TCP_PCB define to see whether the growth in memory usage stops at some point, but that does not seem to happen. MEM TCP_PCB always fills up, no matter how many simultaneous connections I allow. And even a fairly big heap like 50 kB overflows after some time and leads to a crash:

Logfile: (again one single Google Chrome tab as a client) Log_Heap=50kB_TCP_PCB=20.txt

One other thing I noticed, which I think is not that relevant to the issue but want to mention: setting really large values for MEM_SIZE does not work, as it leads to problems with memory allocation. I get the error "mem_malloc: could not allocate _xx_ bytes". Logfile: Log_Heap=140kB_TCp_PCP=50.txt

ghost commented 6 years ago

Hi @pulli7

It took some time, but I think I have something that can resolve your issue. Could you please try PR #5?

With this fix, MEM HEAP no longer keeps growing until the crash.

One more clarification: MEM TCP_PCB increases until it reaches its maximum, but this is normal because the TCP stack needs about 2 minutes before it deletes a pcb. If the LwIP stack needs to allocate a new pcb, an old one will be removed sooner (if in CLOSED state).
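
The "about 2 minutes" figure is consistent with a TCP TIME_WAIT period of twice the maximum segment lifetime. A back-of-the-envelope sketch, assuming a maximum segment lifetime of 60 s (believed to be the lwIP default for TCP_MSL, worth checking against the version in use) and an assumed rate of one new connection every 2 seconds; both function names are hypothetical:

```c
#include <stdint.h>

/* Time a closed connection lingers before its pcb is deleted: 2 * MSL. */
static uint32_t time_wait_ms(uint32_t msl_ms)
{
    return 2u * msl_ms;
}

/* Number of pcbs still lingering in steady state when a new connection
 * is opened and closed every interval_ms milliseconds. */
static uint32_t lingering_pcbs(uint32_t msl_ms, uint32_t interval_ms)
{
    return time_wait_ms(msl_ms) / interval_ms;
}
```

With msl_ms = 60000 and interval_ms = 2000 this gives 120000 ms and 60 lingering pcbs, far more than a small MEMP_NUM_TCP_PCB. A pool that shows as full is therefore expected, with old entries being recycled on demand.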

Please keep me informed.

pulli7 commented 6 years ago

Thank you very much @fprwi6labs!

Tested this morning, and it is now perfectly stable! With the Webserver example and default settings in lwipopts.h, heap usage constantly stays below 2 kB for me.

Issue can be marked as resolved.

fpistm commented 6 years ago

Thanks @pulli7 for your tests and @fprwi6labs for the fix. I will merge it once the cb mechanism has been removed from the PR.