martin-ger / esp-open-lwip

ESP8266 lwIP library with NAT, SLIP, ENC28j60 Ethernet, and routing support
Multiple fixes #12

Closed xxxajk closed 5 years ago

xxxajk commented 5 years ago

This merge request includes the following: add RFC 1191 compliance for forwarding packets; correct MAX_FRAMELEN, which was not set correctly; remove dead code and an unused conditional; add a script to quickly compile with some predefined options.
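
For readers unfamiliar with RFC 1191: a router that cannot forward a datagram because it is larger than the next-hop MTU, and the datagram has the "don't fragment" bit set, drops it and answers with an ICMP Destination Unreachable message (type 3, code 4) that carries the next-hop MTU. The following is only a sketch of that decision, not the code from this merge request; `fwd_action_t` and `fwd_decide` are made-up names and are not part of lwIP.

```c
#include <assert.h>
#include <stdint.h>

#define IP_FLAG_DF 0x4000u  /* "don't fragment" bit in the IPv4 flags/offset field */

/* Hypothetical result type for this sketch; not an lwIP type. */
typedef enum {
    FWD_SEND,        /* packet fits: forward unchanged */
    FWD_FRAGMENT,    /* too big, DF clear: fragment, then forward */
    FWD_ICMP_TOOBIG  /* too big, DF set: drop and send ICMP type 3 code 4 */
} fwd_action_t;

/* Decide what a forwarding router does with a datagram of total_len bytes.
 * flags_off is the 16-bit flags/fragment-offset field in host byte order.
 * On FWD_ICMP_TOOBIG, *icmp_mtu receives the MTU to report to the sender. */
static fwd_action_t fwd_decide(uint16_t total_len, uint16_t flags_off,
                               uint16_t next_hop_mtu, uint16_t *icmp_mtu)
{
    assert(icmp_mtu != 0);
    *icmp_mtu = 0;
    if (total_len <= next_hop_mtu)
        return FWD_SEND;
    if ((flags_off & IP_FLAG_DF) == 0)
        return FWD_FRAGMENT;
    *icmp_mtu = next_hop_mtu;  /* RFC 1191: report the next-hop MTU */
    return FWD_ICMP_TOOBIG;
}
```

With a 1400-byte next hop, a 1500-byte datagram with DF set yields FWD_ICMP_TOOBIG and a reported MTU of 1400, so the sender can lower its path MTU estimate.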

All that seems to be left is to find out why WiFi interface dies when heavy traffic forwards to/from the enc28j60 interface. Once that is done I would consider it no longer experimental.

xxxajk commented 5 years ago

I've seen this "WiFi death on overload" issue before, FYI. The "fix" that has worked for me in the past was to switch to SDK 2. We should probably sync 2.0.0 experimental with this branch, as I believe the problem is on the closed source side; otherwise I think we'll be doomed to pounding our heads into the wall attempting to locate a workaround.

martin-ger commented 5 years ago

Thanks a lot - just integrated this with the esp_wifi_repeater and at least without high load experiments it works great!

What do you mean by SDK 2.0.0? My development environment uses the esp-open-sdk with the Espressif NONOS SDK 2.1.0-18-g61248df, the latest commit from pfalcon.

Maybe we should try this: https://github.com/piersfinlayson/esp-open-sdk (pending pull request)?

Cheers and thanks again!

xxxajk commented 5 years ago

I refer to https://github.com/martin-ger/esp-open-lwip/tree/sdk-2.0.0-experimental, is this an obsolete branch? The current branch is 1.5.0...

martin-ger commented 5 years ago

Okay, but this is just a newer lwip, all open source, isn't it? It doesn't include the newer closed WiFi-drivers and it is quite old. My guess would be to try the most recent NONOS SDK version.

xxxajk commented 5 years ago

Yea, I think the latest SDK may be of great help here too. I'll try that.

By the way, I think it helped a lot that I have written my own IP stack several years ago. This was because I did not like how all the other ones worked, and the fact that I could not run them within the RAM available on a Z80 on CP/M 2.x/3.x, MP/M and a few other systems. I have since ported it to Arduino. :-) Only caveat is that it currently only supports SLIP... With a little work, adding interfaces, it could replace lwIP, etc.

It can run in polled or interrupt mode and is MCU/CPU/OS/NOOS agnostic, providing the MCU is little endian, and that there is some kind of file system or file system emulation (such as EEPROM) to read the settings. Curious? https://github.com/xxxajk/ajkstack Just needs to be taught new interfaces, and it will "just work". The largest benefit is that it follows more-or-less the same interface you would find on a modern UN*X OS, like Linux.

edit: looks like I am using the latest as well... sdk -> ESP8266_NONOS_SDK-2.1.0-18-g61248df/

martin-ger commented 5 years ago

Quite impressive. The ENC driver or even the WiFi driver of the ESP are probably a good starting point for a netif.

Actually, in the late 80s I also maintained a TCP/IP stack - written in Modula2 and part of the experimental BirliX operating system. So I have some old experience in debugging this kind of SW as well... :-)

ESP8266_NONOS_SDK-2.1.0-18-g61248df is the latest with the esp-open-sdk from the original maintainer pfalcon. Espressif has released newer versions since then: 3.0 is the latest.

xxxajk commented 5 years ago

Been scraping the code Arduino uses... Has some interesting bits and pieces, will let you know what helps out. I have a hunch that there is a one character fix that might work out... will let you know in a few minutes.

xxxajk commented 5 years ago

Still poking about, seems like packets are coming in the ethernet too fast, so I will do a bit more testing by placing some kind of extra delay in the ISR.

xxxajk commented 5 years ago

Story so far... eth -> WiFi, stalls a bit, but works great! can operate as fast as you can pump. WiFi -> eth == spiral of death. :-(

Not sure what to make of this behavior yet, but I am considering attempting to do some kind of throttle, where inbound WiFi packets are tossed on the floor after some threshold. Strange thing is that the TX side returns 1. Very odd.
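
A throttle of the kind described here could be sketched as a counter of packets that have been accepted from WiFi but not yet sent out on Ethernet, with anything above a threshold dropped. This is only an illustration of the idea; `throttle_t` and its functions are hypothetical names, and nothing like this is confirmed to exist in the repeater code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical throttle state; names are illustrative only. */
typedef struct {
    uint16_t in_flight;  /* packets accepted but not yet transmitted */
    uint16_t limit;      /* drop threshold */
    uint32_t dropped;    /* statistics: packets tossed so far */
} throttle_t;

static void throttle_init(throttle_t *t, uint16_t limit)
{
    t->in_flight = 0;
    t->limit = limit;
    t->dropped = 0;
}

/* Call for each inbound packet. Returns false if it should be dropped. */
static bool throttle_accept(throttle_t *t)
{
    if (t->in_flight >= t->limit) {
        t->dropped++;
        return false;
    }
    t->in_flight++;
    return true;
}

/* Call once a previously accepted packet has left the output interface. */
static void throttle_done(throttle_t *t)
{
    assert(t->in_flight > 0);
    t->in_flight--;
}
```

Dropping at the input keeps the backlog bounded, and TCP senders on the WiFi side would be expected to slow down in response to the loss.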

martin-ger commented 5 years ago

My guess would be a buffer problem. Buffers are allocated by the WiFi driver and maybe lost in an overflow situation. Is the free heap space going down?

xxxajk commented 5 years ago

No, I put in an os_printf to show heap, and it is never less than 20K :-/

xxxajk commented 5 years ago

BTW, code dies here https://github.com/martin-ger/esp_wifi_repeater/blob/master/user/user_main.c#L577 Returns a value of 1 forever. I guess that means the WiFi died.

martin-ger commented 5 years ago

This is the call to the original output function of the WiFi-driver.

As far as I understand, the packet coming from the ENC driver is driven through the whole stack by the ENC interrupt handler. Maybe this causes the problem (e.g. with an interrupt during an interrupt). The original code of the WiFi driver doesn't expect that there is any other interrupt in the stack.

That is one reason why my initial approach was to schedule sending in a separate task. It is the same thing I do in my 'esp_slip_router' with the UART netif. However, that would be a major restructuring...

BTW: what is your test scenario for these load tests?

martin-ger commented 5 years ago

Okay, when I find some time this week, I will try the following: 'enc28j60_handle_packets()' enqueues the pbuf into a packet queue, schedules a task, and returns (instead of calling 'enc_netif.input(p, &enc_netif)' directly). A separate enc28j60_poll() proc will dequeue the pbufs and call ethernet_input on each of them. It is called by a user task.
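
The hand-off described above might look roughly like the single-producer, single-consumer ring buffer below. It is a sketch under assumptions, not the code on the "enc_polling" branch: `pktq_t`, `pktq_put`, `pktq_get` and the queue size are invented for illustration, and the slots hold plain `void *` where the real driver would presumably store `struct pbuf *`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PKTQ_SIZE 8u  /* must be a power of two; the value is an assumption */

/* Hypothetical packet queue shared between the interrupt and a task. */
typedef struct {
    void *slot[PKTQ_SIZE];
    volatile uint8_t head;  /* written only by the producer (interrupt) */
    volatile uint8_t tail;  /* written only by the consumer (task) */
} pktq_t;

/* Producer side. Returns 0 on success, -1 if the queue is full
 * (the caller is then expected to free the packet). */
static int pktq_put(pktq_t *q, void *pkt)
{
    assert(pkt != NULL);
    if ((uint8_t)(q->head - q->tail) == PKTQ_SIZE)
        return -1;
    q->slot[q->head & (PKTQ_SIZE - 1u)] = pkt;
    q->head++;
    return 0;
}

/* Consumer side. Returns the oldest packet, or NULL when the queue is empty. */
static void *pktq_get(pktq_t *q)
{
    void *pkt;
    if (q->head == q->tail)
        return NULL;
    pkt = q->slot[q->tail & (PKTQ_SIZE - 1u)];
    q->tail++;
    return pkt;
}
```

Because each index is written by only one side, the interrupt handler never has to wait for the task; it only enqueues and returns, and the task drains the queue and passes each packet up the stack outside interrupt context.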

xxxajk commented 5 years ago

Sorry for the late reply, had to take a nap :-) Basically, right now I am using:

Android <---WiFi ---> esp + enc <--- Ethernet --> Linux laptop

Then using ADB to push and pull a multi-megabyte file as the test. You could use anything on the WiFi side. I'll be setting up another laptop to do more testing tonight, as it might be of use, but I doubt it will make a difference, since Android is Linux anyway, but at least I'll get extra debugging tools.

xxxajk commented 5 years ago

Also, I like the idea of the scheduled task... you may want to do the same on the WiFi side when it has a packet come in too. When debugging sometimes I see overlapping printf output, indicating that there could be some kind of race condition going on there too.

martin-ger commented 5 years ago

I got the basic structure in place and receiving just works fine.

My current problem is in espenc.c 'enc28j60_link_output', I think...

When I read the EIR register after transmit, I often get a 0xff (this includes the TX ERROR INTERRUPT bit). However, it does transmit correctly, but without a further delay afterwards, the enc28j60 stops working. Putting in an os_delay_us(1000) makes it somewhat work... Some sync problem, but no idea today.

I pushed the changes to a new branch in both the lwip and the repeater repository named "enc_polling". Feel free to have a look at it...

martin-ger commented 5 years ago

Did some further timing voodoo in enc28j60_link_output around the waiting for transmit (and a reset sequence seen in another driver) - now it runs fairly stable. But I still don't know exactly what I am doing...

xxxajk commented 5 years ago

I'd rather be lucky than just poking around anytime. I'll check it out.

xxxajk commented 5 years ago

Closer, I'll try bandwidth limiting and see if that helps. Now back to the E:M messages.

E:M 26064
E:M 65536
E:M 51072
E:M 60056
E:M 65616
E:M 65616

xxxajk commented 5 years ago

Doesn't seem to help much. Basically the wifi send is what is falling apart. I put a loop test for retry to force reboot on failure on the WiFi side, which returns 1 when the WiFi dies. e.g.

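    err_t rv;
    /* Retry for as long as the WiFi driver returns 1. Once WiFi is dead
       this loop never exits, so the watchdog resets the chip. */
    do {
        rv = orig_output_ap(outp, p);
        // os_printf("TX Returned %d\r\n", rv);
    } while (rv == 1);
    return rv;
    // return orig_output_ap(outp, p);
}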

Which will cause the WDT to force a reboot. I can already hear you saying "but... but..." It doesn't matter at this point anyway, since when this happens WiFi is dead anyway. It was the easiest way for me to track down where it was falling apart, and to print statistics. Perhaps there is some step we need to take before sending the packet? Status check?
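
For comparison, a bounded variant of that retry loop could be sketched as follows. It is not what the repeater does: `tx_fn_t`, `tx_with_retry` and `stub_tx` are hypothetical names, the stub merely stands in for the closed-source WiFi output function, and the sketch retries on any nonzero status rather than only on 1.

```c
#include <assert.h>

/* Hypothetical transmit hook: returns 0 when sent, nonzero on error. */
typedef int (*tx_fn_t)(void *ctx);

/* Call tx up to max_tries times and return the last status.
 * If tries_out is not null it receives the number of attempts made. */
static int tx_with_retry(tx_fn_t tx, void *ctx, unsigned max_tries,
                         unsigned *tries_out)
{
    int rv;
    unsigned n = 0;
    assert(tx != 0 && max_tries > 0);
    do {
        rv = tx(ctx);
        n++;
    } while (rv != 0 && n < max_tries);
    if (tries_out)
        *tries_out = n;
    return rv;
}

/* Demonstration stub: fails until the counter behind ctx reaches zero. */
static int stub_tx(void *ctx)
{
    int *failures_left = ctx;
    if (*failures_left > 0) {
        (*failures_left)--;
        return 1;
    }
    return 0;
}
```

With a limit in place the caller gets the error back and can decide what to do (drop the packet, count the failure, or reboot deliberately) instead of relying on the watchdog.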

xxxajk commented 5 years ago

BTW, espressif has now bumped the SDK to version 3... I'm going to try that and see what explodes. Supposedly (with the proper options) it gives you more IRAM and has a ton of bug fixes. I'll let you know how it works out, and create a new branch for it.

xxxajk commented 5 years ago

Been digging around, and as far as the SDK goes, it seems Arduino is using v2.2.0-28-g89920dc. I know for a fact that this has made things more stable here, so what I will do is patch the esp-open-sdk to use the same version. The Arduino version also contains the fixes for all of the CVEs.

martin-ger commented 5 years ago

Just made the polling (aka software interrupt) a compile-time option in the "enc_polling" branch (#define ENC_SW_INTERRUPT 1 in espenc.h) and reverted the changes in the send logic - now it works for me...

I'm really interested in the results with the newer SDK!

xxxajk commented 5 years ago

The SDK is building, I'll let you know ASAP.

xxxajk commented 5 years ago

Great news! Newer SDK does fix the issue! All that it needs now is to somehow do a yield if the enc connection is getting pounded too hard (ping -f will cause the WDT to bark).

martin-ger commented 5 years ago

Good news! What exactly do you have to do to get the WDT to bite?

Just give ENC_SW_INTERRUPT 1 another try - currently it works great for me...

xxxajk commented 5 years ago

This test was with the polling branch from yesterday.

xxxajk commented 5 years ago

Will have to do a few makefile changes for the latest to work. There's an error in the wifi repeater with the compile/link options. It should only have -Os. The -O2 will cancel the -Os, and worse, there isn't any difference in speed. The LD flags are missing a few important parts that will reduce IRAM usage a shit-ton too. I'll do pull requests on both shortly.

xxxajk commented 5 years ago

also... user/user_main.c:3625:8: error: too few arguments to function 'espenc_init' :-)

martin-ger commented 5 years ago

The repeater also has an enc_polling branch now, that does the job. It is generic and works with both: sw interrupt and direct interrupt.

xxxajk commented 5 years ago

got it... Will push Makefile fixes, and sdk updates shortly.... testing now...

xxxajk commented 5 years ago

[lwip/netif/espenc.c:enc28j60_link_output:170] transmission failed (57 - ff)
[lwip/netif/espenc.c:enc28j60_link_output:170] transmission failed (3 - ff)
(enc death)

But WiFi is working... no more loss of WiFi... will try with the soft option as 0 and see...

xxxajk commented 5 years ago

With zero, works!

xxxajk commented 5 years ago

Try my stuff, using my SDK branch. Works fantastic. Pull requests filed.

xxxajk commented 5 years ago

One thing left to do I think... I'm wondering if lower edge is enough for IRQ. Does the esp8266 support IRQ on low level? I think that would produce more reliable RX. I'm not 100% sure here, but I think it may be missing an IRQ here and there with very heavy traffic, and may be happening during the TX, where the ISR is reinstated and the esp8266 misses an rx, or worse. While the most reliable yet, I'm hopeful that just IRQ on LOW would trigger instead of just low edge.
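
One common way to make an edge-triggered interrupt tolerant of a missed edge is to keep reading inside the handler until the chip reports nothing pending, with a budget so the loop cannot starve the watchdog. The sketch below illustrates that pattern only; `nic_ops_t`, `nic_drain` and the fake device are hypothetical and are not taken from espenc.c. On the ENC28J60 the pending count would presumably come from its packet counter register.

```c
#include <assert.h>

/* Hypothetical device hooks used by this sketch. */
typedef struct {
    unsigned (*pending)(void *dev);  /* packets still waiting in the chip */
    void (*read_one)(void *dev);     /* pull one packet out of the chip */
    void *dev;
} nic_ops_t;

/* Read packets until none are pending or the budget is used up.
 * Returns the number of packets handled in this call. */
static unsigned nic_drain(const nic_ops_t *ops, unsigned budget)
{
    unsigned handled = 0;
    assert(ops && ops->pending && ops->read_one);
    while (handled < budget && ops->pending(ops->dev) > 0) {
        ops->read_one(ops->dev);
        handled++;
    }
    return handled;
}

/* Fake device for demonstration: just a counter of queued packets. */
typedef struct { unsigned queued; } fake_nic_t;

static unsigned fake_pending(void *dev)
{
    return ((fake_nic_t *)dev)->queued;
}

static void fake_read_one(void *dev)
{
    ((fake_nic_t *)dev)->queued--;
}
```

If the budget runs out while packets remain, the caller would need to reschedule itself, since with an edge-triggered line no new interrupt may arrive for packets that are already waiting.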

xxxajk commented 5 years ago

Have a new patch I am testing. SUPER RELIABLE now. Been streaming data full-on for 06:26 before WiFi fails, which is more than enough time for what I need. Will push and do a pull request as soon as the testing is done.

System uptime: 0:06:26
21851 KiB in (15650 packets)
803 KiB out (10547 packets)
Power supply: 3.335 V
Phy mode: n
Free mem: 56344

New SDK provides a TON of free heap too!

xxxajk commented 5 years ago

System uptime: 0:05:00 23093 KiB in (16511 packets) VROOOM!

martin-ger commented 5 years ago

Eager to learn about your patch!

The free RAM with the new SDK is really impressive!

xxxajk commented 5 years ago

Yeah, Free heap sits right around that number. That's one of the new "features" of the newer SDK versions.

xxxajk commented 5 years ago

One thing I am starting to notice is that getting the data out of the chip from within the ISR on larger packets is causing the crashes. I tried bumping the SPI speed up (enc chip is spec'ed at 20MHz) but that didn't help, or failed to init. I'm wondering if we can somehow trigger the reads to occur in a task from the ISR, but outside of the ISR context. I may also be able to still fool it to be outside of the context while still in the context too. Shouldn't matter too much though, since I have the ISR stuff nailed down very well... the enc28j60 no longer produces an IRQ shit-storm, which helps a lot.

xxxajk commented 5 years ago

System uptime: 0:11:44 65697 KiB in (46825 packets) 2406 KiB out (31574 packets) If it passes the 30 minute mark, I'll do a pull request.

xxxajk commented 5 years ago

System uptime: 0:23:05 85513 KiB in (60966 packets) 3131 KiB out (41098 packets) Close.... Has to be one last race condition lurking someplace.

xxxajk commented 5 years ago

System uptime: 0:30:13 169408 KiB in (120769 packets) 6204 KiB out (81416 packets) I have a winner!

martin-ger commented 5 years ago

That's great! Waiting for your pull request ;-)

As I was busy today, I only had some minutes spread over the day. What I realized is that, for me, there might also be electrical problems. When I wanted to restart a fairly well-working config, it did not even boot - same software.

My test system is a simple breadboard construction. Maybe the MHz over jumper wires is a reason why sometimes the read errors increase? Perhaps I should try a more solid setup with possibly shorter connections.

xxxajk commented 5 years ago

It is STILL going! So happy, and I think it is totally solid! System uptime: 0:46:12 260497 KiB in (185700 packets) 9538 KiB out (125176 packets)

Yes, breadboards are bad for high speed signals... I soldered the esp8266's shield direct to the eth jack, and use short wires. Will be pushing soon. I'll send you a photo of how I joined the two later. It is now quite possibly the world's tiniest router.

xxxajk commented 5 years ago

pull requests done. There's one pesky conflict, but that is easy to resolve. Do you want me to resolve it?

xxxajk commented 5 years ago

have to do another pull request, you killed a macro I was using... no big deal though