nodemcu / nodemcu-firmware

Lua based interactive firmware for ESP8266, ESP8285 and ESP32
https://nodemcu.readthedocs.io
MIT License
7.64k stars 3.12k forks source link

Modification of LwIP #1195

Closed TerryE closed 7 years ago

TerryE commented 8 years ago

Issues #1166, #1181 and PR #1192 highlight an issue where we might be digging ourselves a maintenance hole. Espressif open-sourced their implementation of LwIP in SDK 1.3, based on the LwIP project's version 1.4.0 release dated 6 May 2011 (though the LwIP licence terms don't require this). (See app/include/lwip/init.h for version info).

The current stable LwIP release is 1.4.1 dated 17 Dec 2012, and version 1.5 is still under beta development.

The Espressif implementation comprises a subset of the standard distro (e.g. ppp slip, etc. has been stripped out) that has a lot of added unprintable comments (not ASCII or UTF-8, maybe some DBCS Chinese font?) with some Espressif additions (mdns.c and smtp.c -- though these my be extracts from LwIP applications) and an "app" wrapper which is also Espressif code.

I suspect that our urge will be to fix the buggy areas -- which will tend to be the Espressif additions as the core project code has been hammered. However, we are changing a fork of a fork which is always going to be a major maintenance headache, so before we do this I feel that we should at least discuss this issue an reach a consensus amongst the contributors before we go down this route.

My personal view is that we should avoid bug-fixes to Espressif-supplied code which they themselves might change across SDK versions, and that we should upstream and fixes to Espressif.

If we do change this code then we should do it in a very controlled method -- as Johny has set up in his skeleton sdk-overrides, by copying in the sdk components into our firmware tree then making any changes. though I would prefer one or more .patch files here so it very clear what changes we are making.

We should also make these on a very conscious basis where there is a clear case for doing so.

TerryE commented 8 years ago

By way of example on my last point, in my view there one area where we should consider changing from the core LwIP library and that is the memory allocator. This is a hook in LwIP and there are various configuration options to tailor this. As far as I can see, Espressif have simply mapped this to os_malloc.

What I am not sure about is how robust LwIP is if memory exhaustion occurs. It's typical strategy is to do a if(!p) return 0 type of abort, but the upper levels often don't handle this zero status return correctly so weird failures can occur that will be very hard to diagnose when the fundamental issue is that the app has exhausted the heap.

Lua is somewhat different to normal C applications in that it is heap based, but with a lazy garbage collect strategy. The current default implementation therefore performs GC on every alloc and as result Lua runs about 10x slower than it needs to. However, we have to balance this performance advantages of lazy GC and not killing the IP-based stacks. Need to think about this some more.

pjsg commented 8 years ago

Have you had any luck in upstreaming changes to Espressif?

I think that the mdns code is purely from espressif. There is no mdns code in the lwip tree as far as I can see.

The other option is that we just replace the provided mdns code with our fork... It then isn't a change on the original code, but just a replacement (it gets put somewhere else in the tree and the api is renamed so that there aren't any collisions). If we did this, then I'd be inclined to remove all the code that I don't understand (I think it is left over from some other project).

The whole issue of memory allocation is something that we need to deal with. One approach might be to reserve a smallish pool of memory for LWIP (and friends) and fall back to using that if the main pool is empty. This might be equivalent to performing a GC if there is less than (say) 4k free on a luaM_malloc call. However, I do get slightly worried about fragmentation and not being able to satisfy large requests.

With regard to the version of LWIP -- they have said that we will be getting ipv6 support sometime (soonish). This will need 1.5 (or so). Lots of things will change at that point!

TerryE commented 8 years ago

I think that the mdns code is purely from espressif.

I think it was written by a grad for the IP Puck project. Here is the repo and here is a PDF overview of the project.

LWiP optionally has its own pool system, but I am not sure of the memory impact. My instinct is that trying to run two parallel allocators in a 48Kb RAM pool is going to end in grief.

pjsg commented 8 years ago

I have put up another PR #1197 which just copies the mdns code out of the lwip directory into a new directory for network related stuff. I think that we are better off maintaining this ourselves.

TerryE commented 8 years ago

I note that the Espressif 1.5.2 open LwIP release includes bugfixes for both TCP and UDP close logic for both espconn abstraction layer and the core tcp_in.c and udp.c, but is still based on the LwIP 1.4.0 version rather than the current stable 1.4.1 version which incorporated quite a few bug fixes.

I also note the discussion in EUS about abandoning the whole espconn abstraction layer and accessing the TCP stack directly. Is this a serious option? What I would be uncomfortable doing is is a half-and-half mix and match.

We still have this issue of tuning the LwIP stacks memory allocation system to align with Lua and its lazy GC approach.

jmattsson commented 8 years ago

Since Espressif at least provides a source version of the LWIP stack we get to see the murky waters which lie within the espconn layer. For the TCP-server path, there currently isn't any safe way to actually shut down the server, which is what led me to switch from espconn to native LWIP for th enduser_setup module. The native LWIP API is fully event oriented and seems largely sane. I say largely, because when being the initiator of a connection teardown (rather than responding to a remote close), the stack appears to leave things around in some fin_wait state or other, pooling memory like a squirrel in autumn. The espconn layer appears to avoid this somehow, but I haven't made my way through that rabbit warren to understand how just yet. What I have chased down so far all just ends up with tcp_close() which is the very same call which has given me grief. My current thinking is that I must've been ignoring a return value somewhere I really shouldn't have, and that's why my tcp_close()s weren't doing what I wanted them to. Anyway, what I was trying to say before I went off on that tangent was that the actual LWIP API is probably more usable (and a lot less buggy) than the espconn layer. Google for lwip rawapi.txt to pull up a copy of the API doc for it. It has a well-defined object ownership model, which was a refreshing change from espconn.

The two downsides to hitting the native API instead of espconn are that firstly it's not what ESP-coders are used to, and secondly for TLS we're still stuck with espconn_secure (which may be better or worse than the plain espconn abstraction, I haven't look yet). For the enduser_setup module we didn't have much of a choice if we wanted to get it stable at this point in time. I also haven't worked out what the espconn layer actually adds in terms of features over the native API...

As Terry has already said, attempting to fix up the espconn stuff is going to generate conflicts whenever a new drop is made available by Espressif. I haven't had a stellar track record with convincing Espressif to fix things in this area, even when they're security related, so it's definitely not a case of just having us fix it and raise a PR to Espressif, alas.

We're in a bit of a bind here with no clearly good option. I think what might be the best option would be if there is an LWIP-compatible TLS stack we could use as an alternative to espconn, at which point we'd no longer need to use either of the espconn abstractions. Of course, if the espconn code matures to a point where it becomes stable and usable, that would be handy too. Right now I don't have much faith in that.

robertfoss commented 8 years ago

Re: LWIP-compatible TLS stack esp-open-rtos uses mbedTLS, so that seems like a viable route.

TerryE commented 8 years ago

I wonder how many more years the esp8266 will be in volume production? Maybe 1 or 2 at the most. If it does get replaced by the esp32 then I suspect that we'll be needing to port the nodemcu firmware to esp-open-rtos. At the moment we can't fit esp-open-rtos and the Lua RTS within the esp8266 resource constraints.

My view is that we focus on doing stuff that (i) enhances to overall usability of the Lua environment but (ii) isn't likely to be junked when we do start supporting the esp32.

As far as I am concerned EUS and Bonjour / mDNS are (albeit large) niche applications; we are reaching the plateau of what we can sensibly achieve with these anyway, so I don't want to have to change our overall direction to get another 5% out of these unless this is a true majority feeling amongst the committers.

At the moment our major challenge is our net layer is flacky. You bypassed it to use the espconn layer directly for EUS and now Johny has moved directly to the LwIP tcp API. I just wonder if we should bite the bullet and go directly to net over LwIP and bypass espconn altogether.

The mbed TLS tutorial says it mostly uses LwIP to implement the TCP layer, but even a lean port of this will require 1Mbyte Flash parts at a minimum to run this and Lua.

TerryE commented 8 years ago

@jmattsson

The espconn layer appears to avoid this somehow, but I haven't made my way through that rabbit warren to understand how just yet.

Track grep ets_post -R app/lwip -B 15 -A 5.

pjsg commented 8 years ago

My take is the the ESP8266 will continue for quite some time -- it is included in all sorts of random devices that manufacturers will want to sell for years (light bulbs, sockets, etc, etc). If the price of the ESP32 gets down to the same level (and there are good arguments why that will take a while), then people will start choosing that for most new designs -- even if the device does not need the extra capabilities.

jmattsson commented 8 years ago

@TerryE I'm not sure what you're hinting at here? The espconn code ends up in tcp_close() just as one would normally when using the LWIP native API. In one particular path it even reaches deep into the LWIP stack to call tcp_abandon() directly, but that doesn't seem to be the relevant path (nor am I convinced it is safe to reach that deep into the LWIP stack, but that's another topic). I haven't come across anything which would speed up the fin_wait. The escponn_server_poll() function would be the obvious candidate, but again this only trickles through to tcp_close().

TerryE commented 8 years ago

Maybe you are right. I suspect the sensible thing to do is to diff LwIP v 1.4.1 against 1.4.0 and we if that includes some fixes. Do you want to do this or shall I?

someburner commented 8 years ago

Don't know if this helps at all, but I remember when diffing Espressif's 1.5.0 LWIP and 1.5.2, they added some stuff, specifically in espconn_client_close() and espconn_server_close(). For example:

espconn_tcp.c (line 956): espconn_server_close(void arg, struct tcp_pcb pcb,u8 type) { ... if(type == 0) // 1 force, 0 normal err = tcp_close(pcb); else { tcp_poll(pcb, NULL, 0); <--- new tcp_sent(pcb, NULL); <--- new tcp_err(pcb, NULL); <--- new tcp_abort(pcb); err = ERR_OK; }

if (err != ERROK) { /* closing failed, try again later / tcp_recv(pcb, espconn_clientrecv); } else { / closing succeeded _/ if (type == 0) { <--- new if statement // 1 force, 0 normal
tcp_poll(pcb, NULL, 0); tcp_sent(pcb, NULL); tcp_err(pcb, NULL) } /_switch the state of espconn for application process*/ ... }

So the major differences are they added a 'type' argument with 0 for normal and 1 for force. And they changed the tcp_poll call in espconn_tcp_accept() from tcp_poll(pcb, espconn_serverpoll, 8); /* every 1 seconds / to tcp_poll(pcb, espconn_serverpoll, 4); / every 1 seconds */

Then in espconn.c they added espconn_abort() which calls espconn_server_close() with type=1, and then espconn_disconnect() with type=0;

In the 1.5.2 documentation they say: "espconn_abort Function: Force abort a TCP connection Note: Do not call this API in any espconn callback. If needed, please use system_os_task and system_os_post to trigger espconn_abort."

And in udp.c they switched a bracket location for an if else in udp_input() that I would imagine was a bug. I use "Meld" for diffing stuff in case people haven't heard of it. Makes it really easy to identify which files have changed and highlights the changed parts. Of course, git does that as well but I find the IDE helpful.

jmattsson commented 8 years ago

They're far too soft with their "please use" statement - it really should be "you MUST use... or you'll screw over your memory in weird and wonderful ways"... Both tcp_close() and tcp_abort() [may] release the passed pcb, so after either has been called you can't touch it. Makes me wish they'd put a non-reentrant mutex around all the espconn functions so you'd deadlock rather than silently start corrupting memory.

TerryE commented 8 years ago

@jmattsson that's what I was trying to say above in short hand. They use task posting to break up and reschedule TCP stack bits in magical and mysterious ways. Sometimes this is essential and they've implicitly left some to the application layer.

pjsg commented 8 years ago

Is this an argument to move to the 1.5.2 SDK (with patch)?

jmattsson commented 8 years ago

I'd be hesitant to pull in 1.5.2 right before a cut to master though.

someburner commented 8 years ago

I'd look into the bracket change in udp.c. Maybe it's not a big deal but it would seem that the if(dest == DHCP_SERVER_PORT) would never gets reached unless the bracket is changed. Perhaps that is affecting your mdns.

jmattsson commented 8 years ago

Having started looking at what it would take to get NodeMCU onto the ESP32 (which does not have a "non-OS" SDK, only the RTOS SDK), it is worth noting that espconn is nowhere to be seen. If we're going to aim for the ESP32, we'll need to wean ourselves off espconn one way or another.

NicolSpies commented 8 years ago

Weekend chatter on Espressif BBS and https://espressif.world.taobao.com/ of the official ESP32 launch in August.

TerryE commented 8 years ago

Thanks, @jmattsson. I'll strip out the ESPCONN layer from the net rewrite and work direct over LwIP. Back from hols at weekend. The down side is that this might add another month elapsed to the work, since I've got so much on a the moment. Sorry

israellot commented 7 years ago

The RTOS version code seems to have a LWIP implementation that contains PPP. Is that usable? Anyone tried? https://github.com/espressif/ESP8266_RTOS_SDK/tree/master/third_party/lwip/netif/ppp

marcelstoer commented 7 years ago

Since our net module is now based on lwIP this should now be closed I guess.