Closed TerryE closed 7 years ago
With regard to UDP, it is common programming practice to call connect
on a socket and have it send and receive packets just to that destination (i.e. avoid the sendto
calls).
I agree wholeheartedly with your first goal!
Also, if you are going to rework the net module, then making sure that it will be possible to move to IPv6 would be a significant advantage. I know that the current SDK does not support v6, but it is supposed to be coming.....
With regard to UDP, it is common programming practice to call connect on a socket and have it send and receive packets just to that destination (i.e. avoid the sendto calls).
I agree that this is simpler but it is still not the same as the current model, because with this on a UDP server we would still want to use a net.socket for I/O and open a new socket if an inbound connection from from a different IP address, rather than the current "last sender gets the reply, regardless"
1) Yes, though this can be tricky due to the SDK internally failing to test for failed memory allocations and crashing & burning as a result thereof.
2) Yes. We might need to break backwards compat though, which would be unpleasant.
As a general comment, the SDK's espconn layer attempts to ride a middle ground between a traditional socket API and a more highlevel stream-oriented API which hides the complexity of TLS from the user. For the better part I think it works on the TCP side, but the UDP aspect has been challeng(ed|ing).
In terms of resource configuration, I think the sensible thing it to introduce a new net.config()
call to allow the application to programmatically set the max numbers of connections per listener, etc.. This would also have the advantage tha if you weren't using TCP, say, then you could set the TCP resources to zero, and use the RAM for other purposes.
In terms of UDP, my instinct would be to maintain soft backwards compatibility -- that is correctly designed and written UDP apps would continue to work as before, but some of the nature of bizarre encoding and failure modes wouldn't be preserved.
We are also constrained by the espconn UDP API, but in my view UDP listener should return return a pseudo connection socket which is unique to the IP of the sender, so that if more than one other host is in UDP exchange with the ESP then the reply packets go tot he correct host.
If would also be nice to be able to have an optional port parameter on the send because some UDP protocols use different ports for the server and client ends.
We might even want to consider not using the espconn layer for the UDP path, and use the lower-level LWIP api directly instead. I took that approach when I wrote the SNTP module, because the espconn abstraction was more a hindrance than a help at the time.
This one is a biggy so I am not rushing into it. I want to pick some low hanging fruit off my tasklist first ;-?
@TerryE The work that I am doing to add certificate support adds some new code. Currently I have this in net.c, but I could break it out into it's own file. It adds another 200 lines (at the moment).
Which would you prefer?
@pjsg @jmattsson @dnc40085 I keep looking at the net.c implementation and it's just horrible, which makes it very hard to rework without a total rewrite. If we do do a total rewrite that this will be view as a high risk change, so my instinct is to do this in incremental steps. I would suggest that the first is a code clean up which simply removes all of the deadwood, for example
luaL_checkudata()
calls which throw an error if the metatable is wrong, but the code still then check the result. I've done a previous dry run and this dropped the size of the module by roughly a third in source lines and slightly less in terms of generated code. I would suggest that I don't attempt to fix any of the execution logic on this pass so that functional execution paths are preserved.
Thoughts?
That sounds like a prudent approach if you're willing to spend the time on it!
I tend to interleave my work. When one issue is giving me brain ache then I switch to another. The net module is pretty core and I feel that we do need to progress this somehow.
@jmattsson @pjsg @devsaurus Here is a my first pass of the tidy of net.c. I would really appreciate you scanning this and giving me your reactions.
luaL_checkudata()
checks to see if the type item is a userdata with the given metatable. It will never return a NULL value so there is no point in following one of these calls with a NULL check.if (SECURE(nud))
generates if ((nud->secure))
if SSL is enabled and if((0))
otherwise enabling the compiler to remove the branch (better than the nasty #ìfdef CLIENT_SSL_ENABLE
method chopping if ... then ... else logic in half. Anyway please scan through this and give me your thoughts. Unfortunate since its now 1TBS and 2/3rds the size you can't sensibly do a diff, but I have kept the routine order for this pass so you can do a side-by-side.
An example of a silent error is net.createConnection(net.TCP,1)
if CLIENT_SSL_ENABLE
is not defined. This fails back to creating an insecure connection. Hence a programmer could forget to enable SSL in the firmware build, but the program will still appear to work, silently using an insecure connection. Not good.
PS. This version doesn't introduce any material functional changes (e.g. luaL_argcheck()
gives more informative errors than luaL_error()
but the errors will still occur at the same points / code paths.) If this receives general positive feedback then I can now start to sort out some of the known issues.
Do you plan to issue an PR so we can comment on code in-line or do you expect us to feed back within this thread?
Sorry Arnim but when you look at the review screen in the commit, then you get
464 additions, 1,162 deletions not shown because the diff is too large. Please use a local Git client to view these changes.
So I don't think that doing a PR will help. At this stage I am more interested in comments on the general approach to see if there is any support for doing this properly. The lessons learnt can also carry across to our general module programming guide. I'd you guys like the macros then these might be worth hoisting into a eLua or nodeMCU header to make them generally available. I personally think this is a lot shorter, clearer and more efficient at runtime. I've more redundant code to take out, but my main aim is to have a solid base to fix some of the known limitations of the current implementation.
If you just want to refer to specific code then just use the line no.
I favour the review features of a PR since they enable to have separate sub threads. Keeps things separated compared to a bulk issue thread. But I see your point.
So my comments after a quick read through. Will give it the code a try later.
luaL_checkudata()
I can't match your statements above with the description for Metatables. Are we running on a different flavour of lauxlib? Until now I thought that I don't have to study the code but can live with the online docs.
There are paths through the conditional tree in net_on()
that neither call NEWREF_NUDCB()
nor error out. Will the pushed callback function argument remain on the stack and consume memory until reset?
LUA_NOREF
According to http://pgl.yoyo.org/luai/i/luaL_unref, luaL_unref()
will "do nothing" when ref
equals LUA_NOREF
. The check can be removed IMO.
How is the Lua allocator enabled for your code? It's not obvious to me.
espconn_[secure_]sent()
SDK docs recommend to switch to espconn_[secure_]send()
.
There are still some TABs used for indentation.
All in all the code looks very well structured and the macros enhance readability a lot :ok_hand:.
I favour the review features of a PR since they enable to have separate sub threads.
So do I, but github doesn't seem to support this the review on this level of difference. Also this is compile-clean only. I don't guarantee that it works, yet.
Discarding the check for NULL result from
luaL_checkudata()
. I can't match your statements above with the description for Metatables.
This is a 5.1 tweak. See here. The online PiL is for 5.0, and the wording in current book is consistent with the code "It raises an error if the object does not have the correct metatable, or if it is not userdata".
Memory leak when registering unknown methods? There are paths through the conditional tree in
net_on()
that neither callNEWREF_NUDCB()
nor error out. Will the pushed callback function argument remain on the stack and consume memory until reset?
I will take a look, but in this case the original probably had the same bug. As I said, I haven't changed the logic this pass. It's just easier to see the flaws now.
According to pgl.yoyo.org
luaL_unref()
will "do nothing" whenref
equalsLUA_NOREF
. The check can be removed IMO.
Correct, but since this is now out-lined into a common routine the check saves a call. If in general the cb isn't unref, then you are right, we should still remove it. Next pass.
How is the Lua allocator enabled for your code? It's not obvious to me.
Correct. It's not. This is for the next pass.
SDK docs recommend to switch to
espconn_[secure_]send()
.
Yes, next pass.
There are still some TABs used for indentation.
Thanks. Will clean these up.
All in all the code looks very well structured and the macros enhance readability a lot.
Thanks. That was my aim. It's a lot easier to spot errors when the code is readable. In this case the binary is smaller and the execution slightly faster as well. :smile:
I've only had a very cursory look at it, and it didn't want me to run screaming for the hills. I'd say keep at it!
Gosh, this one goes on and on. I discussed using the Lua memory allocator system for espconn resources. Bad idea. Why? Because if you look at the espconn code, it can clone / copy espconn blocks (and free them) and it uses the OS allocator system to do this. I know that with the nodeMCU implementation, the Lua allocator is just a wrapper around the OS one, but one of the nice features of it being a unified hook is that you can rehook this in a test harness to check for leaks, etc. and this will all die messily if you do this rehook when some of the memory objects are allocated / freed by different systems (Lua / OS).
The API documentation also includes the caveat:
void *arg
: pointer corresponding structure espconn. This pointer may be different in different callbacks, please don’t use this pointer directly to distinguish one from another in multiple connections.
Which is another way of saying: don't assume that the esconn
block that you get passed back in a callback is one that you have allocated. I need to follow through the alloc / free logic here on some use cases.
Final comment here: most of the espconn routines can error out early for a variety of reasons, the most frequent (but not only) being an out of memory error. Needless to say, the current implementation always assumes successful execution. Durrhhh.
This is awful. It seems that getting rid of the espconn layer is the only viable solution..... However, that is probably a significant amount of work.....
Sorry Philip, I should have qualified this. The current net.c
implementation always assumes successful execution. The ESPCONN layer might have a few holes, but its our code (not Espressif's) that is horrible.
Incidentally I note that the SDK now supports
#ifdef SERVER_SSL_ENABLE
espconn_secure_set_default_certificate(default_certificate, default_certificate_len);
espconn_secure_set_default_private_key(default_private_key, default_private_key_len);
espconn_secure_accept(&esp_conn);
#else
espconn_accept(&esp_conn);
#endif
@jmattsson @pjsg @Alkorin et al, I've just started testing my cleaned up version of net.c and I can push a copy to my github repo if anyone is interested, but I must admit that I am rather ambivalent about all of this. I feel that this new version is a step improvement over the old code, but so much of ESPCONN subtleties are undocumented, so I have to keep trawling the ESPCONN code to retro-engineer these.
What really concerns me is that as far as I can deduce, the Espressif guys have added major functional enhancements in the various 1.x SDK releases which have stretched the ESPCONN parameter block interface well beyond its original design. The struct espconn block is now really an input parameter block, which is then cloned in the espconn layer -- possibly multiple times as you can my accept on multiple ports for a server, and servers can accept multiple sockets. This introduces a extra layer of object management and GC on top of the LwIP one and the Lua one. Hence we have ~1K lines of net.c + ~4K lines of espconn source over LwIP whereas a direct Lua module over LwIP only needs to use the Lua memory system and can dump a lot of this complexity. Even so it might be say 2-3K lines (probably on the low side).
So I would appreciate some comment and feedback. Do I press on with the cleaned up net.c over ESPCONN? Are people more comfortable dumping ESPCONN?
There are other modules which make use of ESPCONN today. (Almost) Every module that touches the network does so with the ESPCONN stuff. Changing all of that will be a big struggle..... Before we can dump ESPCONN entirely we will be stuck having to maintain both approaches.
On the upside, we can fix a lot of the annoying features of the current stack.
In my mind the only major "good" thing the ESPCONN layer provides is the same API for plain vs TLS connections. As Terry said, internally the ESPCONN design model has outgrown itself (and I'd argue that happened the moment they added support for TCP server sockets), and that's ignoring the rest of the code quality in that layer.
There is nothing stopping us from doing a gradual shift away from ESPCONN module by module if that's what we want to do. The SNTP module never used ESPCONN for example, and the EUS module was recently moved over to regular LwIP.
@pjsg Philip, the modules which use the ESPCONN layer are coap, http and mqtt. Changing these to use the "native" net library would be a relatively small change compared to the swap-out of ESPCONN in net.c itself, especially if I exposed the basic net API as a C callable interface.
However, let's see hat testing this new net.c module throws up. I've now got a better idea of the edge cases that will cause it trouble.
@TerryE push your code somewhere, I'll test it
@Alkorin Thomas, let me do a first pass set of tests to iron out the obvious stuff. I will push it up to my Github fork then, and post back the id here.
I just noticed that the sock:on('reconnection', ...)
which we have documented will never actually fire, as internally the net module maps the "reconcb" (really, it should be "errorcb") to the disconnection
event. I think we need to keep this for backwards compatibility, and we should probably deprecate the reconnection
registration as part of the cleanup.
And to add to that, currently the reconcb is only registered once the socket has connected, so DNS failures are never noticed other than with a printf. That needs to change.
Yup, already picked this one up. When you look at the ESPCONN code, the name is a misnomer. It's actually an unexpected disconnect (as opposed to a negotiated disconnect), e.g. the link has timed out or the far end has aborted the link. I suppose it is called reconcb because this gives the app the opportunity to reconnect if that is appropriate.
What I've done is to flip these two around, so that the disconnect callback returns an error code, which is ESPCONN_OK in the case of a normal disconnect. Of course the cb function is free to ignore this, maintaining backwards compatability.
This is the sort of detail that I meant re having to trawl through the Espressif code to work out what is actually going on.
Sorry guys, but this is all slow going, made worse by my own non-nodeMCU commitments. The new error checking around the espconn calls continues to throw up interesting issues. For example, espconn_disconnect()
can be used to close a TCP socket, but it barfs on an TCP server close with an illegal argument (the server espconn block). No, at some point the use of espconn_delete()
has been extended in the case of TCP servers to be the logical close of a espconn_accept()
, but you need to trawl into the espconn_tcp.c
to discover this. Upwards and onwards.
There is another issue with the UDP "replay to last sender" policy: There is no way to use the built-in multicast feature (what would be very desirable for IoT nodes)!
Multicasting works in receiving from, and sending to a specific udp address+port (say, 240.0.0.240:7777), so that all net.multicastJoin()ed nodes share that communication.
This works well on the receiving side; but socket.send() (and server.send()) sends to "random" addresses (last received, or even 0.0.0.0:0 on startup). There has to be a method to nail down send() to a specific address+port, otherwise any data will get unicasted to the last received node (which definitly breaks the communication mesh).
At last I see no way to establish multicasting so far... hints, anyone? (If the current implementation really can't cope with this (but hoping for the best:), IMHO the most intuitive solution would be to adapt socket.connect(port,ip) (which makes no sense with UDP anyway) to specify a target address bevore send()...?)
Sorry guys, building a house got in the way of completing this, and @djphoenix Yuri's work on #1379 now takes precedence. I need to get myself up to speed with what Yuri has done, and then I will update and reissue this if appropriate, once I've gone through his changes - - or close this.
Was fixed with #1379.
I recently rebuilt the firmware and had an unexpected failure (is one ever expected?). After spending much time and drilling through the fw code I found that for UDP connect()
was removed and is now part of send()
.
While this makes sense (UDP does not have the concept of connection) and I like it, I think that if this change is folded into master without backward compatibility then this fact should be highlighted.
@eyaleb Don't need to look in code but in docs :) I think there are case where we haven't option for backward compatibility. And so I think we're should note it something in changelog (every project have a changelog? Hmmm...) Anyway firmware update always is breaking-aware.
I did look at github and it does not say much, this is all there is:
net.udpsocket:send() Sends data to specific remote peer.
readthedocs is similar.
Then again, this is dev
so I expect surprises or omissions and I should be able to deal with them. My note is looking forward to when we hit master
.
@eyaleb thanks for the note. I agree we could do better regarding this specific issue and I created #1701 for that.
We don't have a dedicated changelog, nor did I want to maintain one, but the major changes are highlighted in the release notes. Of course that doesn't help you if the changes in question haven't been released yet. We also keep (more or less internal) notes for each milestone at e.g. for 2.0.0. I once documented that in the contribution guidelines.
@eyelab, the points that you make are valid, but don't belong on this issue, since it has been superseded by @djphoenix Yuri's work.
The net module attempts to provide a unified model for TCP and UDP and really fails to paper over the cracks in how these are structurally different and in particular that UDP is not connection orientated. In the current implementation:
IMO, this is a mess.
Thoughts, comments?