RFC on Tasking Inside our NodeMCU Framework

TerryE commented 5 years ago

The NodeMCU architecture in essence.

NodeMCU works broadly the same as Node.js (here is a good overview). On the ESP variants (RTOS for the ESP32 or the non-OS SDK), a Lua application is composed of a set of tasks organised and scheduled through a Single Threaded Event Loop Scheduler.

Each task typically has a thin C initiator which then calls a Lua function that may call other Lua functions in turn, but then the whole runs to completion. The event scheduler will then start the next task ready to run based on FIFO within priority. The whole framework is based on the rule that individual tasks run to completion and are not interrupted by other tasks, so the system as a whole can be implemented in a single processing thread. For this to work, tasks should be short, sharp and non-blocking. Each task is typically initiated by an external event: a timer has fired, a GPIO has been set, a network packet has arrived, and so these are referred to as callbacks in SDK terminology. Because the Lua VM only executes one task at a time we don't need mutexes or other fancy task synchronisation mechanisms. Multi-tasking is cooperative: a task yields control by terminating.
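A minimal sketch of "FIFO within priority" with run-to-completion tasks, assuming three priority levels and a small fixed-size ring buffer per level. Every name here (task_post, task_run_next, record_task and so on) is hypothetical and chosen for illustration; this is not the SDK or NodeMCU task API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout: not the SDK or NodeMCU task API. */
enum { PRIO_LOW, PRIO_MEDIUM, PRIO_HIGH, PRIO_COUNT };
enum { QUEUE_LEN = 8 };

typedef void (*task_fn)(int param);

typedef struct {
  task_fn fn[QUEUE_LEN];
  int param[QUEUE_LEN];
  size_t head, count;            /* ring buffer: oldest entry is at head */
} task_queue;

static task_queue queues[PRIO_COUNT];

/* Queue a task; returns 0 if the queue for that priority is full. */
int task_post(int prio, task_fn fn, int param) {
  assert(prio >= 0 && prio < PRIO_COUNT && fn != NULL);
  task_queue *q = &queues[prio];
  if (q->count == QUEUE_LEN)
    return 0;
  size_t tail = (q->head + q->count) % QUEUE_LEN;
  q->fn[tail] = fn;
  q->param[tail] = param;
  q->count++;
  return 1;
}

/* Run the oldest task of the highest non-empty priority to completion.
   Returns 0 when nothing is ready to run. */
int task_run_next(void) {
  for (int prio = PRIO_COUNT - 1; prio >= 0; prio--) {
    task_queue *q = &queues[prio];
    if (q->count == 0)
      continue;
    task_fn fn = q->fn[q->head];
    int param = q->param[q->head];
    q->head = (q->head + 1) % QUEUE_LEN;
    q->count--;
    fn(param);                   /* the next task starts only after this returns */
    return 1;
  }
  return 0;
}

/* Example task: appends its parameter to a log so the run order is visible. */
static int run_log[16];
static size_t run_count;

void record_task(int param) {
  if (run_count < 16)
    run_log[run_count++] = param;
}
```

Posting two low-priority and two high-priority tasks in interleaved order and then calling task_run_next() until it returns 0 runs both high-priority tasks first, each group in the order it was posted. Nothing in the loop needs a lock, because only one task body is ever executing.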

In a typical well-written ESP Lua application, most tasks are short and execute within a few milliseconds so the ESP processor can complete 100s of tasks a second with minimal overhead, making it and NodeMCU really well suited to embedded IoT applications.

So a typical implementation pattern for a task is that it comprises:

An initiator coded in C which is scheduled in response to an external event. For example a network socket event, such as receiving a TCP packet, invokes the routine net_recv_cb() which then decodes the event and decides which Lua action function to execute.
There is typically a 1-1 association with a Lua-callable booking function which can book such events and associate the correct Lua function with the event occurring, in this case net_on('receive',func).

Because each task exits from a Lua VM perspective, that is the Lua call stack unrolls entirely, the only Lua variables that are preserved from task-to-task are stored in the Lua environment (_G) and in the Lua Registry or their direct children. The Lua GC will collect all local variables created and released during the task execution.

Because Lua task functions must persist from task to task, these are all stored in the Lua Registry and referenced using an integer handle. The booking function will use the luaL_ref() API to allocate this registry slot and obtain the handle, and then the event routine will retrieve the task function using this handle and then call luaL_unref() to return the used slot to the pool, before executing a lua_call() to execute the Lua task.
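The handle life cycle described above can be modelled without the Lua headers. The following is a toy stand-in for the Registry, assuming a four-slot pool and a one-shot callback; cb_ref, cb_unref, cb_fire_once and count_hits are hypothetical names, and this is not how Lua implements luaL_ref() internally. It only shows the order of operations: book, fetch, release the slot, then call.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the Lua Registry: hypothetical names, not Lua's code. */
enum { NOREF = -2, SLOTS = 4 };      /* NOREF plays the role of LUA_NOREF */

typedef void (*callback_fn)(void *ctx);

static callback_fn slots[SLOTS];

/* Like luaL_ref(): store fn in a free slot and return its integer handle. */
int cb_ref(callback_fn fn) {
  assert(fn != NULL);
  for (int i = 0; i < SLOTS; i++) {
    if (slots[i] == NULL) {
      slots[i] = fn;
      return i;
    }
  }
  return NOREF;                      /* pool exhausted */
}

/* Like luaL_unref(): return the slot to the pool. */
void cb_unref(int ref) {
  if (ref >= 0 && ref < SLOTS)
    slots[ref] = NULL;
}

/* Event side for a one-shot callback: fetch, release the slot, then call.
   Returns 0 if the handle does not refer to a booked callback. */
int cb_fire_once(int *ref, void *ctx) {
  if (*ref < 0 || *ref >= SLOTS || slots[*ref] == NULL)
    return 0;
  callback_fn fn = slots[*ref];
  cb_unref(*ref);
  *ref = NOREF;                      /* the holder must forget the stale handle */
  fn(ctx);
  return 1;
}

/* Example callback: counts how many times it has run. */
void count_hits(void *ctx) {
  ++*(int *)ctx;
}
```

Resetting the stored handle to NOREF in the same place that releases the slot is the detail that tends to drift between hand-coded copies of this pattern, which is one argument for encapsulating it.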

This is a pretty fixed implementation pattern but we haven't encapsulated this in a higher level API, so there are subtle differences in how this is coded from task to task. Not good.

Whilst NodeMCU as a whole makes very effective use of this framework through its modules library, ironically the core Lua VM does not. This is possibly because the Lua port was done first to bootstrap the implementation. A good example of where we could use this effectively follows:

Lua error handling and Panics

NodeMCU implements the standard Lua error handling model. In this any call level can establish an error handler as part of calling a sub-function. If errors are thrown in this sub-function then they are caught by the error handler. If an error is thrown and not caught by an error handler then it is caught at the top level by what is known as the Lua Panic handler, and on NodeMCU this emits a terse error message to UART0 before rebooting the ESP. This makes Panic errors very difficult to diagnose.

There is absolutely no reason for panics to be handled this way. If we look at a typical pattern for calling a task function:

 if (ud->client.cb_sent_ref != LUA_NOREF) {
    lua_rawgeti(L, LUA_REGISTRYINDEX, ud->client.cb_sent_ref);
    lua_rawgeti(L, LUA_REGISTRYINDEX, ud->self_ref);
    lua_call(L, 1, 0);
 }

Here we are calling the function with the handle ud->client.cb_sent_ref, passing the userdata referenced by ud->self_ref as context. If this cb_sent_ref routine throws an error then it will panic and reboot the ESP. Why do this? If we replace this with a pattern:

lua_rawgeti(L, LUA_REGISTRYINDEX, ud->self_ref);
nodemcu_call(L, ud->client.cb_sent_ref, 1, 0, 0);

We can not only save on coding space, but also get panic handling with full error traceback 'for free'. There are 71 such fragments in the modules directory so doing this is a pretty straightforward batch edit. We would need one extra node call, node.atpanic(function), which establishes a non-default panic handler. The nodemcu_call() would be something along the lines of:

int nodemcu_call (lua_State *L, int ndx, int narg, int res, int dogc) {
  int status;
  if (ndx == LUA_NOREF)  /* nothing booked for this event */
    return 0;
  int base = lua_gettop(L) - narg + 1;  /* index of the first argument */
  lua_pushcfunction(L, nodemcu_traceback);
  lua_insert(L, base);  /* put traceback under args */
  lua_rawgeti(L, LUA_REGISTRYINDEX, ndx);
  luaL_checkanyfunction(L, -1);
  lua_insert(L, base + 1);  /* put function between traceback and args */
  status = lua_pcall(L, narg, (res < 0 ? LUA_MULTRET : res), base);
  lua_remove(L, base);  /* remove traceback function */
  /* force a complete garbage collection if requested */
  if (dogc) 
    lua_gc(L, LUA_GCCOLLECT, 0);
  return status;
}

Now the call always returns, whether or not the function throws an error. However, if it does then nodemcu_traceback() gathers a full error traceback and does a task post to the registered atpanic routine with the traceback as a string argument. The default atpanic routine would print this full traceback and restart the CPU. However, a production application might log the error over the network.
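To make the control flow concrete without depending on the Lua headers, here is a minimal model in plain C, assuming setjmp/longjmp as the unwinding mechanism and a single pending-report flag in place of a real task post. All names (protected_call, throw_error, set_atpanic, dispatch_panic and the example tasks) are hypothetical, and the fixed message string stands in for the traceback that nodemcu_traceback() would build.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names: this models the control flow only, not the Lua API. */
typedef void (*task_fn)(void);
typedef void (*panic_fn)(const char *report);

static jmp_buf *catch_point;       /* innermost active protected call */
static char report[128];           /* stands in for the traceback string */
static int report_pending;
static panic_fn atpanic_handler;   /* what node.atpanic() would register */

void set_atpanic(panic_fn fn) { atpanic_handler = fn; }

/* Stand-in for raising an error: unwind to the enclosing protected call. */
void throw_error(const char *msg) {
  assert(catch_point != NULL);
  strncpy(report, msg, sizeof report - 1);
  report[sizeof report - 1] = '\0';
  longjmp(*catch_point, 1);
}

/* Run a task; on error queue the report rather than rebooting.
   Returns 0 on success and 1 if the task threw. */
int protected_call(task_fn fn) {
  jmp_buf here;
  jmp_buf *outer = catch_point;
  int status = 0;
  catch_point = &here;
  if (setjmp(here) == 0) {
    fn();
  } else {
    status = 1;
    report_pending = 1;            /* the "task post" to the atpanic routine */
  }
  catch_point = outer;
  return status;
}

/* Called later from the event loop: deliver a queued report as its own task. */
int dispatch_panic(void) {
  if (!report_pending || atpanic_handler == NULL)
    return 0;
  report_pending = 0;
  atpanic_handler(report);
  return 1;
}

/* Example tasks and an example handler that keeps the report in memory. */
static char logged[128];
void ok_task(void) {}
void failing_task(void) { throw_error("callback failed"); }
void log_report(const char *r) { strncpy(logged, r, sizeof logged - 1); }
```

The point of deferring the handler is that it runs as an ordinary task with a clean stack, so it can itself do non-trivial work such as sending the report over the network.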

Other possible uses of tasking within the Lua VM / NodeMCU runtime.

It is moot whether we should regard such features as Lua components (i.e. with a lua_ prefix and part of the lua file hierarchy) or as NodeMCU ones (i.e. with a nodemcu_ prefix and part of the platform or similar file hierarchy). My view is that these extensions are intimately tied into the Lua VM and we already have a Lua module for the NodeMCU extensions; this uses the luaN_ prefix and is in lflash.c, but there is sound sense in lumping all of these extras together and calling this file lnodemcu.c instead.

Anyway, as well as error handling, other places where I am planning to use this tasking model include:

the interactive read loop
node.output() spooling
Smart GC
A Lua coroutining implementation.

Well any considered responses?

PS: ~~or follow the dev-esp32 lead and use luaX_ for this~~ and keep luaN_ for LFS.

devsaurus commented 5 years ago

Excellent concept, fully support that :+1:

TerryE commented 5 years ago

One footnote here. After a side conversation with @jmattsson, I've just realised that the use of luaX_ on the esp32 codebase was introduced by @jpeletier with his MQTT port but unfortunately breaks the Lua internal naming conventions, as luaX_ is already allocated to llex.c. We will stick to luaN_ and possibly add luaW_ for the NodeMCU additions to core VM Lua functionality. I will back out the luaX_ references when I add lua53 to the dev-esp32 branch.

TerryE commented 5 years ago

I've had a few other commitments over the last few weeks, so progress has been slow on this, but I now consider this chunk of work as stable.

The pipe stuff for stdin and stdout/stderr works fine, so you can just cut and paste large chunks of code into the UART0 and telnet interfaces without data overrun. There is still an internal limit on Lua strings being <= 4Kb, so individual Lua source chunks can't be longer than 4Kb.
Since the pipe module handles all of the complexities of marshalling, the telnet source is now a lot simpler and more robust.
Stderr errors are reported on a telnet session.
Removing PANICs from CBs is now a simple (typically 1-line) change per CB. These are just reported as an error traceback.

I am visiting family for the next couple of days so I will do the PR itself on Thursday. Given that this is an architectural alignment for lua53, I think that we will need to leave it as an unmerged PR for a few weeks. It makes sense to do the next master drop before merging it.

@marcelstoer, are you comfortable with this?

TerryE commented 5 years ago

Implemented in #2836

nodemcu / nodemcu-firmware