nodemcu / nodemcu-firmware

Lua based interactive firmware for ESP8266, ESP8285 and ESP32
https://nodemcu.readthedocs.io
MIT License
7.64k stars 3.12k forks

Support for two LFS regions and on-ESP LFS building #3206

Open TerryE opened 4 years ago

TerryE commented 4 years ago

New feature

Additional support for two LFS regions, and the ability to both save and load (update) LFS regions on ESP.

Justification

Most committers and many developers would seem to want this.

Highlights

This enhancement will be for Lua 5.3 only as this builds upon the groundwork that I've already laid in the Lua53 implementation.

Example usecases

-- Create a (child) LFS based on all LC files in SPIFFS
do
  local f,a = files.list('%.lc$'),{}
  for k in pairs(f) do a[#a+1] = k end
  node.LFS.reload(a)
end
-- Create a new (child) LFS replacing one specific function from SPIFFS
do
  local v = node.LFS.list('application')
  local a = loadfile {'mysub.lua'} -- use array form to return an array
  for _,n in ipairs(v) do
    if not a[n] then a[n] = node.LFS.get(n) end
  end
  v = {}
  for n,f in pairs(a) do v[#v+1] = f end
  node.dumpfile(v, 'lfs.img')
  node.LFS.reload('lfs.img')
end

How the NodeMCU binary format differs from standard Lua 5.3

In general terms, the Lua RTS dump function deterministically traverses a Proto hierarchy, converting all fields to a stream of binary tokens, and this stream is the compiled file format. The load executes an "undump" which does the inverse traversal, recreating the Proto hierarchies. This much is the same. But as to why the differences:

Technical Issues

Cache coherence.

I currently flush the ICACHE with a botch: reading a sequential 32Kb address window in flash. @jmattsson: Q: do you know a better way?

jmattsson commented 4 years ago

A Cache_Read_Disable()/Cache_Read_Enable() pair should do the necessary register twiddling to reload the cache window into the SPI flash. IIRC it's a single control register in DPORT0, but I don't seem to have a handy definition document sitting around anywhere I can find it right now. Maybe I lost that in the disk crash the other year? Anyway, just make absolutely sure that the call to Cache_Read_Enable() is either present in the instruction cache (i.e. not in what would be the next cacheline) or that you put the call pair explicitly into IRAM :)

HHHartmann commented 4 years ago

Wow, that is quite a writeup of all your preparation of this and the discussions about it.

I stumbled across this example:

Example usecases

-- Create a (child) LFS based on all LC files in SPIFFS
do
  local f,a = files.list('%.lc$'),{}
  for k in pairs(f) do a[#a+1] = k end
  node.LFS.reload(a)
end

Does the node.LFS.reload(a) really work incrementally, and how do I then reset it? Or is it connected to the content of the file, multi-function vs. single function?

TerryE commented 4 years ago

A Cache_Read_Disable()/Cache_Read_Enable() pair should do the necessary register twiddling to reload the cache window into the SPI flash.

@jmattsson, thanks. Yup, the SDK SPI routines use Cache_Read_Enable_2() and Cache_Read_Disable_2(); what these do is temporarily turn off the cache without flushing it, which clearly gives a lot faster performance, but the kicker is that you can lose cache coherence if you are doing an SPI write to a mapped address range. I also understand that the disable/enable pair needs to be executed from IRAM0 and that the enable must have the args (0, 0, 1) for our configuration, that is, a 32Kb ICACHE mapping 1Mb starting at 0x000000. DiUS Ltd will need to tweak this with their OTA fork.

jmattsson commented 4 years ago

Thanks for the heads up. For this particular purpose you might want to deliberately use the ROM'd Cache_Read_Disable()/...Enable() rather than the "improved" SDK versions.

Oh, and I found the reference to the control register: https://github.com/esp8266/esp8266-wiki/wiki/Memory-Map It's got both the flash cache and the iram cache info there, if you prefer to hit it directly.

TerryE commented 4 years ago

Does the node.LFS.reload(a) really work incrementally, and how do I then reset it? Or is it connected to the content of the file, multi-function vs. single function?

@HHHartmann. Future tense: it will work this way, post the PR. That's because there will only be one dump format. A luac.cross image file is just an LC file with all of the modules in it.

This implementation is continually trying to achieve as much scaling as possible whilst working within the ~44Kb heap limits available on a clean restart.

The dump process is pretty lightweight in that it is walking Proto hierarchies in RAM and LFS and serially dumping them to file. The kicker is that I need to collect all of the strings used in the dump and then append them to the dump file.

How much of a realistic limitation is a max of 512 string constants per dump? I doubt that many developers would hit this, and the workaround is to split the dump into multiple files.

A slightly more complex alternative (a bit geeky if you weren't into this sort of thing) would be to allocate a custom fixed hash vector, say 2K × 4-byte, with a quadratic hash and a packed 12:2:18 structure (ndx, source, iTString). You are storing {index, TString} pairs in each bucket; this could just be an {int ndx; TString *} pair, but it could pack down into a single word given that the max index is, say, 4K and the TString is a word-aligned offset from DRAM0, LFS0 or LFS1. The `{int, TString}` version is just easier to code. The packed version would execute as fast and give a 2K string-constant limit, but would need a few dozen lines of extra code to implement.
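As a rough illustration, such a packed bucket might look like this in C. The field layout follows the 12:2:18 split described above, while the probe step, slot count and names are illustrative assumptions rather than the actual NodeMCU code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical packed bucket following the 12:2:18 split above:
 * 12-bit constant index, 2-bit source region (DRAM0/LFS0/LFS1),
 * 18-bit word-aligned TString offset, all in one 32-bit word. */
typedef uint32_t packed_entry;

static packed_entry pack(unsigned ndx, unsigned src, unsigned ofs) {
  assert(ndx < (1u << 12) && src < (1u << 2) && ofs < (1u << 18));
  return (packed_entry)(ndx << 20 | src << 18 | ofs);
}

static unsigned entry_ndx(packed_entry e) { return e >> 20; }
static unsigned entry_src(packed_entry e) { return (e >> 18) & 0x3u; }
static unsigned entry_ofs(packed_entry e) { return e & 0x3FFFFu; }

/* Quadratic (triangular-number) probe over a 2K-slot vector; with a
 * power-of-two size this sequence visits every slot. */
enum { NSLOTS = 2048 };
static unsigned probe(unsigned hash, unsigned i) {
  return (hash + i * (i + 1) / 2) & (NSLOTS - 1);
}
```

The single-word packing halves the vector size relative to the `{int, TString *}` form at the cost of the pack/unpack helpers.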

TerryE commented 4 years ago

BTW guys (and girls, if @sonaux is tracking this -- why is IoT such a guy thing? :cry:), I have spent quite a few months brooding about implementation strategies to improve the functionality vs simplicity-of-implementation trade-off. This approach is about as good as we can get IMO.

TerryE commented 4 years ago

One small implementation issue that I need to address is that in standard Lua the dump and undump processes are strictly serial, and hence Lua uses a lua_Writer abstraction to manage output (with an exactly parallel lua_Reader):

typedef int (*lua_Writer) (lua_State *L, const void* p, size_t sz, void* ud);

The type of the writer function used by lua_dump. Every time it produces another piece of chunk, lua_dump calls the writer, passing along the buffer to be written (p), its size (sz), and the data parameter supplied to lua_dump.

The issue is that the dump produces <fixed header><one or more protos><dump of strings> but the undump wants to process <fixed header><dump of strings><one or more protos> which is trivial to implement so long as we can fseek along the input or output stream. To this end I am extending the interface:

Hence the write process emits the fixed header and the Protos first, appends the dump of strings, and then seeks back to patch the header.

The read process seeks forward to load the dump of strings first, then returns to process the Protos.

This does mean that the dump and load can only use true files and cannot process streams like stdin and stdout, but I don't view this as a functional limitation.
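As a toy illustration of this seek-extended dump (not the actual NodeMCU interface; the layout and names are assumptions): a placeholder header is written first, the proto section streamed, the string section appended, and the header then patched in place so a loader can seek straight to the strings:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy sketch of a seek-extended dump: write a placeholder header,
 * stream the proto section, append the string section, then seek
 * back and patch the header with the string-table offset. */
static int dump_with_seek(FILE *f, const char *protos, const char *strings) {
  uint32_t strtab_ofs = 0;
  if (fwrite(&strtab_ofs, 4, 1, f) != 1) return -1;  /* placeholder header */
  if (fputs(protos, f) == EOF) return -1;            /* proto section */
  strtab_ofs = (uint32_t)ftell(f);
  if (fputs(strings, f) == EOF) return -1;           /* string section */
  if (fseek(f, 0, SEEK_SET) != 0) return -1;
  if (fwrite(&strtab_ofs, 4, 1, f) != 1) return -1;  /* patch header */
  return 0;
}
```

A loader would read the 4-byte header, fseek to the string section, load the strings, and then fseek back to process the protos; this is exactly the pattern that pipes cannot support.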

Anyone see any issues with this?

TerryE commented 4 years ago

When it comes to ESP development cycles, we have to discuss the "elephant in the room":

  1. The limited amount of RAM heap (perhaps 44Kb on an ESP8266) limits the size of application that can be developed using a purely ESP-based development cycle.

  2. Moving compilation and building of LFS off-ESP and onto the host environment significantly increases the scale of applications that can be developed, but this brings with it all of the issues of building and executing the cross-compiler toolset that especially seem to trouble Windows-based developers.

I want to focus on this first point and how we understand and mitigate the various scaling issues that constrain this life cycle.

| % Occupancy | # Comparisons for hit | # Comparisons for miss |
|------------:|----------------------:|-----------------------:|
| 50%         | 1.25                  | 1.53                   |
| 75%         | 1.37                  | 1.75                   |
| 100%        | 1.50                  | 2.01                   |
| 150%        | 1.75                  | 2.47                   |
| 200%        | 2.00                  | 2.98                   |
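For reference, the hit column is consistent with the textbook estimate for a chained hash at load factor alpha (occupancy): a successful lookup walks about half a chain, costing roughly 1 + alpha/2 comparisons. A quick check of that assumption:

```c
#include <assert.h>

/* The "hit" column above matches the standard chained-hash estimate:
 * a successful lookup walks half a chain on average, i.e. about
 * 1 + alpha/2 comparisons at load factor alpha. */
static double expected_hit(double alpha) { return 1.0 + alpha / 2.0; }
```

The miss column presumably comes from a similar occupancy model; the point of the table is that probe cost degrades very gracefully even well past 100% occupancy.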

Whilst the broad usage is as in the previous post, RAM availability and fragmentation are going to be an issue, and as a community I feel that we will need to develop build processes. For everything apart from small setups, I feel that on-ESP LFS reimaging will involve a number of steps carried out immediately after restart, accessing some Lua-based "build script" and using an RTC memory counter for stepping through the build. Maybe we regard the first iteration as Alpha and subject to change, until we get a handle on how usable it is and what the practical scaling limits are.

jmattsson commented 4 years ago

I'd consider a two-pass approach entirely reasonable (and possibly preferred, depending on internals). Either of the hash size options seems acceptable at this stage. We can always revisit later if needed.

nwf commented 4 years ago

FWIW, I think "take as many passes as you need" is a perfectly reasonable approach, especially given the very limited heap space. I could even imagine this being something like flashreload("lfs-%s*.lc") turning into:

TerryE commented 4 years ago

@nwf Thanks for this useful feedback. Let me play a little ping-pong on your points.

jmattsson commented 4 years ago

Maybe a dumb question, but if we're doing multiple passes anyway, do we still need to have the string table before the protos? As in, do we need to deviate from the default dump format? The fewer deviations we have from standard Lua, the easier to upgrade in the future (as you undoubtedly know from first-hand experience by now). If I've misunderstood one of the finer points here, do feel free to just point me to a whitepaper or something you've already written :D

TerryE commented 4 years ago

Maybe a dumb question, but if we're doing multiple passes anyway, do we still need to have the string table before the protos?

In a word, Yes: in terms of processing order. The in-file order can be resolved by the odd fseek.

The current standard Lua undump process is designed to go from a (serial) data stream to randomly addressable memory. Having to use a serial API to program flash memory was never considered as a non-functional requirement (NFR) during file format design. We want to serialise the undump to LFS as much as possible, and this creates two design objectives:

The limited RAM issue also means that we can only afford to cache limited resources in RAM, so repeating my previous example: we can afford to keep the ROstrt vector of TString pointers in RAM, but not the TStrings themselves. They must have been preloaded into LFS and the cache flushed so that they can be directly addressed for resolution during load.

There is also an issue of density. Our current format is about 50% the length of the equivalent standard Lua compiled file formats. This has non-trivial benefits in saving network transfer times and file system utilisation.

Adding our NFRs and relaying out the file format actually makes the dump and undump processes simpler, and significantly less RAM intensive. The corollary here when we are RAM-constrained is that on-ESP life-cycle applications can be larger.

jmattsson commented 4 years ago

Maybe a dumb question [...]

In a word Yes.

Gotcha :D Carry on! 👍

TerryE commented 4 years ago

@jmattsson, this one might amuse you.

This might be counter-intuitive, but making the serialised (LC) dump format compatible with both writing to LFS and to RAM actually removes a shed load of now redundant code -- for example there is no longer a -f option in luac.cross as the normal LC format produced by -o is the LFS image format. There is only one lua_dump() function and one lua_load() function, etc. Fun, fun, fun!

jmattsson commented 4 years ago

Nice! Always so satisfying to be able to remove code!

TerryE commented 4 years ago

A quick overview of the LFS load algo

Unlike the standard Lua version of undump, this NodeMCU version supports an LFS mode, and so the undump function supports storing Proto hierarchies into one of two targets:

  1. RAM heap space. This mode is a single pass with the Proto structures and TStrings created directly in RAM. All GCObjects are collectable and comply with Lua GC assumptions, so the GC will collect dangling resources in the case of a thrown error.

  2. Flash programmable ROM memory. This is written serially using the flash write API. This mode supports LFS. It is a two-pass load, with the first pass being a read-only format validation and CRC check. The second pass is hooked in during startup, and errors are unlikely given pass 1. Any error will abort the pass, leaving a corrupted LFS which is detected and erased on the next boot. Any reload will then need to be manually retried.

The undump code for both modes is largely shared, excepting the top-level orchestration and the bottom-level write to RAM/flash via a supplied callback, which differs for the two modes. Mode 2 requires that writes of separate resource elements are not interleaved, so Proto record processing has been reordered to group resource writes, to cache the Proto itself in RAM, and to walk the Proto's dependents bottom-up. Other than this, Mode 1 is largely as in standard Lua, so doesn't really need further discussion.

On the other hand, Mode 2 supports multiple compiled code files each with multiple Top Level Functions (TLFs) and is able to write to the LFS region in Flash. The LFS load process has two passes:

Hence the LFS reload takes two restarts. The second is actually optional, since we could just restart the Lua environment without a CPU restart; however, the heap will have been fragmented during pass 2, so the restart is prudent.

A power-fail during pass 2 will be detected and result in a fallback startup with a blank LFS. In this case the reload will need to be manually repeated. Given that SPIFFS suffers from worse issues (it doesn't even detect power-fail), doing anything more is over-engineering. IMO.
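The two-pass shape (validate everything read-only, then write) can be sketched in miniature as follows. The signature value, the CRC-16/CCITT choice and the trailing-CRC layout are illustrative assumptions, not NodeMCU's actual image format:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the two-pass shape: pass 1 is read-only, checking a
 * signature and a trailing CRC before anything touches flash. */
#define LFS_SIG 0xC1A9u   /* illustrative signature, not the real one */

static uint16_t crc16(const uint8_t *p, size_t n) {  /* CRC-16/CCITT */
  uint16_t crc = 0xFFFF;
  while (n--) {
    crc ^= (uint16_t)(*p++ << 8);
    for (int i = 0; i < 8; i++)
      crc = (uint16_t)((crc & 0x8000) ? (crc << 1) ^ 0x1021 : crc << 1);
  }
  return crc;
}

/* Pass 1: validate an n-byte image whose last two bytes hold the CRC.
 * Returns 1 on a clean image; only then would pass 2 write flash. */
static int pass1_validate(const uint8_t *img, size_t n, unsigned sig) {
  if (n < 4 || ((unsigned)img[0] | (unsigned)img[1] << 8) != sig) return 0;
  uint16_t want = (uint16_t)(img[n - 2] | img[n - 1] << 8);
  return crc16(img, n - 2) == want;
}
```

Because pass 1 never writes, a rejected image leaves the existing LFS untouched; only a failure during the write pass itself can leave the corrupted-but-detected state described above.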

TerryE commented 4 years ago

I've got the multi-TLF dump and undump working fine. Most of the tuning is around making sure that we don't end up with RAM constraints unnecessarily limiting the size of LFS that can be manipulated and loaded on-ESP. As an example, you broadly have three strategies for sizing the ROstrt:

  1. Add up the sizes of the hash vectors in the N input files and use that. This is likely to be a slight overestimate, since some strings will occur in multiple input files. This doesn't need any big temporary hash.
  2. Use a fixed (say 4K-slot) open-address hash to count TStrings (removing duplicates). You need to pick a max size and have enough RAM left when you call node.LFS.reload(), otherwise it will throw an E:M error.
  3. Use a more complicated dynamic-resizing algo which takes a similar approach to (2) but grows the hash on demand. This is more complicated to code and in the worst case typically requires 50% more RAM at peak, so in practice it can scale less well than (2) and still throw an E:M error whilst loading larger LFS images.
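Strategy (2) can be sketched as follows. The 4K slot count, the FNV-1a hash and the use of linear rather than quadratic probing are illustrative choices, not the actual NodeMCU code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of sizing strategy (2): a fixed open-address table used only
 * to count distinct strings across the input files. */
enum { SLOTS = 4096 };
static const char *slot[SLOTS];

static uint32_t str_hash(const char *s) {             /* FNV-1a, 32-bit */
  uint32_t h = 2166136261u;
  while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
  return h;
}

/* 1 if s is new, 0 if already counted, -1 if the table is full
 * (the on-ESP analogue would be to bail out with an E:M style error). */
static int count_unique(const char *s) {
  uint32_t i = str_hash(s) & (SLOTS - 1);
  for (int n = 0; n < SLOTS; n++, i = (i + 1) & (SLOTS - 1)) {
    if (!slot[i]) { slot[i] = s; return 1; }
    if (strcmp(slot[i], s) == 0) return 0;
  }
  return -1;                                           /* table full */
}
```

The fixed table costs 16KB of RAM up front (4K pointers), which is exactly the trade-off against strategy (1)'s cheap but duplicate-inflated estimate.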

BTW the issue isn't so much RAM during pass 2, which occurs post-restart in a clean startup config, but in pass 1, which is called directly from node.LFS.reload() and which validates the load and can return control to the calling Lua app in the case of an error.

In the end, given my post a couple of weeks ago above about sizing the ROstrt, which observes that probe performance scales in a very well-behaved manner as a function of the ratio of TString count to ROstrt size, option (1) is the simplest, least RAM-intensive and most robust. At the moment I have a lot of inline whitepaper-style documentation about these decisions in the code, unlike the rest of the Lua code base, which has zero inline documentation and expects the core developer to retro-engineer this from the code itself ("it does what it is"). I am thinking about pulling all of this commentary out of the source and moving it into a whitepaper, which we can always make available for anyone interested. This is what I will do unless anyone shouts. :smile:

nwf commented 4 years ago

Honestly, I'd prefer more comments in code. I subscribe to the Donald Knuth "it's a book for humans that happens to have some code in it" approach (and literate programming was robbed of its fair comparison, but that's a separate rant; https://buttondown.email/hillelwayne/archive/donald-knuth-was-framed/ does a good job). If you also want to make a whitepaper, that's fine. :)

jmattsson commented 4 years ago

I'm definitely partial to having solid commentary in the source and have been known to leave big blocks of text explaining the intent and reasoning. The idea being that once you've read the prose you'll be in a good position to follow the code.

Sure, pure code can be elegant and easily comprehended as well, and doesn't risk the comments going stale, but for any non-trivial source I'm in favour of some in-file documentation. So, if you've already got it there, I wouldn't toss it out without good reason :) That said, I certainly also enjoy reading your in-depth whitepapers!

TerryE commented 4 years ago

This change touches quite a lot of components, so coding and testing everything is taking a bit of time. Still, it looks like you will be able to build maximal LFS images on-ESP, which is a lot better than I anticipated. The biggest constraint is the maximum compile size of a single source module, though I will extend my compilation service to support this.

One thing that I am leaving out of this PR is the dual-LFS support. I don't see this as high risk, but I want to do some performance benchmarking on one ROstrt vs. two. In the second case, the appLFS ROstrt will embed the sysLFS lookup as well. If I confirm the benchmarking issues then I will cover this in a separate post.

TerryE commented 4 years ago

Just a quick update to let those interested know that I am not 'on strike', but steadily plugging away at this. The code is basically all there, but testing and debugging the usecases is proving rather complicated. The dump and undump code has ended up being a reimplementation, and much is shared between three save modes: in RAM, in normal LFS, and in absolute LFS. This last one is a real quirk, thanks to Johny et al: in LFSA mode the pointers are 32-bit and refer to on-ESP addresses, even though the undump code will (typically) be running on a 64-bit PC OS, so size_t is 8 bytes, not 4. Oh yes, to work within the ESP memory constraints and to avoid caching too much in RAM for on-ESP loads, the dump operations are 3-pass and the LFS-variant undumps are 2-pass, so in some cases whole areas of code need to be dummied out.
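That size_t mismatch can be illustrated with a minimal sketch; the rebase helper, its name, and the flash base address here are hypothetical, not the actual undump code:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the LFSA wrinkle: the host may have 8-byte pointers, but
 * an absolute image must carry 4-byte on-ESP addresses.  Pointers into
 * the host-side image buffer are therefore rebased onto a 32-bit
 * flash address before being written out. */
static uint32_t to_esp_addr(const void *host_base, const void *p,
                            uint32_t lfs_base) {
  uintptr_t ofs = (uintptr_t)p - (uintptr_t)host_base;  /* offset in image */
  return lfs_base + (uint32_t)ofs;
}
```

Every host-side pointer that lands in the image has to pass through a rebase like this, which is why the LFSA path cannot simply reuse the native-pointer dump code.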

The last wrinkle is that all of this also needs to be Lua Test Suite compliant, so TString records get wrapped in a UTString type which is padded out to 8×size_t, and this mustn't break the code. I am now pretty much all of the way there -- at least in the luac.cross -e execution environment -- and can load, restart and execute LFS images (yes, the same ones as load on the ESP). I've got some other test cases to cover, then I'll move onto ESP testing.

So slow progress, but all still OK.

TerryE commented 4 years ago
$ cat /tmp/d.lua
debug.debug()
$ /luac.cross -e /tmp/d.lua 
LFS image corrupted.

Erasing LFS from flash addr 0x090000 to 0x0cffff
lua_debug> lfsreload{"/tmp/lfs.lc", "/tmp/ftp.lc"}

Erasing LFS from flash addr 0x090000 to 0x0cffff
LFS image loaded
lua_debug> print(lfsindex'telnet',lfsindex'ftpserver')
function: 0x558d56ac2e80    function: 0x558d56acd390
lua_debug> telnet=lfsindex'telnet'()
lua_debug> for k,v in pairs(telnet) do print (k,v) end
open    function: 0x558d56accd90
close   function: 0x558d56acd060
lua_debug> 
$

The host build doesn't have file and net, so these modules won't run successfully under a luac.cross -e, but at least they can be loaded into LFS and then referenced in code. I still have a number of other main paths to shake down: LFS Absolute mode, on-ESP saving and loading, updates to luac.cross parameters, etc., but the core of the dump / undump code seems to be working well. So still more work to do and the odd gremlin to shake out, but I am now confident that I have addressed the bulk of the technical challenges with getting on-ESP modes working.

Note that the host environment doesn't have node, and hence no node.LFS so my quick host only fix is to add these functions to the baselib.

Question for @jmattsson, @nwf, @HHHartmann, @marcelstoer, etc. Should I do the small fixes for #3193 and merge this into dev first, or just add this big tranche of functionality into this PR. TBH, if someone else was doing this, then my instinct would be to do this as two PRs: complete #3193 and then raise this as a second PR, so maybe I am answering my own Q.

jmattsson commented 4 years ago

If it's not too much extra work, I think my preference would be to do it as two PRs. A bit easier to review too.

TerryE commented 4 years ago

TBH, I started working on this one whilst I was waiting for #3193 and sort of got sucked into it. Time to stash this and draw a line under the pending PR, so let me do this over the w/e.

mk-pmb commented 4 years ago

Great work, thank you!

For the approach with multiple reboots, and connected hardware that will be confused by device reboot, how long will it usually take from first reboot until I can again run code to restore the connected hardware to a safe state? i.e. is it worth considering "restore sanity" phases after/between the reboots?

As for the 1st pass limited RAM, I think this is a problem users can easily solve, by rebooting their application into a minimal-but-safe mode and using that one for the LFS update.

On comments in code vs. whitepapers, maybe we can have the whitepapers include the relevant parts from the source. I'd be willing to help with tooling for the rendering.

TerryE commented 4 years ago

how long will it usually take from first reboot until I can again run code to restore the connected hardware to a safe state?

Of the order of 1 sec.

is it worth considering "restore sanity" phases after/between the reboots?

The LFS load is two pass. The 1st pass is a validation and sizing pass, with nothing written to the LFS. Errors are returned as a return string. We can have a high degree of confidence that the second live pass will work.

I am happy to work with you if you want to actively contribute to the project. My email address is in the git logs.

mk-pmb commented 4 years ago

Of the order of 1 sec.

My gut feeling says that this long a downtime might be long enough to require extra safety measures on some kinds of devices connected, but I think we can postpone these aspects until someone describes an actual case where a second of wrong I/O state can cause damage. Even then, it's not a degradation compared to the old feature set, just a case of "it could be even better".

nwf commented 4 years ago

IMHO, if that kind of I/O safety matters, you should be using an I/O expander or other additional peripherals you completely control (possibly just to gate onboard peripherals; e.g., route the UARTs through AND gates). The ESP firmware has its own mind about GPIO things and it is not, AFAIK, promised to be stable across releases.

mk-pmb commented 4 years ago

Yeah. It's a balance between what states the external device can survive for how long, and how much safety equipment you want to afford plugging in between, because that extra equipment will take up extra space, potentially also electricity and probably extra effort in code.

TerryE commented 4 years ago

Let's put this into perspective: if you are using something like an Arduino, then an update to application code will take seconds. As Nathaniel says, if you want truly bumpless control then you are looking at redundancy and safety features in your H/W design. You won't get this out of the box using an IoT module costing a couple of $.

TerryE commented 4 years ago

I have just pushed the first tranche in #3272. The ldump/lundump C files are pretty much a complete rewrite, and the lnodemcu C file has had major rework. Why? The previous LFS code and formats were different to the standard dump formats, so it was feasible to attempt a minimum change as far as the dump and undump code was concerned. In general I would describe the core Lua design / coding strategy as: minimal, simple and orthogonal. There is almost no attempt at peephole optimisation within the source; rather, the coding style relies on the C compiler optimiser to do this. The fact that the Lua runtime's size and performance exceeds that of PHP7 underlines the effectiveness of this strategy in my mind.

However, in the case of NodeMCU, we have some extra functional and non-functional drivers:

This totality really means that the dump / undump implementation has to be new rather than incremental, though it does mirror the best concepts of the original code. Whilst I have embraced the minimal, simple and orthogonal principles, in one respect I do differ from the Ierusalimschy camp: I don't believe in a zero-comment style. I have included heavy inline commenting to explain the impacts of these constraints on the implementation.

By way of an example that I picked up in testing: I have to pass a bunch of string parameters from Pass 1 to Pass 2 of the undump, and these will eventually become TStrings in the LFS, so for hand-over the TString headers are all 0xFF followed by the CStrings; that way the CStrings can be used and the TString headers updated once the ROstrt size is known. Some of these strings can be copied from the old LFS string table, but this gets erased before the strings can be written, so any LFS CStrings need to be duplicated into RAM before the LFS can be erased. I also initially planned to force a rapid reboot by calling system_restart() followed by throwing an error, but this would die horribly if the Lua code that called node.LFS.restart did so within a pcall and was running out of the old LFS. Lots of test runs.
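A toy version of that 0xFF-header staging trick, where the 8-byte header size and the patched field are stand-ins rather than the real TString layout:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stage each string with its header bytes set to 0xFF so the C string
 * is usable immediately; patch the header in place once the final
 * value (here, just a hash) is known. */
enum { HDR_SIZE = 8 };   /* stand-in for sizeof(TString header) */

static void stage_string(uint8_t *buf, const char *s) {
  memset(buf, 0xFF, HDR_SIZE);            /* header patched later */
  strcpy((char *)buf + HDR_SIZE, s);
}

static void patch_header(uint8_t *buf, uint32_t hash) {
  memcpy(buf, &hash, sizeof hash);        /* fill in the real header */
}
```

The key property is that the C string payload never moves: between pass 1 and pass 2 only the header bytes in front of it are rewritten.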

As you will notice, this version doesn't include the dual-LFS nor the host absolute-LFS variants, but these requirements have been reflected in the architecture and they are modest incremental additions.

mk-pmb commented 4 years ago

Sounds good to me. :-)

mk-pmb commented 4 years ago

Once we have two LFS regions, will it be possible to write one of them incrementally from a download handler without caching the image in SPIFFS first?

TerryE commented 4 years ago

Good Q, but no.

TerryE commented 4 years ago

@nwf @HHHartmann @jmattsson and anyone else that wants to comment:

Just to flag a mild compatibility break with standard Lua (and one that never worked properly on NodeMCU anyway). Standard luac allows "-" as an input or output filename, which defaults to stdin / stdout respectively on POSIX builds, hence you can pipe in and out of luac. Because our load / unload code is common to both host and ESP and is multi-pass so as to work on the ESP within RAM constraints, allowing stdin / stdout as a valid input or output file is a total PITA, since you can't rewind pipes.

So for the avoidance of doubt I am going to say that we only support file-based source and compile output for the cross compiler and treat the filename "-" as invalid.

Anyone who wants to add this back in can add this extra tier of complexity and is free to develop a patch to spool to / from a temporary file after this PR is merged.
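For anyone who does take that on, the spooling workaround amounts to copying the non-seekable stream into a seekable temporary file first; a minimal sketch:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy a non-seekable stream (e.g. a pipe) into a tmpfile so a
 * multi-pass loader can fseek over it.  Returns the rewound temp
 * file, or NULL on error. */
static FILE *spool(FILE *in) {
  FILE *t = tmpfile();
  if (!t) return NULL;
  char buf[4096];
  size_t n;
  while ((n = fread(buf, 1, sizeof buf, in)) > 0)
    if (fwrite(buf, 1, n, t) != n) { fclose(t); return NULL; }
  rewind(t);
  return t;
}
```

With this in front of the loader, "-" could be re-enabled at the cost of temporary file-system (or host disk) space equal to the stream size.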

TerryE commented 4 years ago

Also note that there are other incompatibilities: standard luac always outputs a single compiled top level function. It does this by creating a dummy wrapper function if you specify multiple input sources; this compiles but isn't usable in practice as there is no execution path to access or to bind the individual Protos as closures. We support an output format which allows multiple top level functions (TLFs), so we just output a multi-TLF file.

Also note that there is nothing to stop you "compiling" one or more existing LC files. In our case, these are just aggregated into a compiled LC file (which can be used as an LFS image).

TerryE commented 4 years ago

In terms of the options:

Note that the option -m is no longer supported. This is because an LFS image can be created from multiple LC files. If you want to validate the size of a specific configuration, then create a (temporary) absolute LFS image using the -a option; the size of this file is the size of the corresponding LFS image.

Also note that as well as Lua source, LC files are accepted as input, so

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

mk-pmb commented 3 years ago

@stale I want this.

mk-pmb commented 2 years ago

@Stale not helpful.

HHHartmann commented 2 years ago

At least the two regions part would be nice