nodemcu / nodemcu-firmware

Lua based interactive firmware for ESP8266, ESP8285 and ESP32
https://nodemcu.readthedocs.io
MIT License
7.64k stars 3.12k forks

Support for two LFS regions and on-ESP LFS building #3206

Open TerryE opened 4 years ago

TerryE commented 4 years ago

New feature

Additional support for two LFS regions, and the ability to both save and load (update) LFS regions on ESP.

Justification

Most committers and many developers would seem to want this.

Highlights

This enhancement will be for Lua 5.3 only as this builds upon the groundwork that I've already laid in the Lua53 implementation.

Example usecases

-- Create a (child) LFS based on all LC files in SPIFFS
do
  local f,a = files.list('%.lc$'),{}
  for k in pairs(f) do a[#a+1] = k end
  node.LFS.reload(a)
end
-- Create a new (child) LFS replacing one specific function from SPIFFS
do
  local v = node.LFS.list('application')
  local a = loadfile {'mysub.lua'} -- use array form to return an array
  for _,n in ipairs(v) do
    if not a[n] then a[n] = node.LFS.get(n) end
  end
  v = {}
  for n,f in pairs(a) do v[#v+1] = f end
  node.dumpfile(v, 'lfs.img')
  node.LFS.reload('lfs.img')
end

How the NodeMCU binary format differs from standard Lua 5.3

In general terms, the Lua RTS dump function deterministically traverses a Proto hierarchy, converting all fields to a stream of binary tokens, and this stream is the compiled file format. The load executes an "undump" which does the inverse traversal, recreating the Proto hierarchies. This much is the same. But as to why the differences:

Technical Issues

Cache coherence.

I currently flush the ICACHE with a botch: reading a sequential 32Kb address window in flash. @jmattsson: Q: do you know a better way?

jmattsson commented 4 years ago

A Cache_Read_Disable()/Cache_Read_Enable() pair should do the necessary register twiddling to reload the cache window into the SPI flash. IIRC it's a single control register in DPORT0, but I don't seem to have a handy definition document sitting around anywhere I can find it right now. Maybe I lost that in the disk crash the other year? Anyway, just make absolutely sure that the call to Cache_Read_Enable() is either present in the instruction cache (i.e. not in what would be the next cacheline) or that you put the call pair explicitly into IRAM :)

HHHartmann commented 4 years ago

Wow, that is quite a writeup of all your preparation of this and the discussions about it.

I stumbled across this example:

Example usecases

-- Create a (child) LFS based on all LC files in SPIFFS
do
  local f,a = files.list('%.lc$'),{}
  for k in pairs(f) do a[#a+1] = k end
  node.LFS.reload(a)
end

Does the node.LFS.reload(a) really work incrementally, and how do I then reset it? Or is it connected to the content of the file, multi-function vs. single function?

TerryE commented 4 years ago

A Cache_Read_Disable()/Cache_Read_Enable() pair should do the necessary register twiddling to reload the cache window into the SPI flash.

@jmattsson, thanks. Yup, the SDK SPI routines use Cache_Read_Enable_2() and Cache_Read_Disable_2(); what these do is temporarily turn off the cache without flushing it, which clearly gives a lot faster performance, but the kicker is that you can lose cache coherence if you are doing an SPI write to a mapped address range. I also understand that the disable/enable pair needs to be executed from IRAM0 and that the enable must have the args (0, 0, 1) for our configuration, that is, a 32Kb ICACHE mapping 1Mb starting at 0x000000. DiUS Ltd will need to tweak this with their OTA fork.

jmattsson commented 4 years ago

Thanks for the heads up. For this particular purpose you might want to deliberately use the ROM'd Cache_Read_Disable()/...Enable() rather than the "improved" SDK versions.

Oh, and I found the reference to the control register: https://github.com/esp8266/esp8266-wiki/wiki/Memory-Map It's got both the flash cache and the iram cache info there, if you prefer to hit it directly.

TerryE commented 4 years ago

Does the node.LFS.reload(a) really work incrementally, and how do I then reset it? Or is it connected to the content of the file, multi-function vs. single function?

@HHHartmann. Future tense: it will work this way, post the PR. That's because there will only be one dump format. A luac.cross image file is just an LC file with all of the modules in it.

This implementation is continually trying to achieve as much scaling as possible whilst working within the ~44Kb heap limits available on a clean restart.

The dump process is pretty lightweight in that it is walking Proto hierarchies in RAM and LFS and serially dumping them to file. The kicker is that I need to collect all of the strings used in the dump and then append them to the dump file.

How much of a realistic limitation is a max of 512 string constants per dump? I doubt that many developers would hit this, and the workaround is to split the dump into multiple files.

A slightly more complex alternative (a bit geeky if you weren't into this sort of thing) would be to allocate a custom fixed hash vector, say 2K × 4-byte, with a quadratic hash and a packed 12:2:18 structure (ndx, source, iTString). You are storing {index, TString} pairs in each bucket; this could just be an {int ndx; TString *} pair, but it could pack down into a single word given that the max index is, say, 4K and the TString is a word-aligned offset from DRAM0, LFS0 or LFS1. The `{int, TString}` version is just easier to code. The packed version would execute as fast and give a 2K string-constant limit, but would need a few dozen lines of extra code to implement.
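As a rough illustration, such a packed bucket might look like this in C. The field layout follows the 12:2:18 split described above, while the probe step, slot count and names are illustrative assumptions rather than the actual NodeMCU code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical packed bucket following the 12:2:18 split above:
 * 12-bit constant index, 2-bit source region (DRAM0/LFS0/LFS1),
 * 18-bit word-aligned TString offset, all in one 32-bit word. */
typedef uint32_t packed_entry;

static packed_entry pack(unsigned ndx, unsigned src, unsigned ofs) {
  assert(ndx < (1u << 12) && src < (1u << 2) && ofs < (1u << 18));
  return (packed_entry)(ndx << 20 | src << 18 | ofs);
}

static unsigned entry_ndx(packed_entry e) { return e >> 20; }
static unsigned entry_src(packed_entry e) { return (e >> 18) & 0x3u; }
static unsigned entry_ofs(packed_entry e) { return e & 0x3FFFFu; }

/* Quadratic (triangular-number) probe over a 2K-slot vector; with a
 * power-of-two size this sequence visits every slot. */
enum { NSLOTS = 2048 };
static unsigned probe(unsigned hash, unsigned i) {
  return (hash + i * (i + 1) / 2) & (NSLOTS - 1);
}
```

The single-word packing halves the vector size relative to the `{int, TString *}` form at the cost of the pack/unpack helpers.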

TerryE commented 4 years ago

BTW guys (and girls, if @sonaux is tracking this -- why is IoT such a guy thing? :cry:), I have spent quite a few months brooding about implementation strategies to improve the functionality vs simplicity-of-implementation trade-off. This approach is about as good as we can get IMO.

TerryE commented 4 years ago

One small implementation issue that I need to address is that in standard Lua the dump and undump processes are strictly serial, and hence Lua uses a lua_Writer abstraction to manage output (with an exactly parallel lua_Reader):

typedef int (*lua_Writer) (lua_State *L, const void* p, size_t sz, void* ud);

The type of the writer function used by lua_dump. Every time it produces another piece of chunk, lua_dump calls the writer, passing along the buffer to be written (p), its size (sz), and the data parameter supplied to lua_dump.

The issue is that the dump produces <fixed header><one or more protos><dump of strings> but the undump wants to process <fixed header><dump of strings><one or more protos> which is trivial to implement so long as we can fseek along the input or output stream. To this end I am extending the interface:

Hence the write process emits the fixed header and the Protos first, appends the dump of strings, and then seeks back to patch the header.

The read process seeks forward to load the dump of strings first, then returns to process the Protos.

This does mean that the dump and load can only use true files and cannot process streams like stdin and stdout, but I don't view this as a functional limitation.
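As a toy illustration of this seek-extended dump (not the actual NodeMCU interface; the layout and names are assumptions): a placeholder header is written first, the proto section streamed, the string section appended, and the header then patched in place so a loader can seek straight to the strings:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy sketch of a seek-extended dump: write a placeholder header,
 * stream the proto section, append the string section, then seek
 * back and patch the header with the string-table offset. */
static int dump_with_seek(FILE *f, const char *protos, const char *strings) {
  uint32_t strtab_ofs = 0;
  if (fwrite(&strtab_ofs, 4, 1, f) != 1) return -1;  /* placeholder header */
  if (fputs(protos, f) == EOF) return -1;            /* proto section */
  strtab_ofs = (uint32_t)ftell(f);
  if (fputs(strings, f) == EOF) return -1;           /* string section */
  if (fseek(f, 0, SEEK_SET) != 0) return -1;
  if (fwrite(&strtab_ofs, 4, 1, f) != 1) return -1;  /* patch header */
  return 0;
}
```

A loader would read the 4-byte header, fseek to the string section, load the strings, and then fseek back to process the protos; this is exactly the pattern that pipes cannot support.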

Anyone see any issues with this?

TerryE commented 4 years ago

When it comes to ESP development cycles, we have to discuss the "elephant in the room":

  1. The limited amount of RAM heap (perhaps 44Kb on an ESP8266) limits the size of application that can be developed using a purely ESP-based development cycle.

  2. Moving compilation and building of LFS off-ESP and onto the host environment significantly increases the scale of applications that can be developed, but this brings with it all of the issues of building and executing the cross-compiler toolset that especially seem to trouble Windows-based developers.

I want to focus on this first point and how we understand and mitigate the various scaling issues that constrain this life cycle.

| % Occupancy | # Comparisons for hit | # Comparisons for miss |
|------------:|----------------------:|-----------------------:|
| 50%         | 1.25                  | 1.53                   |
| 75%         | 1.37                  | 1.75                   |
| 100%        | 1.50                  | 2.01                   |
| 150%        | 1.75                  | 2.47                   |
| 200%        | 2.00                  | 2.98                   |
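For reference, the hit column is consistent with the textbook estimate for a chained hash at load factor alpha (occupancy): a successful lookup walks about half a chain, costing roughly 1 + alpha/2 comparisons. A quick check of that assumption:

```c
#include <assert.h>

/* The "hit" column above matches the standard chained-hash estimate:
 * a successful lookup walks half a chain on average, i.e. about
 * 1 + alpha/2 comparisons at load factor alpha. */
static double expected_hit(double alpha) { return 1.0 + alpha / 2.0; }
```

The miss column presumably comes from a similar occupancy model; the point of the table is that probe cost degrades very gracefully even well past 100% occupancy.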

Whilst the broad usage is as in the previous post, RAM availability and fragmentation are going to be an issue, and as a community I feel that we will need to develop build processes. For everything apart from small setups, I feel that on-ESP LFS reimaging will involve a number of steps carried out immediately after restart, accessing some Lua-based "build script" and using an RTC memory counter for stepping through the build. Maybe we regard the first iteration as Alpha and subject to change, until we get a handle on how usable it is and what the practical scaling limits are.

jmattsson commented 4 years ago

I'd consider a two-pass approach entirely reasonable (and possibly preferred, depending on internals). Either of the hash size options seems acceptable at this stage. We can always revisit later if needed.

nwf commented 4 years ago

FWIW, I think "take as many passes as you need" is a perfectly reasonable approach, especially given the very limited heap space. I could even imagine this being something like flashreload("lfs-%s*.lc") turning into:

TerryE commented 4 years ago

@nwf Thanks for this useful feedback. Let me play a little ping-pong on your points.

jmattsson commented 4 years ago

Maybe a dumb question, but if we're doing multiple passes anyway, do we still need to have the string table before the protos? As in, do we need to deviate from the default dump format? The fewer deviations we have from standard Lua, the easier to upgrade in the future (as you undoubtedly know from first-hand experience by now). If I've misunderstood one of the finer points here, do feel free to just point me to a whitepaper or something you've already written :D

TerryE commented 4 years ago

Maybe a dumb question, but if we're doing multiple passes anyway, do we still need to have the string table before the protos?

In a word, Yes: in terms of processing order. The in-file order can be resolved by the odd fseek.

The current standard Lua undump process is designed to go from a (serial) data stream to randomly addressable memory. Having to use a serial API to program flash memory was never considered as a non-functional requirement (NFR) during file format design. We want to serialise the undump to LFS as much as possible, and this creates two design objectives:

The limited RAM issue also means that we can only afford to cache limited resources in RAM, so repeating my previous example: we can afford to keep the ROstrt vector of TString pointers in RAM, but not the TStrings themselves. They must have been preloaded into LFS and the cache flushed so that they can be directly addressed for resolution during load.

There is also an issue of density. Our current format is about 50% the length of the equivalent standard Lua compiled file formats. This has non-trivial benefits in saving network transfer times and file system utilisation.

Adding our NFRs and relaying out the file format actually makes the dump and undump processes simpler, and significantly less RAM intensive. The corollary here when we are RAM-constrained is that on-ESP life-cycle applications can be larger.

jmattsson commented 4 years ago

Maybe a dumb question [...]

In a word Yes.

Gotcha :D Carry on! 👍

TerryE commented 4 years ago

@jmattsson, this one might amuse you.

This might be counter-intuitive, but making the serialised (LC) dump format compatible with both writing to LFS and to RAM actually removes a shed load of now redundant code -- for example there is no longer a -f option in luac.cross as the normal LC format produced by -o is the LFS image format. There is only one lua_dump() function and one lua_load() function, etc. Fun, fun, fun!

jmattsson commented 4 years ago

Nice! Always so satisfying to be able to remove code!

TerryE commented 4 years ago

A quick overview of the LFS load algo

Unlike the standard Lua version of undump, this NodeMCU version supports an LFS mode, and so the undump function supports storing Proto hierarchies into one of two targets:

  1. RAM heap space. This mode is a single pass with the Proto structures and TStrings created directly in RAM. All GCObjects are collectable and comply with Lua GC assumptions, so the GC will collect dangling resources in the case of a thrown error.

  2. Flash programmable ROM memory. This is written serially using the flash write API. This mode supports LFS. It is a two-pass load, with the first pass being a read-only format validation and CRC check. The second pass is hooked in during startup, and errors are unlikely given pass 1. Any error will abort the pass, leaving a corrupted LFS which is detected and erased on the next boot. Any reload will then need to be manually retried.

The undump code for both modes is largely shared, excepting the top-level orchestration and the bottom-level write to RAM/flash via a supplied callback, which differs for the two modes. Mode 2 requires that writes of separate resource elements are not interleaved, so Proto record processing has been reordered to group resource writes, to cache the Proto itself in RAM, and to walk the Proto's dependents bottom-up. Other than this, Mode 1 is largely as in standard Lua, so doesn't really need further discussion.

On the other hand, Mode 2 supports multiple compiled code files each with multiple Top Level Functions (TLFs) and is able to write to the LFS region in Flash. The LFS load process has two passes:

Hence the LFS reload takes two restarts. The second is actually optional, since we could just restart the Lua environment without a CPU restart; however, the heap will have been fragmented during pass 2, so the restart is prudent.

A power-fail during pass 2 will be detected and result in a fallback startup with a blank LFS. In this case the reload will need to be manually repeated. Given that SPIFFS suffers from worse issues (it doesn't even detect power-fail), doing anything more is over-engineering. IMO.
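The two-pass shape (validate everything read-only, then write) can be sketched in miniature as follows. The signature value, the CRC-16/CCITT choice and the trailing-CRC layout are illustrative assumptions, not NodeMCU's actual image format:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the two-pass shape: pass 1 is read-only, checking a
 * signature and a trailing CRC before anything touches flash. */
#define LFS_SIG 0xC1A9u   /* illustrative signature, not the real one */

static uint16_t crc16(const uint8_t *p, size_t n) {  /* CRC-16/CCITT */
  uint16_t crc = 0xFFFF;
  while (n--) {
    crc ^= (uint16_t)(*p++ << 8);
    for (int i = 0; i < 8; i++)
      crc = (uint16_t)((crc & 0x8000) ? (crc << 1) ^ 0x1021 : crc << 1);
  }
  return crc;
}

/* Pass 1: validate an n-byte image whose last two bytes hold the CRC.
 * Returns 1 on a clean image; only then would pass 2 write flash. */
static int pass1_validate(const uint8_t *img, size_t n, unsigned sig) {
  if (n < 4 || ((unsigned)img[0] | (unsigned)img[1] << 8) != sig) return 0;
  uint16_t want = (uint16_t)(img[n - 2] | img[n - 1] << 8);
  return crc16(img, n - 2) == want;
}
```

Because pass 1 never writes, a rejected image leaves the existing LFS untouched; only a failure during the write pass itself can leave the corrupted-but-detected state described above.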

TerryE commented 4 years ago

I've got the multi-TLF dump and undump working fine. Most of the tuning is around making sure that we don't end up with RAM constraints unnecessarily limiting the size of LFS that can be manipulated and loaded on-ESP. As an example, you broadly have three strategies for sizing the ROstrt:

  1. Add up the sizes of the hash vectors in the N input files and use that. This is likely to be a slight overestimate, since some strings will occur in multiple input files. This doesn't need any big temporary hash.
  2. Use a fixed (say 4K-slot) open-address hash to count TStrings (removing duplicates). You need to pick a max size and have enough RAM left when you call node.LFS.reload(), otherwise it will throw an E:M error.
  3. Use a more complicated dynamic-resizing algo which takes a similar approach to (2) but grows the hash on demand. This is more complicated to code and in the worst case typically requires 50% more RAM at peak, so in practice it can scale less well than (2) and still throw an E:M error whilst loading larger LFS images.
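Strategy (2) can be sketched as follows. The 4K slot count, the FNV-1a hash and the use of linear rather than quadratic probing are illustrative choices, not the actual NodeMCU code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of sizing strategy (2): a fixed open-address table used only
 * to count distinct strings across the input files. */
enum { SLOTS = 4096 };
static const char *slot[SLOTS];

static uint32_t str_hash(const char *s) {             /* FNV-1a, 32-bit */
  uint32_t h = 2166136261u;
  while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
  return h;
}

/* 1 if s is new, 0 if already counted, -1 if the table is full
 * (the on-ESP analogue would be to bail out with an E:M style error). */
static int count_unique(const char *s) {
  uint32_t i = str_hash(s) & (SLOTS - 1);
  for (int n = 0; n < SLOTS; n++, i = (i + 1) & (SLOTS - 1)) {
    if (!slot[i]) { slot[i] = s; return 1; }
    if (strcmp(slot[i], s) == 0) return 0;
  }
  return -1;                                           /* table full */
}
```

The fixed table costs 16KB of RAM up front (4K pointers), which is exactly the trade-off against strategy (1)'s cheap but duplicate-inflated estimate.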

BTW the issue isn't so much RAM during pass 2, which occurs post-restart in a clean startup config, but in pass 1, which is called directly from node.LFS.reload() and which validates the load and can return control to the calling Lua app in the case of an error.

In the end, given my post a couple of weeks ago above about sizing the ROstrt, which observes that probe performance scales in a very well-behaved manner as a function of the ratio of TString count to ROstrt size, option (1) is the simplest, least RAM-intensive and most robust. At the moment I have a lot of inline whitepaper-style documentation about these decisions in the code, unlike the rest of the Lua code base, which has zero inline documentation and expects the core developer to retro-engineer this from the code itself ("it does what it is"). I am thinking about pulling all of this commentary out of the source and moving it into a whitepaper, which we can always make available for anyone interested. This is what I will do unless anyone shouts. :smile:

nwf commented 4 years ago

Honestly, I'd prefer more comments in code. I subscribe to the Donald Knuth "it's a book for humans that happens to have some code in it" approach (and literate programming was robbed of its fair comparison, but that's a separate rant; https://buttondown.email/hillelwayne/archive/donald-knuth-was-framed/ does a good job). If you also want to make a whitepaper, that's fine. :)

jmattsson commented 4 years ago

I'm definitely partial to having solid commentary in the source and have been known to leave big blocks of text explaining the intent and reasoning. The idea being that once you've read the prose you'll be in a good position to follow the code.

Sure, pure code can be elegant and easily comprehended as well, and doesn't risk the comments going stale, but for any non-trivial source I'm in favour of some in-file documentation. So, if you've already got it there, I wouldn't toss it out without good reason :) That said, I certainly also enjoy reading your in-depth whitepapers!

TerryE commented 4 years ago

This change touches quite a lot of components, so coding and testing everything is taking a bit of time. Still, it looks like you will be able to build maximal LFS images on-ESP, which is a lot better than I anticipated. The biggest constraint is the maximum compile size of a single source module, though I will extend my compilation service to support this.

One thing that I am leaving out of this PR is the dual-LFS support. I don't see this as high risk, but I want to do some performance benchmarking on one ROstrt vs. two. In the second case, the appLFS ROstrt will embed the sysLFS lookup as well. If I confirm the benchmarking issues then I will cover this in a separate post.

TerryE commented 4 years ago

Just a quick update to let those interested know that I am not 'on strike', but steadily plugging away at this. The code is basically all there, but testing and debugging the usecases is proving rather complicated. The dump and undump code has ended up being a reimplementation, and much is shared between three save modes: in RAM, in normal LFS, and in absolute LFS. This last one is a real quirk, thanks to Johny et al: in LFSA mode the pointers are 32-bit and refer to on-ESP addresses, even though the undump code will (typically) be running on a 64-bit PC OS, so size_t is 8 bytes, not 4. Oh yes, to work within the ESP memory constraints and to avoid caching too much in RAM for on-ESP loads, the dump operations are 3-pass and the LFS-variant undumps are 2-pass, so in some cases whole areas of code need to be dummied out.
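That size_t mismatch can be illustrated with a minimal sketch; the rebase helper, its name, and the flash base address here are hypothetical, not the actual undump code:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the LFSA wrinkle: the host may have 8-byte pointers, but
 * an absolute image must carry 4-byte on-ESP addresses.  Pointers into
 * the host-side image buffer are therefore rebased onto a 32-bit
 * flash address before being written out. */
static uint32_t to_esp_addr(const void *host_base, const void *p,
                            uint32_t lfs_base) {
  uintptr_t ofs = (uintptr_t)p - (uintptr_t)host_base;  /* offset in image */
  return lfs_base + (uint32_t)ofs;
}
```

Every host-side pointer that lands in the image has to pass through a rebase like this, which is why the LFSA path cannot simply reuse the native-pointer dump code.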

The last wrinkle is that all of this also needs to be Lua Test Suite compliant, so TString records get wrapped in a UTString type which is padded out to 8×size_t, and this mustn't break the code. I am now pretty much all of the way there -- at least in the luac.cross -e execution environment -- and can load, restart and execute LFS images (yes, the same ones as load on the ESP). I've got some other test cases to cover, then I'll move onto ESP testing.

So slow progress, but all still OK.

TerryE commented 4 years ago
$ cat /tmp/d.lua
debug.debug()
$ /luac.cross -e /tmp/d.lua 
LFS image corrupted.

Erasing LFS from flash addr 0x090000 to 0x0cffff
lua_debug> lfsreload{"/tmp/lfs.lc", "/tmp/ftp.lc"}

Erasing LFS from flash addr 0x090000 to 0x0cffff
LFS image loaded
lua_debug> print(lfsindex'telnet',lfsindex'ftpserver')
function: 0x558d56ac2e80    function: 0x558d56acd390
lua_debug> telnet=lfsindex'telnet'()
lua_debug> for k,v in pairs(telnet) do print (k,v) end
open    function: 0x558d56accd90
close   function: 0x558d56acd060
lua_debug> 
$

The host build doesn't have file and net, so these modules won't run successfully under a luac.cross -e, but at least they can be loaded into LFS and then referenced in code. I still have a number of other main paths to shake down: LFS Absolute mode, on-ESP saving and loading, updates to luac.cross parameters, etc., but the core of the dump / undump code seems to be working well. So still more work to do and the odd gremlin to shake out, but I am now confident that I have addressed the bulk of the technical challenges with getting on-ESP modes working.

Note that the host environment doesn't have node, and hence no node.LFS so my quick host only fix is to add these functions to the baselib.

Question for @jmattsson, @nwf, @HHHartmann, @marcelstoer, etc. Should I do the small fixes for #3193 and merge this into dev first, or just add this big tranche of functionality into this PR. TBH, if someone else was doing this, then my instinct would be to do this as two PRs: complete #3193 and then raise this as a second PR, so maybe I am answering my own Q.

jmattsson commented 4 years ago

If it's not too much extra work, I think my preference would be to do it as two PRs. A bit easier to review too.

TerryE commented 4 years ago

TBH, I started working on this one whilst I was waiting for #3193 and sort of got sucked into it. Time to stash this and draw a line under the pending PR, so let me do this over the w/e.

mk-pmb commented 4 years ago

Great work, thank you!

For the approach with multiple reboots, and connected hardware that will be confused by device reboot, how long will it usually take from first reboot until I can again run code to restore the connected hardware to a safe state? i.e. is it worth considering "restore sanity" phases after/between the reboots?

As for the 1st pass limited RAM, I think this is a problem users can easily solve, by rebooting their application into a minimal-but-safe mode and using that one for the LFS update.

On comments in code vs. whitepapers, maybe we can have the whitepapers include the relevant parts from the source. I'd be willing to help with tooling for the rendering.

TerryE commented 4 years ago

how long will it usually take from first reboot until I can again run code to restore the connected hardware to a safe state?

Of the order of 1 sec.

is it worth considering "restore sanity" phases after/between the reboots?

The LFS load is two pass. The 1st pass is a validation and sizing pass, with nothing written to the LFS. Errors are returned as a return string. We can have a high degree of confidence that the second live pass will work.

I am happy to work with you if you want to actively contribute to the project. My email address is in the git logs.

mk-pmb commented 4 years ago

Of the order of 1 sec.

My gut feeling says that this long a downtime might be long enough to require extra safety measures on some kinds of devices connected, but I think we can postpone these aspects until someone describes an actual case where a second of wrong I/O state can cause damage. Even then, it's not a degradation compared to the old feature set, just a case of "it could be even better".

nwf commented 4 years ago

IMHO, if that kind of I/O safety matters, you should be using an I/O expander or other additional peripherals you completely control (possibly just to gate onboard peripherals; e.g., route the UARTs through AND gates). The ESP firmware has its own mind about GPIO things and it is not, AFAIK, promised to be stable across releases.

mk-pmb commented 4 years ago

Yeah. It's a balance between what states the external device can survive for how long, and how much safety equipment you want to afford plugging in between, because that extra equipment will take up extra space, potentially also electricity and probably extra effort in code.

TerryE commented 4 years ago

Let's put this into perspective: if you are using something like an Arduino, then an update to application code will take seconds. As Nathaniel says, if you want truly bumpless control then you are looking at redundancy and safety features in your H/W design. You won't get this out of the box using an IoT module costing a couple of $.

TerryE commented 4 years ago

I have just pushed the first tranche in #3272. The ldump/lundump C files are pretty much a complete rewrite, and the lnodemcu C file has had major rework. Why? The previous LFS code and formats were different to the standard dump formats, so it was feasible to attempt a minimum change as far as the dump and undump code was concerned. In general I would describe the core Lua design / coding strategy as: minimal, simple and orthogonal. There is almost no attempt at peephole optimisation within the source; rather, the coding style relies on the C compiler optimiser to do this. The fact that the Lua runtime's size and performance exceeds that of PHP7 underlines the effectiveness of this strategy in my mind.

However, in the case of NodeMCU, we have some extra functional and non-functional drivers:

This totality really means that the dump / undump implementation has to be new rather than incremental, though it does mirror the best concepts of the original code. Whilst I have embraced the minimal, simple and orthogonal principles, in one respect I do differ from the Ierusalimschy camp: I don't believe in a zero-comment style. I have included heavy inline commenting to explain the impacts of these constraints on the implementation.

By way of an example that I picked up in testing: I have to pass a bunch of string parameters from Pass 1 to Pass 2 of the undump, and these will eventually become TStrings in the LFS, so for hand-over the TString headers are all 0xFF followed by the CStrings; that way the CStrings can be used and the TString headers updated once the ROstrt size is known. Some of these strings can be copied from the old LFS string table, but this gets erased before the strings can be written, so any LFS CStrings need to be duplicated into RAM before the LFS can be erased. I also initially planned to force a rapid reboot by calling system_restart() followed by throwing an error, but this would die horribly if the Lua code that called node.LFS.restart did so within a pcall and was running out of the old LFS. Lots of test runs.
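A toy version of that 0xFF-header staging trick, where the 8-byte header size and the patched field are stand-ins rather than the real TString layout:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stage each string with its header bytes set to 0xFF so the C string
 * is usable immediately; patch the header in place once the final
 * value (here, just a hash) is known. */
enum { HDR_SIZE = 8 };   /* stand-in for sizeof(TString header) */

static void stage_string(uint8_t *buf, const char *s) {
  memset(buf, 0xFF, HDR_SIZE);            /* header patched later */
  strcpy((char *)buf + HDR_SIZE, s);
}

static void patch_header(uint8_t *buf, uint32_t hash) {
  memcpy(buf, &hash, sizeof hash);        /* fill in the real header */
}
```

The key property is that the C string payload never moves: between pass 1 and pass 2 only the header bytes in front of it are rewritten.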

As you will notice, this version doesn't include the dual-LFS nor the host absolute-LFS variants, but these requirements have been reflected in the architecture and they are modest incremental additions.

mk-pmb commented 4 years ago

Sounds good to me. :-)

mk-pmb commented 4 years ago

Once we have two LFS regions, will it be possible to write one of them incrementally from a download handler without caching the image in SPIFFS first?

TerryE commented 4 years ago

Good Q, but no.

TerryE commented 4 years ago

@nwf @HHHartmann @jmattsson and anyone else that wants to comment:

Just to flag a mild compatibility break with standard Lua (and one that never worked properly on NodeMCU anyway). Standard luac allows "-" as an input or output filename, which defaults to stdin / stdout respectively on POSIX builds, hence you can pipe in and out of luac. Because our load / unload code is common to both host and ESP and is multi-pass so as to work on the ESP within RAM constraints, allowing stdin / stdout as a valid input or output file is a total PITA, since you can't rewind pipes.

So for the avoidance of doubt I am going to say that we only support file-based source and compile output for the cross compiler and treat the filename "-" as invalid.

Anyone who wants to add this back in can add this extra tier of complexity and is free to develop a patch to spool to / from a temporary file after this PR is merged.
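For anyone who does take that on, the spooling workaround amounts to copying the non-seekable stream into a seekable temporary file first; a minimal sketch:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy a non-seekable stream (e.g. a pipe) into a tmpfile so a
 * multi-pass loader can fseek over it.  Returns the rewound temp
 * file, or NULL on error. */
static FILE *spool(FILE *in) {
  FILE *t = tmpfile();
  if (!t) return NULL;
  char buf[4096];
  size_t n;
  while ((n = fread(buf, 1, sizeof buf, in)) > 0)
    if (fwrite(buf, 1, n, t) != n) { fclose(t); return NULL; }
  rewind(t);
  return t;
}
```

With this in front of the loader, "-" could be re-enabled at the cost of temporary file-system (or host disk) space equal to the stream size.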

TerryE commented 4 years ago

Also note that there are other incompatibilities: standard luac always outputs a single compiled top level function. It does this by creating a dummy wrapper function if you specify multiple input sources; this compiles but isn't usable in practice as there is no execution path to access or to bind the individual Protos as closures. We support an output format which allows multiple top level functions (TLFs), so we just output a multi-TLF file.

Also note that there is nothing to stop you "compiling" one or more existing LC files. In our case, these are just aggregated into a compiled LC file (which can be used as an LFS image).

TerryE commented 4 years ago

In terms of the options:

Note that the option -m is no longer supported. This is because an LFS image can be created from multiple LC files. If you want to validate the size of a specific configuration, then create a (temporary) absolute LFS image using the -a option; the size of this file is the size of the corresponding LFS image.

Also note that as well as Lua source, LC files are accepted as input, so

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

mk-pmb commented 3 years ago

@stale I want this.

mk-pmb commented 2 years ago

@Stale not helpful.

HHHartmann commented 2 years ago

At least the two regions part would be nice