Policy of supporting Lua in ROM

TerryE commented 7 years ago

Although this issue links to earlier discussions in #1289 and #1661, I see this as a policy issue mainly for the committers, so can you all read this and give your comments so we can move forward on the basis of some form of consensus?

Of the ~45Kb RAM available on the ESP8266, typically half or more of this RAM is Lua compiled code and constant data as opposed to true R/W data. The facility to move Lua binary code into Flash will more than double the effective RAM available to programmers.

Do we add support for running Lua directly out of Flash?

If so do we add it to the current dev branch soon?

Background

A hierarchy of function prototypes and their associated vectors (constants, instructions, meta data for debug) is loaded into RAM when any Lua source or lc file is loaded into memory. Because in the Lua architecture each Proto hierarchy can be bound to multiple closures (this closure creation is only done by executing the CLOSURE instruction at runtime), such hierarchies are intrinsically read-only and therefore in principle ROMable.

The main complication here is that, like all other Lua resources, Proto hierarchies are garbage-collectable (and advanced Lua programmers exploit this collection). So IMO, the difficulties arise when devising the details of how any compiled Lua in ROM interacts nicely and stably with the GC: it's fairly straightforward to implement a scheme which works most of the time, but we need one which works all of the time in a well determined manner if we proceed with this.

I haven't worked out a robust way of doing an incremental storage system, as Phil discusses in #128, and IMO this will be hard to realise. What I have worked out how to do is essentially a "freeze into flash, then reboot" approach.

Essentially what this does is to maintain a ROM string table and a set of Proto hierarchies in a fixed (64Kb) flash block within the ICACHE_FLASH address space.
Any Lua files that you want to move into ROM must be preloaded into SPIFFS as lua or lc files.
The flash block can be discarded and rebuilt using a new node.rebuild_flash() function, supplying a list of the lua files that you want included in the ROM. This rebuild_flash call should preferably be made just after reboot. The call then either rebuilds the flash block and reboots the ESP immediately on completion, or leaves the flash block unchanged and reboots with an error status.
After the reboot, the modules in the flash block are in the require path and so can be executed by a require statement; the loadfile and dofile will also parse rom:module syntax and return or execute a closure accordingly.

Basic approach

This ROM-based Lua system uses two string tables as discussed in #1661: the standard RAM-based table and a ROM-based one ahead of it in the search order. The rebuild_flash routine unhooks the current ROM table, and does two passes of loading the modules.
The first pass is a dummy load. This serves two purposes: (i) to calculate the total storage requirement for the Proto hierarchies; (ii) to fill the RAM string table with all the strings needed to store the hierarchies.
The size of the string table and all string resources is then calculated, and if the total size of code and string table fits within the flash block, then the strings and string table are copied to flash and the RAM string table is GC pruned. The ROM table is rehooked and the RAM table replaced with an empty table.
The second pass is a flash buffer load, with all string constants resolved against the ROM table. Since all of the strings needed to load the Proto hierarchies are now in ROM, these hierarchies can now persist over reboot, and only the closure-based resources will occupy RAM.
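The "measure first, write only if it fits" shape of the two passes can be sketched in C as below. This is only an illustration under stated assumptions: module_t, measure_modules() and write_modules() are hypothetical names, the 4-byte rounding is assumed, and a module is reduced to a name plus a byte vector, so the string interning and Proto cloning that the real rebuild has to do are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one compiled module; not a NodeMCU type. */
typedef struct {
    const char *name;            /* module name, stored as a string resource */
    const unsigned char *code;   /* stand-in for the Proto hierarchy bytes */
    size_t code_len;
} module_t;

/* Round a size up to a 4-byte boundary (assumed here so that every
   record starts word aligned). */
static size_t align4(size_t n) { return (n + 3u) & ~(size_t)3u; }

/* Pass 1: the dummy load. Nothing is written; only the total is computed. */
size_t measure_modules(const module_t *mods, size_t count) {
    size_t total = 0;
    for (size_t i = 0; i < count; i++) {
        total += align4(strlen(mods[i].name) + 1);
        total += align4(mods[i].code_len);
    }
    return total;
}

/* Pass 2: copy into the block, but only if pass 1 says everything fits.
   Returns the bytes used, or 0 with the block left untouched. */
size_t write_modules(const module_t *mods, size_t count,
                     unsigned char *block, size_t block_size) {
    assert(block != NULL);
    size_t need = measure_modules(mods, count);
    if (need == 0 || need > block_size) return 0;

    size_t off = 0;
    for (size_t i = 0; i < count; i++) {
        size_t name_len = strlen(mods[i].name) + 1;
        memcpy(block + off, mods[i].name, name_len);
        off += align4(name_len);
        memcpy(block + off, mods[i].code, mods[i].code_len);
        off += align4(mods[i].code_len);
    }
    return off;
}
```

Because the size check happens before the first byte is copied, a set of modules that is too large leaves the existing block unchanged, which matches the "leaves the flash block unchanged and reboots with an error status" behaviour described above.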

This process is simple and robust, but the Lua RTS is built around the assumption that collectable objects don't move their location and that strings are interned. It will be impossible to return control to the invoking Lua after a successful load, and difficult to return control after a failed one, which is why this "reload flash and immediately reboot" option is the most robust.

This system would enable Lua programmers to be able to compile and execute significantly larger Lua programs within the ESP resources.

There are some extra wrinkles for the Lua 5.3 environment but I will park these for now. So comments so far?

pjsg commented 7 years ago

With the version that I built, I wasn't convinced that you could entirely rom-ify some functions (without a lot of hacking in the GC). The issue is that some objects need to be followed by the GC process, and so cannot be romified. [My memory is a little hazy, and I don't know if you can actually write such a function. My implementation had all sorts of corner cases where some of the internal vectors could not be rommed]

TerryE commented 7 years ago

It's quite straightforward to get the Proto hierarchy depending only on ROM resources, so you can safely short circuit the GC sweeping here. IIRC, your implementation also pushes closures into ROM, and this is bad news, IMO, because the upvalue chain will invariably point back into RAM.

I'll do the usual trick of adding this code to a vanilla Lua 5.1 version first and hammering it there on the PC before porting it to NodeMCU. It's just a pity that MMAPing a file into an absolute address window with the MAP_FIXED attribute is such a dog on modern OS kernels which use address space randomisation. But getting around this is only a few dozen lines of code. At least this way you can hammer out the GC interactions using a decent gdb implementation.

But to my core Qs, if we can get a robust approach here should we push it through dev to master?

nwf commented 7 years ago

As someone who often exhausts the heap, I'd be for such a change, assuming it's not crazy to support. The only downside, other than code complexity, perhaps, that leaps to mind is that running from flash means that it's less clear ahead of time when the flash chip will be engaged, so anyone trying to use the flash-associated GPIOs might be in for a surprise.

Not to ask a stupid question, but does anything in lua 5.2 or 5.3 make this easier? (I recall reading somewhere that there was an effort to bring nodemcu over to 5.3; am I just making that up?)

jmattsson commented 7 years ago

Getting Lua running out of flash would be a big boon given the RAM constraints.

My Qs/concerns are:

Is it forward compatible with the 5.3 work? If not, I think we'd better hold off on this - it'll be a noticeable bump doing that jump without pulling this rug too.
Would the "frozen" flash buffer be in a fixed location, or a logically fixed location (i.e. "at the end of .text" or so)? If the former, would we need to do something to deal with uneven wear? I'd very much like to avoid that if possible. Having the "frozen" flash block just hanging off the end of the firmware would move it about depending on which modules were compiled in at least, which is probably Good Enough(tm).
Presumably there'd be a function to check whether/what's loaded into "frozen" flash? Main use case here being where you pre-build your file system and want to get some things "moved" to flash on first boot, automatically, without ending up in a reboot loop :)
Do you need to say require('rom:mymodule') to look into flash? If not, can we make it so? I could foresee a case where you have a read-only filesystem from which you've "frozen" some code over into flash, and if you then just require('mymodule') you'd need to search the frozen area first, but if we do that then you're likely to end up surprised because you're loading old code even though your .lua (or .lc) has been updated. Of course, it could be argued we're sticking to tradition given we already have the issue with .lc vs .lua...
Instead of a rom:module syntax, would it be possible to hook the "freezer" into the VFS read-only? That way you could simply do require('/freezer/module') or such, and would avoid introducing another namespace.
Are there any alignment changes needed to Proto in order to be able to read them straight from flash? How much impact would that have in cases where the freezer isn't used?

TerryE commented 7 years ago

Sorry for the long reply guys, but I've tried to cover all of Johny's and Nathaniel's points below.

Forward compatibility with the 5.3 work

I will do a separate update on the 5.3 work on #1661, but as to the specifics of this functionality, like Johny I see addressing the RAM constraints as a key criterion for the success of NodeMCU Lua, so my original intent was only to add this functionality to 5.3. What I've raised here is essentially a backport of the technology to 5.1. Unlike the rest of the 5.3 work, which from a user perspective is either out-of-the-box 5.3 functionality or compatibility with the existing NodeMCU/eLua module API, this is new and pretty decoupled from the rest of the 5.3 work.

Moving it into 5.1 allows a more vigorous engagement with the current developer community to get a consensus on how this API should work -- as well as bringing forward the benefits to the community.

Flash buffer location and wear levelling

I see this as less of an issue for a number of reasons. The current Winbond chips such as the W25QxxFV series quote a life of more than 100,000 erase/program cycles, and even if the modules have the earlier generation NAND flash chips with a 10K cycle life, say, the mode that I suggest we use here, which is essentially a reboot-reload-reboot cycle, envisages a usecase more similar to the conventional C rebuild and flash life-cycle. Even during active development the module might see 10 reloads a day, and in production maybe 1 a month, so this isn't going to be an issue. It would also be trivial to consider refinements such as the SPIFFS_FIXED_LOCATION type parameter.
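As a back-of-envelope check of those figures, here is a minimal C sketch. The function name is hypothetical, and it assumes that each reload costs exactly one erase/program cycle of the same block, which is a simplification.

```c
#include <assert.h>

/* Days until the rated erase/program budget is used up, assuming every
   reload costs one erase cycle of the same flash block (a simplification). */
unsigned long days_until_worn(unsigned long rated_cycles,
                              unsigned long reloads_per_day) {
    assert(reloads_per_day > 0);
    return rated_cycles / reloads_per_day;
}
```

With the quoted 100,000 cycles and 10 reloads a day this gives 10,000 days, roughly 27 years of continuous development use. The pessimistic 10K-cycle part still gives 1,000 days at that rate, and far longer at the production rate of about one reload a month.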

Search order for loading

The Lua require system uses a set of helpers defined in the package.loaders array. These can even be changed at a Lua programming level. Even so I recommend that the default order should be that the ROM should be searched first, for performance reasons, but that in a development mode the developer would be free to do in init.lua

  local pl=package.loaders; pl[3],pl[2] = pl[2],pl[3]

to reverse this search order. And note that you can only specify the module name as a require parameter; it is the loaders (or searchers in Lua 5.3) that determine where to look.

The load functions are different in that these don't have a searcher concept and so we need some simple method of encapsulating access to the ROM store at a Lua API level. Also, accessing the ROM store is fundamentally different from any of the load functions (load, loadfile, loadstring, dofile), as these all execute a load operation which is expensive at runtime. The ROM store contains a set of compiled Proto hierarchies in memory, and all that is needed to convert them into a closure (which is a "function" in Lua terms) is to execute the CLOSURE VM instruction, and this needs a few lines of C to be executed as Protos are hidden from the Lua execution world.

This is why I would promote the use of modules rather than functions in the store, as this is more transparent and fits better into the Lua paradigm. Nonetheless, if we want to make a more transparent method of loading functions from the store or VFS, whatever we do is going to be slightly a botch because internally within the relevant load function this isn't encapsulated within the vfs because you don't actually do a load with a stored routine. It's already loaded, just not bound to a closure.

We have exactly the same issue today with lc vs lua loads. The standard API leaves the handling of this and any precedence issues to the Lua programmer. I would just add rom to the list and still leave it to the programmer. My own standard template is to use an autoloader for my functions to hide all of the error handling and precedence issues. I myself would just extend this with one line:

setmetatable( self, {__index=function(self, func) --upval: loadfile
    func = self.prefix .. func
    local f, msg
    if not skiprom or not skiprom[func] then f = getrom(func) end  -- handle ROM load
    if not f then f,msg = loadfile( func..".lc") end
    if msg then f, msg = loadfile(func..".lua") end
    if msg then error (msg,2) end
    if func:sub(8,8) ~= "_" then self[func] = f end
    return f
  end} )

or actually the ROM optimised version which saves about 300 bytes RAM:

-- skiprom if defined is a global
setmetatable( self, {__index=getrom("autoloader")} )

The getrom function returns nil if the Proto isn't in the store, so this is the cheapest method of checking for existence. However we could also have a debug.getromprotos in the same way we have debug.getregistry, though this would have to return a list of names since the Proto values are meaningless in Lua.

Performance and alignment issues

Yes, access from Flash is roughly 13-25× slower than from RAM in the case of a cache miss, but at the moment executing every Lua VM instruction (4 bytes) involves reading 100s of bytes of Xtensa instructions from Flash to interpret this one instruction. However, flash access is RAM cached and this reduces the overall impact, though accessing scattered flash address regions will increase cache fault rates, so IMO moving code into ROM will slightly decrease instruction execution performance.

But we also have to balance the slight increase of runtime in accessing rom-based constants and strings with the fact that all of these resources are in ROM and have been removed from the scope of the GC, so GC sweeps will be a lot shorter and required less often. A big runtime saving.

Also, the RAM limitations mean that non-trivial Lua programs involve a lot of dynamic loading of code from SPIFFS which is slow because of the double whammy of SPIFFS overheads and the Lua load process. Converting ROM Protos to encapsulated functions is fast.

So I believe that the average Lua application will run faster overall.

Unaligned (in the Lua RTS nearly all byte) fetches are slow because of the overhead of the unaligned exception handler. -O2 instead of -Os helps for general string access but not for this. However there are (inline macro assembler) techniques we can use to replace unaligned fetches by a two-instruction aligned fetch and extract. But I see this as a second order optimisation for later.
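To make the "aligned fetch and extract" idea concrete, here is a plain C sketch rather than the inline assembler mentioned above. The function name is hypothetical, and the code assumes a little-endian target (which the ESP8266 is) and that the containing 32-bit word is readable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read one byte through a 32-bit aligned load instead of a byte load.
   Assumes a little-endian CPU: byte k of a word sits at bit offset 8*k. */
uint8_t read_byte_aligned(const void *p) {
    assert(p != NULL);
    uintptr_t addr = (uintptr_t)p;
    /* Round the address down to the containing 32-bit word. */
    const uint32_t *word = (const uint32_t *)(addr & ~(uintptr_t)3);
    /* Shift the wanted byte down to bits 0..7 and truncate. */
    unsigned shift = (unsigned)(addr & 3u) * 8u;
    return (uint8_t)(*word >> shift);
}
```

The load always touches a whole aligned word, so it never raises the exception that a byte load from flash would; the cost is one extra shift per access.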

nwf commented 7 years ago

Thanks much for the very detailed response!

jmattsson commented 7 years ago

Nice comprehensive response, cheers. Just a minor comment regarding unaligned stuff; I really did mean unaligned (exception code 9), rather than the sub-32bit-wide loads ("load/store error", code 3). The latter we have our custom exception handler to patch up and recover with. Unaligned 32bit access however would still be fatal. It may not be an issue as you say "nearly all byte", but worth keeping an eye on.

I'm in favour of the approach outlined here. Obviously we'd need to have good docs explaining how to use it, when the time comes :)

Ah, and one more question: a 64k block is obviously larger than the amount of free RAM we currently have, and thus would likely go partially unused no matter how badly one tries to move code to it. Any ideas on how to get the best use out of it? I know you ruled out incremental freezes above...

TerryE commented 7 years ago

It's late for me so a quick response. I understand your exception code 9 point and I will check this, but I don't believe that this is an issue.

Re 64Kb, the reason for two passes is to serialise the load process. There are two constraining factors: the size of the string table, and the size of the largest module that you need to load, because each file is loaded into RAM, then cloned into flash. We need to play to see how much of a constraint this is in practice, but whatever it is, it's still a lot better than current constraints.

dtran123 commented 7 years ago

For those concerned with performance, I would argue that one of the key usecases for the ESP8266 (& ESP32) is IoT applications. For the majority of cases that I can think of, fast execution is not critical. So even if this resulted in a slightly slower execution time, I would be ok with it.

I can see great benefit from this feature:

additional RAM to be able to properly support secured tcp/http connections for the ESP8266, additional cyphers and larger key sizes. I am less concerned for the ESP32.
significantly reduce frustrating heap crashes cases resulting in greater adoption of the Lua Nodemcu platform.
simplifying code (many times, ugly lua code has to be used to work around ram limitations)

These are just 3 benefits that completely justify such an initiative.

I would be curious what is the effort estimate for completing this task. Are we talking weeks or months ? (for a developer)

I believe we run the risk of losing many community members for the ESP8266 if we don't solve the RAM shortage which appears to affect proper support of secured connections.

TerryE commented 7 years ago

The issue isn't so much absolute hours, but elapsed. The internals of the Lua engine are both subtle and complex, and I (TerryE) seem to have taken the short straw to get to grips with all this. The issue is that all of this work is unfunded and done in my spare time, threaded amongst my other commitments like finishing off a house that my wife and I are building -- and doing the Home Automation for the same which needs its own ESP code. But as to your core point it's man-days of work (my being a male) rather than person-weeks, as a lot of the foundation work is already done as part of my Lua 5.3 upgrade for NodeMCU.

dtran123 commented 7 years ago

Would it help you if some of us were helping you with the funding ? We could gather a few volunteers to help out with donations. I am willing to help if I know this feature will result in fixing the current problem with secured tcp connection that has started with the SDK 2.x. on the ESP8266.

TerryE commented 7 years ago

@dtran123, nah. don't need the dosh. I spent 35 years in IT and ended up on top of the techie shit-heap. I am now a gentleman living on a sinecure (pension). It's hours in the day and priorities that are my problems :wink: Let me crack on, whilst the brain is still working.

nwf commented 7 years ago

@dtran123 Increasing available heap may help with secure SSL, but as I reported in #1707, I think it is already viable to work with (and verify) ECC keys instead of RSA keys. If you control both ends, I think this is the most immediate path forward. You'll have to tweak the mbedtls configuration file as done in https://github.com/nwf/nodemcu-firmware/commit/c1ed48c09a2fafc85e53febf1298ae945da09531 (and likely want to cherry-pick @djphoenix's update to mbedtls first, https://github.com/djphoenix/nodemcu-firmware/commit/4958a4a12a16d91d58ec73652a0b1ddc4df4f6fa) and/or see if @marcelstoer can add some checkboxes to the web builder to achieve the same effect.

@TerryE Please don't take any of that to mean that I amn't rooting for your success. If not donations, perhaps a beverage of your choosing if we're ever in the same place. :)

georeb commented 7 years ago

I want to support this in any way I can. Donations, testing, beer... please let me know if I can help! 👍

TerryE commented 7 years ago

At the moment, I am thinking about work-arounds for some interesting catch-22s thrown up in standard Lua testing. The clone to flash process destructively overwrites the old version of the cloned ROstrt, but this in turn was a clone of an earlier version of the strt, so contains all of the strings like package.loaders keys, and I need these to persist in the same locations, so that the loading code itself doesn't fall over. So I can't quite use a simple serial allocator for flash. It's an issue that I will need to address anyway with 5.3, but this is one to solve during some ZZZZZZZ or over a glass of wine, and not in the editor :smile:

nwf commented 7 years ago

Would it help to have two segments of the in-flash data? One which was objects whose positions needed to remain invariant across updates, and one which could be overwritten at will at each clone? I presume the former can be relatively small and so loaded into RAM at the start of cloning, and then written back to flash only after the second segment has been constructed and any requisite additions made?

TerryE commented 7 years ago

Close. My current approach is to treat the first boot after flashing the Lua firmware as special. This is partly to solve some issues in the NodeMCU 5.3 version where you can declare TStrings at compile time. The RTS performs library initialisation then executes a clone before starting Lua execution. This just clones the base string table. The addresses of this first tranche of TStrings are then preserved across subsequent clones, so the tables which use them are OK. Time for bed for me, as I am on UTC+0.

TerryE commented 7 years ago

Please see my paper on this approach: LRO Functions in NodeMCU Lua. Sorry, it includes some typos and other errors, but I will fix these if I update it following any review comments.

nwf commented 7 years ago

@TerryE This looks really well thought-out. I very much like the flash-block lifecycle tracking trick (1F -> 7 -> 3) and the multi-reboot design seems like it will work well without being too complicated.
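One property that makes a sequence like 1F -> 7 -> 3 attractive for tracking state in flash is that each step only clears bits, and clearing bits is the one change flash can make without a block erase. The C sketch below checks that property. It is an assumption about why the values were chosen, not a description of the paper, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Programming flash can only turn 1 bits into 0 bits; turning a 0 back
   into a 1 needs an erase. Model a program operation as a bitwise AND. */
uint8_t flash_program(uint8_t current, uint8_t value) {
    return (uint8_t)(current & value);
}

/* True if 'to' can be reached from 'from' by clearing bits only. */
bool reachable_without_erase(uint8_t from, uint8_t to) {
    return (uint8_t)(from & to) == to;
}
```

Under this model both 0x1F -> 0x07 and 0x07 -> 0x03 are reachable without an erase, while going back from 0x03 to 0x07 is not.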

ETA: Is there any way we could compute the flash block on the host as part of the image build? Obviously not exclusively, given the intended node.rebuild_flash() API, but in addition, perhaps?

TerryE commented 7 years ago

I already have a lot of code that can do this in a 5.3 environment, but then again the standard NodeMCU make generates a luac which runs on the host and does the same as the current luac.cross, except for a -X option which allows you to run NodeMCU Lua on the host. This makes it simpler to implement this, but there's no reason in principle why the equivalent shouldn't be done for 5.1, but let's get the on-chip version working and released first.

TerryE commented 7 years ago

I have been thinking about this cross build issue. It would in principle be a straightforward variation. The way to do this would be to extend luac with an option -F <stringlist> where the string list would just be a text file list of strings to be included in the ROstrt, so

  luac.cross -O flash.bin -F default_strings.txt MyProj/*.lua

might build a flash image for downloading based on the Lua files in MyProj. A couple of wrinkles here: (i) luac.cross uses host-native pointers for its in-memory data references which are typically 64bit these days, not 32, so some resizing would need to be done. (ii) I'd need a relocatable format for these flash binaries. These are both trivial to address, but I don't want to get sidetracked on this just yet.

jmattsson commented 7 years ago

Just throwing another thought into the pot here.

For the modules with a mere 512k flash, would it be feasible to exclude the "freezer" support? Would it make sense to have the interface sitting in e.g. a freezer module with load(), isempty() and clear() functions, and then either #ifdef out or use a zero-size page/area for the storage if the module is not enabled? Just thinking that 64k might prevent people from using the old ESP01 modules with modern NodeMCU otherwise.

TerryE commented 7 years ago

@jmattsson, I'd already decided to do that (and also make this option disabled by default in early releases, at least) for two reasons: first, so that those that don't want it don't have the flash overhead, and second just in case we find issues in early testing. If we can conditionally remove the code then we can safely release it into dev.

A second point: what do we formally call this? Philip first proposed the idea and called it "freezer". Do we use that, or do we use "flash"?

TerryE commented 7 years ago

I've got the vanilla PC 5.1.5 version working fine now. This does a mmap() of a pseudo flash area into the VM, and then uses mprotect() to turn off write access whilst the VM is running.
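For readers who have not used this pattern, a minimal POSIX sketch of a write-protected pseudo flash region follows. It is an illustration only: the function names and the 64Kb size constant are hypothetical, and it uses an anonymous mapping for brevity where the PC build described above may well map a file.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#define PSEUDO_FLASH_SIZE (64u * 1024u)   /* assumed block size */

/* Map a region standing in for flash, copy an image into it while it is
   still writable, then drop write access. Returns NULL on any failure. */
void *pseudo_flash_create(const void *image, size_t len) {
    assert(image != NULL);
    if (len > PSEUDO_FLASH_SIZE) return NULL;

    void *base = mmap(NULL, PSEUDO_FLASH_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return NULL;

    memcpy(base, image, len);

    /* From here on any stray write into the region faults immediately. */
    if (mprotect(base, PSEUDO_FLASH_SIZE, PROT_READ) != 0) {
        munmap(base, PSEUDO_FLASH_SIZE);
        return NULL;
    }
    return base;
}

void pseudo_flash_destroy(void *base) {
    if (base != NULL) munmap(base, PSEUDO_FLASH_SIZE);
}
```

Turning off write access means that any part of the VM or GC that still tries to modify a "flash" object dies with a segfault at the offending instruction, which is easy to catch under gdb on the host.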

I had a bit of fun with the GC as this still tries to mark fixed resources during GC sweeps even if it isn't doing this in string sweeps. This relates to its algo for weak tables, especially kv mode tables that are typically used for memoized functions and ephemeron tables (see PiL 17.2 and 17.3), but I can't see where strings in Flash being fixed would break the application behaviour. More to the point, I very much doubt that any IoT Lua application would fall foul of this, and having double the RAM available would help sweeten the pill if it does.
The other area is in the cascade clean-up of Proto hierarchies when a closure is GCed, but here the Proto fixing works as intended.

The PC-based version has to support PIC Flash because of Linux address randomisation, and if we start looking at @nwf Nathaniel's suggestion of Host buildable images then we might want to do the same for the NodeMCU version. However, I suggest that we keep the NodeMCU version as simple as possible in its first iteration. I am not going to include byte-access optimisation in this first version so it will hit the unaligned exception handler, but adding this as a second pass is pretty straightforward.

One thing that did strike me is that as soon as the VM starts running, the minimal string table is around 10Kb. Because this minimal core is moved to the ROstrt, this immediately frees up ~10Kb from a typical running Lua app even if it hasn't frozen any code into the flash.

jmattsson commented 7 years ago

Because this minimal core is moved to the ROstrt, this immediately frees up ~10Kb from a typical running Lua app

👍

TerryE commented 7 years ago

I've just been doing an L8UI code review. (This is the instruction that triggers 99+% of the unaligned fetch from flash exceptions.) It isn't too bad at all: the 'hot' modules, lobject.c, lstring.c, ltable.c have only 8, 7, 11 respectively and a couple of simple macro changes will avoid the material ones here, and in one case (luaO_log2()) a reasonable chunk of code can be replaced by a single asm instruction.

lgc.c does a lot of marking and sweeping so accesses here should really avoid byte-based bit diddling in flag byte fields. For example, the compiler generates a 3 instruction sequence to test a single bit: load a byte; shift left the bit into the sign bit and branch if negative -- and this generates an unaligned exception. But the equivalent load 32bit; shift left the bit into the sign bit and branch if negative also takes 3 instructions, executes as fast and doesn't generate the exception.

lstrlib.c has a lot of character based manipulation, especially in the pattern-based searching and matching so ROM-based patterns will be bad news, but it is pretty straightforward to define a macro to clone the string into an alloca()ed copy (if the string is in ROM and less than some safe limit long) and this would be a single line addition per parameter to such hot routines.

But we can do this sort of optimisation once we've got the basic code working.

TerryE commented 6 years ago

Testing this lot is a total bitch. If you use the gdbstub then you can't use uart0 for Lua input or output. So you have to hook up a second USB serial chip to the UART1 and get debug logging that way. I've got two methods of loading code: a RAD cycle based on spiffsimg'ing a small 32Kb FS with various test stubs, and potentially a telnet stub, but you've got to get your execution past the basic bootstrapping processes. I am still fighting EGC issues which are subtly different to the standard Lua, and I am juggling this with all of my other time pressures.

At least the PC version works OK.

It doesn't help that the gdbstub is very fragile: if you can get to a breakpoint then you can examine RAM, but flash-based exceptions just seem to bypass the GDB exception handler entirely and panic the CPU, so there's no opportunity for PM diagnosis. :disappointed:

Any pearls of wisdom or even sarcasm welcomed :smile:

nwf commented 6 years ago

@TerryE I am afraid I have no wisdom to add, and sarcasm seems like it won't help much. I don't suppose the ESP8266 believes in JTAG?

jmattsson commented 6 years ago

Sure @TerryE , how's "if it was easy then some other idiot would already have done it"? ;)

Is the gdb stub being bypassed because we've already hooked the flash exception? I think Philip changed it so the handlers would chain for anything we didn't handle though, so I might be way off.

TerryE commented 6 years ago

@nwf As far as I know JTAG is a BITE interfacing technology. I've got more than that level of access and diagnostics; as Johny says: if it was easy ... We're working way up the stack here. Philip has already done some extremely valuable ground-breaking to help. It's a balancing act: I accept what we've got for now and work around it, or get sidetracked in improving and integrating the built-in test. What is clear is that we should do a major rewrite of the Extension Developer FAQ to include stuff like using the gdb stub, logging to UART1, using the mapfile, ...

Yes, Johny, perhaps we need to make the flash exception handler gdb aware in the die branch. I will think about this.

As it stands I have Lua 5.3 working as a NodeMCU host build and ditto the flash variant of std Lua 5.1.5, but bootstrapping this into the ESP8266 just takes time and perseverance. It's just that my other commitments mean that the elapsed time is more than I'd prefer. Luckily I am an old enough fart that I've done quite a bit of this low level hacking professionally back in the day, so it's just a matter of dusting off the cobwebs.

nwf commented 6 years ago

Well, JTAG is used for BITE/BIST access to be sure, but some MCUs have mechanisms for opcode-level debugging via JTAG. The Atmel AVRs, for example, IIRC, can be single-stepped and support hardware breakpoints through gdb when connected via JTAG (or their own "DebugWire" interface). There's no need for a gdbstub on the core, and no risk of missing exceptional control transfers.

TerryE commented 6 years ago

Is the gdb stub being bypassed because we've already hooked the flash exception?

Johny, I think that this is the magic Q. gdbstub.c:install_exceptions() sets the exception handler for a bunch of exceptions including EXCCAUSE_LOAD_STORE_ERROR -- which fails silently because this has already been hooked by user_exceptions.c. This handler catches unaligned fetches and returns to the interrupted code after emulating the fetch, but it goes into a while(1) {} loop to force a reboot if not. What it should do is to daisy chain to the gdb handler otherwise.

jmattsson commented 6 years ago

Ah, it looks like even though https://github.com/nodemcu/nodemcu-firmware/commit/2dacec156a8cf1f39c56bfb056e493f36aba50cf introduced the chaining, it continues on to the while(1). Looks like there should be a return after the load_store_handler(ef, cause) call then. After all, if someone else is claiming to handle the exception, we shouldn't have to babysit it afterwards.

jmattsson commented 6 years ago

@nwf I was going to write that there is no known JTAG support for the esp8266, but it seems I would've been wrong to say that.

nwf commented 6 years ago

@jmattsson Intriguing! Still, if it's possible to do all that's needed by the gdb stub, that's likely simpler. :)

pjsg commented 6 years ago

The asm break 1,1 causes entry to the gdbstub (as I recall). You can't continue from it -- I never found a way to continue after an exception (though it ought to be possible somehow).

TerryE commented 6 years ago

Let's ignore JTAG in this issue, please, as it isn't relevant to it. As far as the gdbstub / unaligned fetch handler interoperation goes, then surely the main thing to do is to ensure that the two handlers interoperate properly?

The GDB stub uses an extended Xtensa OS HAL exception frame structure (see gdbstub_entry.S:L34-L45) and polls back to the remote debugger until a continue command is received in which case it updates the exception frame and attempts to return execution to the application.

The exception handler that catches all Flash exceptions does something similar for L8UI, L16UI and L16SI instructions, emulating the instruction. Otherwise it chains to the GDB stub handler and this is where we could be going wrong. I'll have to try some use cases, but if the GDB stub does try to unroll the exception and continue, then shouldn't the unaligned fetch handler honour this and return control?

    if (load_store_handler) {
      load_store_handler(ef, cause);
      return;
    }
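
To make the intended control flow concrete, here is a minimal host-side sketch in C, not the firmware code itself. Everything prefixed `sketch_` or `SKETCH_` (the frame layout, the dispatcher, the outcome enum, the stub) is hypothetical and only illustrates the idea: emulate a byte load from an aligned 32-bit read where possible, otherwise hand the exception to the previously installed handler and return, rather than falling through to the reboot loop. It assumes little-endian byte order and a 24-bit instruction encoding for the emulated load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down exception frame; the real Xtensa frame has more fields. */
struct sketch_frame {
    uint32_t epc;     /* address of the faulting instruction */
    uint32_t result;  /* where the emulated load result is written */
};

typedef void (*sketch_handler_t)(struct sketch_frame *ef, uint32_t cause);

enum sketch_outcome { SKETCH_EMULATED, SKETCH_CHAINED, SKETCH_UNHANDLED };

/* Previously installed handler (for example a debugger stub), or NULL. */
static sketch_handler_t chained_handler = NULL;

/* Counts how often the stand-in debugger stub below was entered. */
static int stub_calls = 0;

/* Stand-in for a debugger stub that takes over the exception. */
static void sketch_debugger_stub(struct sketch_frame *ef, uint32_t cause)
{
    (void)ef;
    (void)cause;
    stub_calls++;
}

/* Emulate an 8-bit unsigned load using only an aligned 32-bit read
   (little-endian byte numbering within the word). */
static uint32_t emulate_l8ui(const uint32_t *words, uint32_t byte_offset)
{
    uint32_t word = words[byte_offset / 4];
    return (word >> ((byte_offset % 4) * 8)) & 0xFFu;
}

static enum sketch_outcome sketch_dispatch(struct sketch_frame *ef, uint32_t cause,
                                           const uint32_t *words,
                                           uint32_t byte_offset, int is_byte_load)
{
    assert(ef != NULL);
    if (is_byte_load) {
        ef->result = emulate_l8ui(words, byte_offset);
        ef->epc += 3;  /* assumption: skip a 24-bit instruction */
        return SKETCH_EMULATED;
    }
    if (chained_handler) {
        chained_handler(ef, cause);
        return SKETCH_CHAINED;  /* return instead of falling into the reboot loop */
    }
    return SKETCH_UNHANDLED;    /* real firmware would spin here to force a reboot */
}
```

The point of the `SKETCH_CHAINED` branch is that once another handler has claimed the exception, this handler simply returns and lets that handler's changes to the frame take effect.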

jmattsson commented 6 years ago

That's precisely what I was trying to say above. Try it!! :)

TerryE commented 6 years ago

Sorry Johny, I am just being slow tonight. I will do :smile:

dtran123 commented 6 years ago

Hello Terry, Have you been able to make some progress ? I think this feature will be a game changer for the ESP8266. My biggest hope is that this will resolve our secured tcp connection issue with the latest SDK version (possibly due to lack of memory).

TerryE commented 6 years ago

I am on holiday at our house on one of the Greek islands at the moment. Along with my laptop and ESP chip. And this break has given me the bandwidth to iron out the final wrinkles.

I'll push an Alpha version in the next day or so.

dtran123 commented 6 years ago

Super ! We appreciate all the work here and owe you a lot of beers ! Enjoy your vacation too I hope :) Will the Alpha version be on the dev branch so some of us can test it out ? (I am not in a position to make my own builds)

TerryE commented 6 years ago

The patch uses a conditional to enable / disable this functionality so that we can quickly and safely add it to dev, as this will be disabled by default in the first instance. It is up to @marcelstoer as to whether he adds the extra switch to the Cloud Builder so users can request cloud builds with this enabled. We will be updating the dev default once some of the other committers have had a chance to evaluate and test the patch.

marcelstoer commented 6 years ago

Seeing that dev usage on the Cloud Builder is down to 5% (https://nodemcu-build.com/stats.php) I'd send people over to using my, now further improved, Docker image during the transition phase.

dtran123 commented 6 years ago

Thx. I will follow the docker instructions at https://hub.docker.com/r/marcelstoer/nodemcu-build/ to build the image. Once you have the code ready, please give us the details such as which conditional to set, etc. To get us started maybe you could share a link to a build with basic modules included that allows testing of secured tcp connections via tls. That is enough for me to do most of the critical tests.

nwf commented 6 years ago

@dtran123 As discussed in other threads, I wouldn't expect TLS to work out of the box, or at least, not well, even after more heap is made available by storing RO contents in flash. (It will certainly improve matters, but IIRC the TLS issues are not exclusively of the "out of heap space" mode.)

TerryE commented 6 years ago

OK, I have an Alpha version working quite well. Doing a node.rebuildflash(...) only takes seconds, even though it involves 3 reboots of the chip. Once it's loaded, the CPU comes back with ~45K heap available, but the main difference is that at the mo' this then rapidly drops to 20Kb or whatever as soon as you start loading modules. When they are coming from Flash, you only lose heap for genuine R/W variables, so my rough 2x estimate seems to be holding up.

I could push this to my repo, but I've had to do quite a few other hacks to be able to test it. See #1862 for the back-story. I need to make the diagnostics conditional.

On another note, what this also throws up is how the GC hammers default performance, especially GC of strings. Need to think about this further.

If any of the committers want access to the alpha code for their own evaluation, then I will do a push, but @dtran123 et al: sorry guys there's just too much learning curve at the moment for me to push something that you could usefully use :disappointed:

nwf commented 6 years ago

@TerryE I amn't a committer, but I wouldn't object to seeing the bits. Congrats on hitting Alpha; it all sounds very exciting! :)

TerryE commented 6 years ago

I'll have a slightly polished version in a day or so.

TerryE commented 6 years ago

I have split the debugging discussions into a separate issue / PR #2146. And if I assume that is committed to dev first, then the LFS (Lua Flash Store) specific changes are as follows:

lua/lflash.c, lua/lflash.h The new module implementing the LFS functionality and the luaN_* entry points.
lua/ldblib.c. Add extra debug functions getstrings and getflashmodules to return an array of string table entries and of modules in LFS.
lua/lgc.c, lua/lgc.h. Prevent the marking or GC of ROM based GObjects. Also some GC diagnostics (to be removed).
lua/lobject.c, lua/lobject.h, lua/lrotable.h. Remove LUA_PACK_VALUE conditionals. eLua option not supported on NodeMCU and also incompatible with LFS. Some changes to macros to prevent the modification of ROM GCObjects. Also inlining a "nsau %0, %1;" instruction to remove an equivalent C function which hit the flash unaligned exception handler badly.
lua/lstate.c, lua/lstate.h. Add initialisation of G(L)->ROstrt plus LFS hook.
lua/lstring.c, lua/lstring.h. Add scanning of G(L)->ROstrt and also some macro tweaks to prevent marking of RO strings.
lua/lua.c and user/user_main.c. Hooks into LFS initialisation and to allow fast reboot from LFS phase without going into init.lua. I also removed some of the #if 0 crud which confused things.
modules/node.c. Added function entries for getflash and rebuildflash to call luaN_getfunction and luaN_setfilelist_reboot respectively.
platform/common.c, platform/platform.c, platform/platform.h. Extend platform flash interface to enable reservation of flash areas, plus platform_flash_mapped2phys and platform_flash_phys2mapped functions.
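
As a rough illustration of two of the ideas in this list (translating between mapped and physical flash addresses, and keeping the GC away from ROM-resident objects), here is a minimal host-side sketch in C. All names prefixed `sketch_` or `SKETCH_` are hypothetical and are not the signatures in the patch. The mapped base address and window size are assumptions for illustration, and the real platform functions may also need to account for the flash map layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration only. */
#define SKETCH_FLASH_MAP_BASE 0x40200000u  /* assumed start of memory-mapped flash */
#define SKETCH_FLASH_MAP_SIZE 0x00100000u  /* assumed 1 MB mapped window */

/* True if addr lies inside the (assumed) memory-mapped flash window. */
static bool sketch_is_rom_address(uint32_t addr)
{
    return addr >= SKETCH_FLASH_MAP_BASE &&
           addr - SKETCH_FLASH_MAP_BASE < SKETCH_FLASH_MAP_SIZE;
}

/* Mapped (CPU-visible) address to physical flash offset. */
static uint32_t sketch_mapped2phys(uint32_t mapped)
{
    assert(sketch_is_rom_address(mapped));
    return mapped - SKETCH_FLASH_MAP_BASE;
}

/* Physical flash offset to mapped (CPU-visible) address. */
static uint32_t sketch_phys2mapped(uint32_t phys)
{
    assert(phys < SKETCH_FLASH_MAP_SIZE);
    return phys + SKETCH_FLASH_MAP_BASE;
}

/* A collector can use the same predicate to leave ROM objects alone:
   they cannot be written, so they are neither marked nor freed. */
static bool sketch_gc_should_mark(uint32_t object_addr)
{
    return !sketch_is_rom_address(object_addr);
}
```

The design point is that a single address-range test is enough to tell ROM objects from RAM objects, so the GC and string-table changes can share one cheap check instead of tagging every object.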

I still need to test my variants to ensure that a debug-free version and one without the -DLUA_FLASH_STORE option compile and run as intended. This last option essentially drops the LFS functionality from a build, thus allowing us to promote the patch to dev earlier, since the default build will conditionally remove the LFS code.

I've still got a lot of testing to do, and possibly some performance issues to address that testing the current version has thrown up. Addressing these could significantly improve runtime performance, but even in its current form, the patch in practice allows Lua developers to deploy far larger embedded Lua applications.
