nodemcu / nodemcu-firmware

Lua based interactive firmware for ESP8266, ESP8285 and ESP32
https://nodemcu.readthedocs.io
MIT License
7.67k stars 3.12k forks source link

Uploading LFS.img with ESPlorer does not work anymore #2963

Closed HHHartmann closed 4 years ago

HHHartmann commented 5 years ago

Expected behavior

Uploading LFS image using ESPlorer should work.

Actual behavior

The device sometimes reboots and ESPlorer displays

> Uploading to ESP file luac.out...Waiting answer from ESP - Timeout reached. Command aborted.

Test code

This seems to be a problem for binary files only, as a 6K binary file cannot be uploaded but a 60K textfile can.

The code used by ESPlorer to receive the file looks like this:

_up=function(n,l,ll) local cs = 0 local i = 0 print(">".." ") uart.on("data", l, function(b) i = i + 1 file.open("Large File.txt",'a+') file.write(b) file.close() cs=0 for j=1, l do cs = cs + (b:byte(j)*20)%19 end
uart.write(0,"~~~CRC-".."START~~~"..cs.."~~~CRC-".."END~~~") if i == n then uart.on("data") end if i == n-1 and ll>0 then _up(1,ll,ll) end end,0) end file.remove("Large File.txt")

_up(270,250,145)

It then sends n blocks with the length of l and one last block with length ll.

NodeMCU version

NodeMCU 3.0.0.0 branch: dev commit: f3044e1aff432ca48f6511554b317e0c80207d1c release: 2.0.0-master_20170202 +451 release DTS: SSL: false build type: float LFS: 0x0 modules: adc,bit,dht,file,gpio,i2c,mqtt,net,node,ow,spi,tmr,uart,wifi build 2019-11-13 07:09 powered by Lua 5.1.4 on SDK 3.0.1-dev(fce080e)

Hardware

Generic Dev board ESP8266SXD

HHHartmann commented 5 years ago

More testing revealed that uploading of nul (0x00) bytes does not work properly. All other char codes seem to work.

TerryE commented 5 years ago

Gregor, thanks for this analysis. I'll take a deep dive into this tomorrow and get back to you. I think that this is an SDK 3.0 issue or at least the unification I've done so that the libraries can access the Lua 5.1 and 5.3 VMs with a common API. This big bitch with the pre SDK 3.0 runtime was that it was just far too easy to overrun the input UART, so with SDK 3.0, I've introduced the stdin and stdout pipes for buffering. This seems to have broken something.

matsievskiysv commented 5 years ago

I think that this is related to #2964.

TerryE commented 5 years ago

OK, this is is subtle and nasty. @HHHartmann @jmattsson, I would appreciate your thoughts here.

What we get as a result of this approach is edge artifacts, as the input stream is toggled between the Lua and application modes without sync pauses between the modes -- as is the case with ESPlorer. Interleaving piped input and direct input doesn't work well.

So using this example ESPlorer first uploads a 380byte / 2 line function _up(n ,bs, ls) which will split a file into n chunks of bs bytes, except the last potentially short chunk of ls bytes and append it to the output file chunk-by-chunk. If a 600 byte file was being uploaded then it might then send:

_up(3,250,100)\n<chunk1><chunk2><chunk3>

and the handover fails. Need to think this one through.

matsievskiysv commented 5 years ago

This is a cut down version of nodemcu-uploader. Basically, this uses the same approach.

import serial
from time import sleep

RCV = r"""
function recv_block(d)
  if string.byte(d, 1) == 1 then
    size = string.byte(d, 2)
    uart.write(0,'\006')
    if size > 0 then
      file.write(string.sub(d, 3, 3+size-1))
    else
      file.close()
      uart.on('data')
    end
  else
    uart.write(0, '\021' .. d)
    uart.on('data')
  end
end
function recv_name(d) d = d:gsub('%z.*', '') file.remove(d) file.open(d, 'w') uart.on('data', 130, recv_block, 0) uart.write(0, '\006') end
function recv() uart.on('data', '\000', recv_name, 0) uart.write(0, 'C') end
"""

LIST = r"""
for i,j in pairs(file.list()) do print(i) end
"""

CLEAR = r"""
for i,j in pairs(file.list()) do file.remove(i) end
"""

TIMEOUT = 0.3

with serial.Serial("/dev/ttyUSB0", 115200) as s:
    s.write(CLEAR.encode()) # remove files
    sleep(TIMEOUT)
    s.write(RCV.encode()) # upload code
    sleep(TIMEOUT)
    s.write(b'recv()\n') # enter uploading mode
    sleep(TIMEOUT)
    s.write(b'file.lua' + b'\000') # filename
    sleep(TIMEOUT)
    s.write(
        b'\001' + # ack
        b'\004' + # size
        b'test' + # data
        b'\000' * 123 # padding
    )
    sleep(TIMEOUT)
    s.write(
        b'\001' + # ack
        b'\000' + # size
        b'\000' * 128 # padding
    )
    sleep(TIMEOUT)
    s.read_all() # flush buffer
    s.write(LIST.encode()) # list files
    sleep(TIMEOUT)
    print(s.read_all())
HHHartmann commented 5 years ago

Looking at the _up() code snipped I posted above it first sends a > which probably is checked for by ESPlorer. So I assume it is esp: > ESPlorer: 1st 250 byte chunk esp: ~CRC-START~45326~CRC-END~ ESPlorer: 2st 250 byte chunk ... ... NOTE: the given crc number is only a sample.

My tests also showed that it is rather related to sending a null character. I tested with 256 byte long files containing all characters and then started to bisect the file. In the end it filed when there was a null character in the file and succeeded otherwise.

It also might be possible that we have different versions of ESPlorer, as mine uses _up(n,l,ll) and yours seems to use _up(n ,bs, ls)

TerryE commented 5 years ago

Gregor, you need to apply the patch that I discussed on the other issue https://github.com/nodemcu/nodemcu-firmware/issues/2964#issuecomment-554679666 to fix the \0 issue, and ESPlorer emits as you say _up(n,l,ll); it's just that p2 is the block size and p3 the last block size.

Picking up @seregaxvm's example, there are critical points where you need to sleep at least enough to let the UART so that the SDK tasking can be scheduled, but there is little point in going overboard.

Even so, if we can achieve terminal buffering but still implement on('data', ...) in a way that doesn't require an up issue of the tools, would be good. I'll get back to you tomorrow.

jmattsson commented 5 years ago

I'd like to understand more about this null byte issue first and foremost.

Other than that, it's probably not a terrible idea to have a high water level for the input pipe, which if reached posts a high priority run-the-vm message (on top of the earlier low-priority messages).

TerryE commented 5 years ago

Johny, see the above link for the fix, re null byte. Dicking around with priorities dynamically only has very limited benefit in the non-preemptive SDK scheduler. The current config works really well for interactive use and cut & paste. IMO, we should just pass all input through the stdin pipe.

The current mix of piped interactive feed and synchronous data feed doesn't work with legacy uploaders such as the one built in to Esplorer.

I'll work up something tomorrow and post it.

jmattsson commented 5 years ago

Ah, sorry Terry, my mailer hadn't shown me your response. Good pickup on that uninitialised variable issue.

I was under the impression that with your major 5.1/5.3 work all input was going through the stdin pipe. If it isn't, then I definitely agree that it should.

TerryE commented 5 years ago

@HHHartmann @seregaxvm, as an aside the downloaded stubs for both the Python nodemcu-uploader and the Java ESPlorer implementation could really do with reworking; e.g. in this second case the code does a file open, append, close per block; this strategy badly fragments the SPIFFS pages compared to just leaving the file open and writing to it serially, in which case the FS buffers the data stream and writes out complete pages. Updating the Python is straightforward, so long as Peter or one of his committers agrees to commit the PR, but doing the Java build is a PITA.

Always feeding the UART input through the stdin pipe is reasonably straight forward, but we do have to deal with the two input modes: text for Lua command input (this must correctly process \, \ and \) and the raw mode for UART input.

matsievskiysv commented 5 years ago

This is weird. After fresh build from dev (and application of patch https://github.com/nodemcu/nodemcu-firmware/issues/2964) file uploading broke yet again. Now hardware watchdog reset fires off during upload.

matsievskiysv commented 5 years ago

When on commit f3b49849d2ab38b8e, after application of patch #2964:

diff --git a/app/modules/uart.c b/app/modules/uart.c
index 2b1f9164..116a323f 100644
--- a/app/modules/uart.c
+++ b/app/modules/uart.c
@@ -31,6 +31,7 @@ static int l_uart_on( lua_State* L )
   if (lua_type( L, stack ) == LUA_TNUMBER)
   {
     data_len = luaL_checkinteger( L, stack );
+    end_char = (int16_t) ~0;
     luaL_argcheck(L, data_len >= 0 && data_len < LUA_MAXINPUT, stack, "wrong a
rg range");
     stack++;
   }
@@ -57,7 +58,10 @@ static int l_uart_on( lua_State* L )
     luaL_unref(L, LUA_REGISTRYINDEX, uart_receive_rf);
     uart_receive_rf = LUA_NOREF;
   }
-  input_setup_receive(uart_on_data_cb, data_len, end_char, run_input);
+  if (uart_receive_rf == LUA_NOREF)
+    input_setup_receive(NULL, 0, 0, 1);
+  else
+    input_setup_receive(uart_on_data_cb, data_len, end_char, run_input);
   return 0;
 }

Using simple script to test uart module:

import serial
from time import sleep

A = r"""
function foo(x)
   uart.write(0,"t")
end
uart.on("data", "c", foo, 0)
"""

s = serial.Serial("/dev/ttyUSB0", 115200)
s.write(A.encode())
sleep(0.1)
s.read_all()
s.write(b"12345abc")
sleep(0.1)
print(s.read_all())
s.close()

Callback foo is expected to run when character c is encountered in the input stream. So, character t should be returned only once. However, this script returns b'tttttttt', which means that foo was called for every character in stream.

matsievskiysv commented 5 years ago

btw, why do we cast values to int16_t before assigning char end_char in https://github.com/nodemcu/nodemcu-firmware/blob/dev/app/modules/uart.c#L26-L41?

matsievskiysv commented 5 years ago

This is also strange.

lua_pushvalue(L, stack); // <- top of stack
luaL_unref(L, LUA_REGISTRYINDEX, uart_receive_rf);
uart_receive_rf = luaL_ref(L, LUA_REGISTRYINDEX); // reference function from top of stack

We are making reference to value 3 via luaL_ref for some reason. We actually need to pop 4th value from stack before making reference to function, not push another.

TerryE commented 5 years ago

btw, why do we cast values to int16_t before assigning char end_char in https://github.com/nodemcu/nodemcu-firmware/blob/dev/app/modules/uart.c#L26-L41?

Historic reasons. Valid terminators are 0..255, but (int16_t) ~1 is used on fixed length reads so that no false match occurs.

The boss is visiting family tomorrow for a few days, so I will have time to sort this. The temporal logic of this needs carefully thought through.

TerryE commented 4 years ago

Except that I did my back in by not warming up enough before yoga -- hazards of old age :rofl: -- so have been rather distracted by a lot of pain.

Sergey, the use of the Registry for persistent Lua collectable references is a pretty standard coding pattern. You have to reference Lua collectables somewhere if you want then to persist across task invocations, and that they need a reference chain back to the environment _G or the Lua registry otherwise the Lua CG will collect them. This is old code BTW and I am not sure why the ToS was dupped as the Func TValue on the ToS isn't used again, a lua_settop() could have been used instead. It doesn't do any harm as the ldo.c wrappers balance unrolling the stack as part of the function call epilogue.

However, your inference is correct: correctly balancing the Lua stack across all code paths -- or the failure to do so -- is one of the major PITAs in Lua C API development and also an aspect where the elf gdb utility is extremely useful.

One slight complication here is that the implementation has to be common across both Lua 5.1 and Lua 5.3 as the rest of the ecosystem is shared.

TerryE commented 4 years ago

The functional requirement here is subtle. @jmattsson Johny and I discussed previously the idea of unifying stdin pipe for both interaction and application (uart.on('data', ... )) input, but on reflection, there are gotchas here. For example one use case that you might not want to break is where the Lua interactive thread is being fed from a telnet session or the like but the application itself is using a serial API over the UART to 'talk' to some intelligent peripheral. This is a valid use and one that we wouldn't want to break. Whatever we do should work for all of the use cases that we want to support, but it we also need to avoid backwards compatibility breaks for the common uploader tools that our community use.

jmattsson commented 4 years ago

As someone who does use the UART for talking to other devices rather than a human operator, there's still the run_input argument to uart.on(), though I'll admit it's often been easier to leave it as default and expose a set of Lua functions for the other-end-of-the-uart device to make use of. Well, when I have control over the other device, that is :)

If run_input is zero, then we naturally can't throw the data into the stdin pipe, but otherwise there's no conceptual argument against it, I think? Whether we should have an actual "data" pipe that's used for the run_input==0 case is a more interesting question I'd say (and the instinctual response from me is "yes, probably").

TerryE commented 4 years ago

Functional requirements for (serial) input handling

  1. We have two sources for input: the UART 0 Rx byte stream and the run application (e.g servicing a telnet session via a socket connection and node.input()
  2. We have two sinks for said input: the Lua interpreter and the app (a uart.on('data,...) CB)
  3. We want the Lua interpreter input to be paste friendly, but this in turn mandates that the input is buffered since the delays in processing of line-by-line chunk compile will result in Rx buffer overrun even for small pastes (~250 chars).
  4. We need the overall implementation to be non-blocking task friendly. We have to work within SDK tasking constraints.
  5. The Interpreter input is line-edit enabled: \ works as expected and \\ (and other variants) folded down to a single \ delimiter.
  6. We want to maintain a level of backward compatibility with existing 3rd-party uploaders, e.g. the ESPlorer uploader. These typically download and execute a small Lua stub to receive a chunked binary byte stream. In the case of ESPlorer there is no sync delay between the Lua execution of the receiver stub and the first byte of the first data block.

As it stands these requirements are mutually incompatible. For example (3) means that we need to pass the input stream through a buffered pipe, so we might end up with (input L1A)(incomplete compile L1A)(input L1B)(input L2)(compile L1A+L1B, execute)(input L3), etc. However executing chunk L1A+B might switch to data mode so L2 must be processed in binary mode, so for example any \ must be passed through and not treated as a functional backspace.

One option here might be to pass all input characters into the pipe, but there are use cases where the application would need to decouple network input from the uart pipe (e.g. using telnet to develop a Lua application which uses the UART for a serial API to a peripheral device.

Input modes

I think that the way to square this circle is acknowledge that we have two broad use cases or modes for interfacing to the interpreter:

The tl;dr is that my SDK 3.0 introduction of input buffering broke this second mode for 2 out of the 3 common UART-based uploader tools. The obvious work around is to add a uart mode which makes the buffered input optional so that the default boot mode is compatible with the assumptions built into ESPlorer, etc.

TerryE commented 4 years ago

Day 10 of being stuck flat in bed, but at least the pain has largely gone, so I can at least think straight. I have been brooding about how to approach this properly. We have 7 component files or "agents" in some sense involved in delivering this input / UART direction:

This underlines that most of the functionality is common to both Lua versions, and to me at least, it underlines that we should seek to abstract from the Lua implementation all functional aspects that are common to both. Both lua.c implementations are very different to the "out of the box" variants, but that is because they must work within an event-driven tasking environment, but again the non-functional requirements here are the same for both Lua 5.1 and Lua 5.3 so we can abstract these out of lua.c.

I will strip this pipe and ESP-specific functionality out of the 2×lua.c and I therefore need to move it somewhere. The driver/input.c file seems the best home for this. Maybe this file doesn't belong in drivers since it isn't really a driver but the only alternative directory for such platform-specific stuff is the platform folder. Maybe it would a better fit there, though we can always do any directory tidying as a separate exercise (we have similar issues with task.c).


Picking up my previous comment, I think that pipe input should be disabled by default so that we can reimplement backwards-compatibility for ESPlorer, etc. We can add a uart.buffer(boolean) API call to allow developer to choose which they prefer.

@jmattsson, How well does ESPlorer handle uploads on the ESP32?

matsievskiysv commented 4 years ago

We can add a uart.buffer(boolean) API call to allow developer to choose which they prefer

Maybe control buffering via uart.on method argument? data for unbuffered and code for buffered.

jmattsson commented 4 years ago

@jmattsson, How well does ESPlorer handle uploads on the ESP32?

I haven't actually used that, but last time I saw someone try to use it, it crashed badly in the file module. I think that was due to the stack corruption we discussed a few months ago, but I haven't been back to check on it. I'm split between two projects at work, and multiple sub-projects within each, and none of which are close to an ESP at the moment :( I wish you continued good recovery, @TerryE .

TerryE commented 4 years ago

@seregaxvm,

Maybe control buffering via uart.on method argument? data for unbuffered and code for buffered

If you trace through the ESPlorer java code etc., it assembles the Lua interactive _up() function call followed by 10mSec delay then issues first upload binary block. It then issues each successive binary block on receipt of a valid CRC return. This is all very time critical since what the lack of handshake between the <Lua _up()> record and the first data block means that the lua record must be processed so that the on.data() can be executed before input.c processes the next record, so adding an extra argument to on.data() doesn't help.

As you mentioned above I can easily create a stripped down version of the python tool and even a variant which reengineers the ESPlorer algo, as I can rapid prototype with Python. And this is what I will have to do.

HHHartmann commented 4 years ago

Gregor, you need to apply the patch that I discussed on the other issue #2964 (comment) to fix the \0 issue

@TerryE If I read the code correctly it fixes the \0issue but only to exchange it with a \255 issue. @input,c around Line 187

        if(  ins.line_pos >= ins.len ||
            (ins.data_len > 0 && ins.line_pos >= ins.data_len) ||
             ch == ins.end_char ) {
          ins.uart_cb(ins.data, ins.line_pos);
          ins.line_pos = 0;
        }

the check ch == ins.end_char is unconditional and so calls the cb for ch = 255 It should be changed to (ins.data_len == 0 && ch == ins.end_char )

TerryE commented 4 years ago

Mea culpa. Since end_char was only used in one routine l_uart_on() in the modules/uart.c I moved into this scope and somehow changed it from unit_16 to char. I need to do a review of these changes :unamused:

PS. Not quite correct, checking my notes. If the len is non-zero then the char is ignored. Still needs a cross module review.

PPS. The syntax uart.on(method, [number/end_char], [function], [run_input]) is flawed IMO, because this (and the module code) allows uart.on('data', func) as valid syntax but the handling is undefined.

TerryE commented 4 years ago

OK, looking at the input.c logic, we have the following combinations:

len delim runflag Action
0 ign 0 Call CB with each input char
n>0 ign 0 Call CB with string of n chars
-1 chr 0 Call CB with string delimited by \
ign ign 1 Call CB with string delimited by valid EOL char and then call Lua interpreter

Note that in this last case if you specify the run flag set then the delimiters are validated but then ignored. In the third case above, in the expression

ins.line_pos >= ins.len ||
(ins.data_len > 0 && ins.line_pos >= ins.data_len) ||
ch == ins.end_char 

this is equivalent to buffer overflow || fixed len reached || end char matched. These last two should be exclusive so as you say the end char test should be guarded by a data_len test:

        if(  ins.line_pos >= ins.len ||
            (ins.data_len >= 0 && ins.line_pos >= ins.data_len) ||
            (ins.data_len  < 0 && ch == ins.end_char )) {

Setting end_char to ~0 if the len is specified is irrelevant. Sorry

HHHartmann commented 4 years ago

Currently in app/modules/uart.c in the case of a character given the following code is executed. need_len = 0; So to meet the test above this would need to be set to -1. Why not just leave it at 0 and change the test to

        if(  ins.line_pos >= ins.len ||
            (ins.data_len > 0 && ins.line_pos >= ins.data_len) ||
            (ins.data_len  <= 0 && ch == ins.end_char )) {

Ah thinking a bit longer leads me to using -1 for data_len which also would cover the uart.on('data', func) case that you pointed out. Alternatively we could also raise an error if none or both are specified.

Could we get this in dev as a separate PR?

HHHartmann commented 4 years ago

Oh no, I looked at the master version. In dev it is need_len = -1.

So I will prepare a PR tomorrow with the updated condition as you stated above.

TerryE commented 4 years ago

Gregor, I am working through a load of tidy up in this area as we discuss this, and I've already made these changes in my working copy. It will really complicate things if two of us are doing PRs converring this.

HHHartmann commented 4 years ago

Terry I see the problem, however this is sort of a showstopper on dev currently as the binary serial communication is broken. Maybe you could get a minimal subset on dev before landing your big PR? Or is it only a matter of days?

TerryE commented 4 years ago

Gregor, if you have a quick fix, then go with it.

HHHartmann commented 4 years ago

OK. after applying the discussed fix I can upload the LFS image using ESPlorer but then #2973 strikes and I still can't do anything.

matsievskiysv commented 4 years ago

@HHHartmann did you check image hashsum?

TerryE commented 4 years ago

I've done the tweaks as discussed to input.c and some other tidy up. ESPlorer upload works, as does nodemcu-tool. Unfortunately nodemcu-uploader doesn't; not sure why, but I don't know my way around the code well enough to debug this. For testing I did a stripped-down uploader, test-upload.lua which works fine, as does LFS loading:

~/nodemcu-firmware$ luac.cross -f -o /tmp/ftpserver.img  \
                      lua_examples/lfs/[d_]*.lua lua_modules/ftp/ftpserver.lua
~/nodemcu-firmware$ test-upload /tmp/ftpserver.img
/dev/ttyUSB0 connected to fd:3
............................................................................................................
12075 bytes written to ftpserver.img
~/nodemcu-firmware$ nodemcu-uploader terminal
--- Miniterm on /dev/ttyUSB0  115200,8,N,1 ---
--- Quit: Ctrl+] | Menu: Ctrl+T | Help: Ctrl+T followed by Ctrl+H ---

> node.startupcommand('node.flashindex("_init")()')
> node.flashreload'ftpserver.img'

...
NodeMCU 3.0.0.0 
    branch: dev-lua53-2
    commit: 18282e2a2a7689f7e7c461eb19813d92ace92d42
    release: 2.0.0-master_20170202 +456
    release DTS: 201911152024
    SSL: false
    build type: float
    LFS: 0x0
    modules: adc,bit,file,gdbstub,gpio,net,node,tmr,uart,wifi
 build 2019-12-07 18:27 powered by Lua 5.1.4 on SDK 3.0.1-dev(fce080e)

> -- Note how the _init module is now executed automatically
> =LFS
table: 0x3fff09a8
> =unpack(LFS._list)
dummy_strings   _init   ftpserver
> =LFS.ftpserver
function: 0x3ffefe28

BTW, this is on my working branch. I am not sure why LFS is stalling for you Gregor. Sorry.

The make LUA=53 version also works, but the _init module isn't needed.

HHHartmann commented 4 years ago

@TerryE I had applied the changes to dev, so that't why I hit that other bug.

TerryE commented 4 years ago

Gregor, we need to merge in my 5.3 PR.

HHHartmann commented 4 years ago

Terry I think that would be a good idea. It fixes several annoying bugs and will bring us forward 51/53 ways.

NicolSpies commented 4 years ago

@HHHartmann, I used current dev for bin and LFS image build using Docker, still experiencing the mentioned LFS image upload problem using ESPlorer. ESPlorer info shows java.lang.NullPointerException. The uploading process hangs after a few 100 bytes.

Any advice on how to proceed ?

marcelstoer commented 4 years ago

@NicolSpies see Terry's comment about. His fix is not on dev yet but in his 5.3 branch.

NicolSpies commented 4 years ago

@marcelstoer ,thanks, again. In the meantime I have started using nodeMCU-tool via bash for windows.

TerryE commented 4 years ago

Fixed in #3075

NicolSpies commented 4 years ago

@TerryE , just confirmation feedback, the LFS upload using Esplorer is functional again and appears to be better than before. The #3075 fixing commit seems to speed-up the upload of all files. The LFS flashing is also faster. :1st_place_medal:

TerryE commented 4 years ago

@NicolSpies Thanks for the feedback Nicol. The PR definitely some introduced some gremlin that is impacting some automoted upload algos. I am tracking this down. We could really do with an automated test coverage suite, not only for core Lua functions but also for core NodeMCU components such as node, file, wifi and uart