nodemcu / nodemcu-firmware

Lua based interactive firmware for ESP8266, ESP8285 and ESP32
https://nodemcu.readthedocs.io
MIT License
7.66k stars 3.12k forks source link

esp32 net module loses TCP data at times #2824

Closed jmattsson closed 5 years ago

jmattsson commented 5 years ago

8<------------------------ BUG REPORT -----------------------------------------

Expected behavior

Data in a TCP connection should not be lost.

Actual behavior

Occasionally, when receiving lots of data, some data goes missing from within the TCP session. This can be seen for example when using the OTA example for the otaupgrade module. At times the image validation fails, and the image size reported does not match the size of the image actually sent to the device.

Looking at the net module, there appears to be a possibility of losing events due to failing task_post() calls. Additionally, we have a bunch of thread-unsafe accesses to the userdata from the lwIP RTOS thread - any lwIP callback runs in the lwIP thread, and ideally it should not touch a Lua userdata what-so-ever, but if it absolutely has to, that would need to be properly mutex guarded.

I'll have a look at it and see if I can fix it similarly to the recent uart fixes.

NodeMCU version

dev-esp32

Hardware

n/a

8<------------------------ END BUG REPORT -------------------------------------

jmattsson commented 5 years ago

Test code:

wifi.start()
tst = net.createServer(net.TCP, 600)
tst:listen(9999, function(conn)
  local n = 0
  print('Incoming connection')
  conn:on('receive', function(sck, data)
    n = n + #data
  end)
  conn:on('disconnection', function()
    print('Received a total of '..n..' bytes')
    conn:close()
  end)
end)

From the host:

$ dd if=/dev/zero bs=1k count=1500 | netcat -q0 172.16.10.67 9999
1500+0 records in
1500+0 records out
1536000 bytes (1.5 MB, 1.5 MiB) copied, 11.0634 s, 139 kB/s
^C

and you can end up with some unexpected results:

Incoming connection
Received a total of 1531692 bytes

in this case 4308 bytes short, aka 3x 1436.

jmattsson commented 5 years ago

Hmm. Does not appear to be the task_post() calls that are failing. /digging deeper

jmattsson commented 5 years ago

...but I do see all the bytes going through the RCVPLUS event. So it's definitely us losing them somewhere.

jmattsson commented 5 years ago

Spoke too soon, I have now seen fewer bytes go through the RCVPLUS event too! Aaaaah!

jmattsson commented 5 years ago

Unless we're looking at a pbuf chaining thing again... hmm.

jmattsson commented 5 years ago

Okay, we're not handling pbuf chains, that explains it. Dammit I fixed that in the esp8266 net module at some point a long time ago already!

jmattsson commented 5 years ago

Alright, that aspect has been PR'd. Hey @devsaurus any chance of you looking into moving logic from the lwIP callback into the handle_net_event() function to sort out the thread safety issues? As is, the LVM might pull the rug out from underneath the lwIP callback and have everything crater on us, which isn't ideal.

I'm not currently across this whole SYS_ARCH_PROTECT stuff, and am also currently quite overworked due to imminent $work deadline.

jmattsson commented 5 years ago

Oh, and to easily reproduce the tcp-loss, kicking off a wifi scan while receiving those 1.5MB pretty reliably resulted in lost data (before the PR).

jmattsson commented 5 years ago

@devsaurus before you get tempted to dig in, it looks like I'll need to tackle this regardless since other aspects of this is biting us at $work now. Stay tuned for a second PR next week (I hope!)

devsaurus commented 5 years ago

Sorry that I missed this known issue from the esp8266 branch.

Oh, I was about to ask what is still pending ;) I have no clue atm tbh..

jmattsson commented 5 years ago

@devsaurus Nah, the git blame suggests the clown that stuffed it up in the first place was I, you're in the clear :)

Closing this now that it's all been merged - thanks for the reviews & testing!