Closed jmattsson closed 5 years ago
Test code:
wifi.start()
tst = net.createServer(net.TCP, 600)
tst:listen(9999, function(conn)
local n = 0
print('Incoming connection')
conn:on('receive', function(sck, data)
n = n + #data
end)
conn:on('disconnection', function()
print('Received a total of '..n..' bytes')
conn:close()
end)
end)
From the host:
$ dd if=/dev/zero bs=1k count=1500 | netcat -q0 172.16.10.67 9999
1500+0 records in
1500+0 records out
1536000 bytes (1.5 MB, 1.5 MiB) copied, 11.0634 s, 139 kB/s
^C
and you can end up with some unexpected results:
Incoming connection
Received a total of 1531692 bytes
in this case 4308 bytes short, aka 3x 1436.
Hmm. Does not appear to be the task_post()
calls that are failing. /digging deeper
...but I do see all the bytes going through the RCVPLUS event. So it's definitely us losing them somewhere.
Spoke too soon, I have now seen fewer bytes go through the RCVPLUS event too! Aaaaah!
Unless we're looking at a pbuf chaining thing again... hmm.
Okay, we're not handling pbuf chains, that explains it. Dammit I fixed that in the esp8266 net module at some point a long time ago already!
Alright, that aspect has been PR'd. Hey @devsaurus any chance of you looking into moving logic from the lwIP callback into the handle_net_event()
function to sort out the thread safety issues? As is, the LVM might pull the rug out from underneath the lwIP callback and have everything crater on us, which isn't ideal.
I'm not currently across this whole SYS_ARCH_PROTECT
stuff, and am also currently quite overworked due to imminent $work deadline.
Oh, and to easily reproduce the tcp-loss, kicking off a wifi scan while receiving those 1.5MB pretty reliably resulted in lost data (before the PR).
@devsaurus before you get tempted to dig in, it looks like I'll need to tackle this regardless since other aspects of this is biting us at $work now. Stay tuned for a second PR next week (I hope!)
Sorry that I missed this known issue from the esp8266 branch.
Oh, I was about to ask what is still pending ;) I have no clue atm tbh..
@devsaurus Nah, the git blame
suggests the clown that stuffed it up in the first place was I, you're in the clear :)
Closing this now that it's all been merged - thanks for the reviews & testing!
8<------------------------ BUG REPORT -----------------------------------------
Expected behavior
Data in a TCP connection should not be lost.
Actual behavior
Occasionally, when receiving lots of data, some data goes missing from within the TCP session. This can be seen for example when using the OTA example for the
otaupgrade
module. At times the image validation fails, and the image size reported does not match the size of the image actually sent to the device.Looking at the
net
module, there appears to be a possibility of losing events due to failingtask_post()
calls. Additionally, we have a bunch of thread-unsafe accesses to the userdata from the lwIP RTOS thread - any lwIP callback runs in the lwIP thread, and ideally it should not touch a Lua userdata what-so-ever, but if it absolutely has to, that would need to be properly mutex guarded.I'll have a look at it and see if I can fix it similarly to the recent uart fixes.
NodeMCU version
dev-esp32
Hardware
n/a
8<------------------------ END BUG REPORT -------------------------------------