nodemcu / nodemcu-firmware

Lua based interactive firmware for ESP8266, ESP8285 and ESP32
https://nodemcu.readthedocs.io
MIT License
7.66k stars 3.12k forks source link

ESP32: Race between socket accept and receive #1970

Closed fwheinz closed 7 years ago

fwheinz commented 7 years ago

Expected behavior

When opening a TCP Listen socket and a client connects and immediately sends data, there should be a way to receive that data.

Actual behavior

Data sent is lost, since the registration of the receive callback most of the time happens after first data from the client arrived, if it immediately starts sending (as for example a web browser):

$ wget http://172.17.10.59/ --2017-05-12 15:00:09-- http://172.17.10.59/ Connecting to 172.17.10.59:80... connected. HTTP request sent, awaiting response...

This doesn't happen when manually testing with telnet, since there is a delay between the connect and sending the data:

$ telnet 172.17.10.59 80 Trying 172.17.10.59... Connected to 172.17.10.59. Escape character is '^]'. GET / HTTP/1.1 HTTP/1.0 200 OK Content-Type: text/html

<h1> Hello, NodeMCU.</h1> Connection closed by foreign host.

Test code

This is the example webserver code from the front page

-- a simple HTTP server
srv = net.createServer(net.TCP)
srv:listen(80, function(conn)
    conn:on("receive", function(sck, payload)
        print(payload)
        sck:send("HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<h1> Hello, NodeMCU.</h1>")
    end)
    conn:on("sent", function(sck) sck:close() end)
end)

NodeMCU version

dev-esp32 But I strongly suspect that other branches are affected, too.

LDmitryL commented 7 years ago

@Skyp3 maybe we are talking about one and the same?

1969

fwheinz commented 7 years ago

@LDmitryL Yes, this indeed looks related.

jmattsson commented 7 years ago

Tagging @djphoenix into the discussion.

This sounds odd to me. We don't actually tcp_accept() the new socket until right before the listen/accept callback is invoked. Are we hitting a multi-core race here where the other core goes off and receives data while we're running the Lua callback? Could we fix this by only calling tcp_accept() after we've run the callback? Of course, we'd need to deal with the scenario where the callback throws an error in that case to ensure we don't lose a pcb...

LDmitryL commented 7 years ago

I tried to connect TCP server separately. And always when there is a pause between the fact of connection and send command to GET all the transfers are done. But if you give the GET immediately after joining, we don't get events with receive data.

djphoenix commented 7 years ago

@jmattsson I think tcp_accept may be safe moved after Lua callback because exception in CB anyway causes reboot of module. Multi-core race is possible reason for this issue. Maybe some logging in net module may help to discover it.

jmattsson commented 7 years ago

I've whipped up a completely untested branch here since I haven't got my kit available at the moment. Could someone see whether things get better or worse with that branch?

LDmitryL commented 7 years ago

@jmattsson There are no changes. In my version of the http server are all described errors are present.

jmattsson commented 7 years ago

Thanks for testing! Back to the square one with understanding this problem then. Hmm.

fwheinz commented 7 years ago

@jmattson If you look into the lwip-code, the call to tcp_accepted only decrements the backlog-counter for the listen socket, so it is essentially a no-op for this issue. It indeed smells like the race is already present in lwip, but usually doesn't kick in, since the time gap between accept and registering the receive-callback is small enough.

In the ESP32-Code, however, it is a bit delayed by the call to

from net_accept_cb() from post_net_accept() task_post_medium (net_event, (task_param_t)ev);

Then the net_event-function, which in turn calls the lua-code to register the callback, is eventually called by the scheduler, but most of the time too late, as it seems.

LWIP should provide (or has? I am no expert here) a way to register the callback after a listen socket connect without the possibility of losing data if the receive-callback isn't registered fast enough. If this is not possible in the short run, a workaround could be to call the lua-callback directly from post_net_accept:

--- snip ---

diff --git a/components/modules/net.c b/components/modules/net.c index 3e9611c..06955f3 100644 --- a/components/modules/net.c +++ b/components/modules/net.c @@ -309,7 +310,9 @@ static err_t net_sent_cb(void arg, struct tcp_pcb tpcb, u16_t len) { }

+static void laccept_cb (lua_State L, lnet_userdata ud, struct tcp_pcb newpcb); static bool post_net_accept (lnet_userdata ud, struct tcp_pcb *newpcb) { +laccept_cb (lua_getstate(), ud, newpcb); return true;

--- snap ---

First tests showed, that this works better, but it is of course not a clean solution.

EDIT: Thinking again, maybe this solution is not so bad as it seemed at first glance... Since everything happens in the same thread, there should be no race any more this way.

LDmitryL commented 7 years ago

@Skyp3 I understand that you tweaked the code in net.c and all started to work. And could you share this file? And that is very necessary. I'll compile with it. Can't wait for the official hotfix.

fwheinz commented 7 years ago

Hi LDmitryL,

You just have to add two lines to net.c around the function start of post_net_accept (line 311 ca.):

That made it work for me.

Skyp3

LDmitryL schrieb:

@Skyp3 https://github.com/skyp3 I understand that you tweaked the code in net.c and all started to work. And could you share this file? And that is very necessary. I'll compile with it. Can't wait for the official hotfix.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/nodemcu/nodemcu-firmware/issues/1970#issuecomment-303972461, or mute the thread https://github.com/notifications/unsubscribe-auth/AbU3EiD9n90RnPudOE44DlAt7O0HHQlZks5r9VDWgaJpZM4NZWll.

jmattsson commented 7 years ago

The problem with that particular work-around is that the Lua callback is invoked from a non-Lua thread. Since the Lua VM is decidedbly not reentrant nor concurrent, it's begging for a crash. You might get away with it for some time depending on what else is happening in the system (timers, other sockets, gpios, etc), but sooner or later it will blow up :)

I think I have a "real" solution for this problem. I'm still not set up to test it properly, so if someone with this issue could take this branch for a spin and let us know if it fixes the issue, that would be fabulous.

LDmitryL commented 7 years ago

Yes! This fix improved the net.listen() with Internet browsers. But there are a few comments: If you close the connection after transmitting the page directly in code S:on("receive",function(conn,request) that the browser "error while loading page the connection was reset" If you remove the code from my server conn:close(); (only after timeout in net.createServer(net.TCP,60)) then everything is fine. Apparently it's working Keep-Alive. But I'm in the page header to indicate "Connection: close\r\n". Well and accordingly it is impossible to put the closure on S:on("sent",function(ss,buf) ss:close(); end) causes the error: PANIC: unprotected error in call to Lua API (Nhttpd.lua:115: attempt to index local 'ss' (a nil value)) And if, at the time of request of the page by the browser to issue another request of this page (double click, enter) getting the message Send error: httpd.lua:35: not connected (It is the operator of conn:send(line);) In principle, the right! As I opened the connection to the server first enter and then closed it repeated query. But before the end of the interrupted pages are still lot of things left out. Eventually, at some iteration operator send(line) error occurs PANIC: unprotected error in call to Lua API (httpd.lua:35: not connected) Sometimes with an error about out of memory. It seems like the user's actions should be caught by invoking the send function in pcall() and then forcibly close the connection. But this again brings us back to the situation with browser error described at the beginning.

jmattsson commented 7 years ago

Thanks for testing!

It sounds like we have a problem with losing the internal Lua socket reference, leading to the panic when using ss, and same with the not connected error. I'll see if I can find that bug too, stay tuned!

jmattsson commented 7 years ago

Okay, I've added a commit on that branch, hopefully fixing the nil socket arg error.

Issuing a :close() while there is data queued to be sent will still likely cause the browser to report "error while loading page, connection reset", but if I didn't screw things up, it should now be possible to hook the :on('sent', cb) and do the :close() from there. It should also prevent the not connected from appearing.

Please keep an eye on memory usage after this - I may have missed a releasing the self-reference somewhere, in which case sockets won't get garbage collected.

fwheinz commented 7 years ago

@jmattsson Is there a similar race when connecting to a server as a client, when the server immediately sends data?

jmattsson commented 7 years ago

...

That's... an excellent question. I'll try to check tonight!

LDmitryL commented 7 years ago

@jmattsson I launched a page with updates every 3 seconds and watching memory leak of about 200 bytes (196-228) with each page request. This size does not fit the request and response. If you reduce memory to the size of 120-150 KB interesting stuff happens with WiFi. Lose connection to the router and a circular connection. The connection does not work until restart. Garbage collection does not help. Here is the last request before loss of communication GET:get.pht, MemUsage:3344 (node.heap()=126344) -what remains after issuance of the page to the client. And then:

I (431582) wifi: bcn_timout,ap_probe_send_start I (434092) wifi: ap_probe_send over, resett wifi status to disassoc I (434092) Wi-Fi: state: run -> init (1) I (434092) wifi: pm stop total sleep time: 0/409873523 I (434092) Wi-Fi: n:0 11, o:11 0, ap:255 255 sta:11 0, prof:1 Disconnected: reason= 200 Disconnected: reason= 201 I (435912) Wi-Fi: n:0 11, o:11 0, ap:255 255 sta:11 0, prof:1 I (435912) Wi-Fi: state: init -> auth (b0) I (435922) Wi-Fi: state: auth -> assoc (0) I (435922) Wi-Fi: state: assoc -> run (10) I (436172) Wi-Fi: connected with ldl_rout, channel 11

And every time a page request is accompanied by a similar gap. Don't know the router or the program?

jmattsson commented 7 years ago

@Skyp3 The answer is no, there isn't. With the client, you register your on() handlers before commencing the connect(), so they're ready to go at the same time.

jmattsson commented 7 years ago

@LDmitryL Okay, I clearly made things worse with that commit. I'm going to have to find some serious time to sit down and analyse the intended life times of Lua objects vs lwIP objects, and the interplay between threads. For now I've reverted the last change.

Do you think it would be worthwhile to get the early-receive fix onto the main dev-esp32 at this point, or is it not worth it until we get the other bit fixed too?

fwheinz commented 7 years ago

@jmattsson

The answer is no, there isn't. With the client, you register your on() handlers before commencing the connect(), so they're ready to go at the same time.

Ah, perfect! Maybe this is an option, too, for listen sockets? Just set the callback at lua level on the listen socket and set it in the lwip-accept-callback for the pcb without having to execute lua code for it?

jmattsson commented 7 years ago

I did consider it, but it would mean a change to the Lua API. The first commit here did seem to fix the lost-data-right-after-accept, and it was fairly non-intrusive and didn't require an API change, so I'm in favour of it at the moment.

Once I can sit down and work through what's going on with the self_ref on the Lua socket (the cause of the nil crash above), I might change my mind, depending on what I might uncover. I'm hoping I won't have to change the net API, it's been pretty stable so far.

fwheinz commented 7 years ago

@jmattsson I noticed that with certain return codes from a lwip-callback data can be rejected by upper layers and can be hold back (see lwip/core/tcp_in.c:439). Maybe this can be of some use here?

LDmitryL commented 7 years ago

@jmattsson Memory leak was before this hotfix. I think you need to accept these changes. Without them does not work. Well, to watch for memory leaks.

jmattsson commented 7 years ago

I've spent a few hours looking at this this evening. I've made a couple further changes to a) reduce the risk of a race condition being triggered, and b) ensure that after a socket:close() the internal reference is dropped, allowing the object to be garbage collected once the user code stops referencing it.

That said, the more I look at this, the less convinced I am that it's safe to use lwIP in a multi-core setup. I can't see an obvious way of properly locking access to the pcbs between the lwIP thread and the Lua thread. One possible method might be to use the regular poll callback from lwIP, and RTOS-queue everything for processing in that callback. More work/thinking needed. If anyone else wants to get a headache from contemplating multi-core concurrency interactions, feel free to ponder! :)

LDmitryL commented 7 years ago

And is there a possibility to fix specific SRV when calling the srv = net.createServer(net.TCP) srv:listen(80, function(conn) on a single core at the moment of creation? It seems to me that if you start to struggle with multi-core in LUA this can lead to very strong changes in the system! The latter made the changes you posted in https://github.com/jmattsson/nodemcu-firmware/tree/esp32-net-accept-race ? I can download and test?

jmattsson commented 7 years ago

No, lwIP and the Lua VM run as completely separate tasks, and ideally we want all of the low-level stack stuff running on one core, and the slower Lua stuff on the other. It'll just take some thinking and careful adjusting to make it all work in harmony :)

And yeah, feel free to test the latest update. It should be at least as good as before. I want to make one more change to make the socket:close() not tear everything down with a RST rather than a FIN, since with the former you have to have sent a Content-length header to avoid the browser saying it can't load the page. The downside of that change is that low-level sockets will hang around in TIME_WAIT state, using up RAM. It's clear by now that using tcp_abort() rather than tcp_close() is the wrong workaround for that issue however.

LDmitryL commented 7 years ago

@jmattsson Downloaded the latest hotfix. Before this fix, the memory leak was approximately 200 bytes, now it leaks from 1 to 2 kilobytes. But the truth is sometimes observed and a memory reset is also about 1-2 kilobytes. And in the end, the balance is still negative and there is a memory leak! As told to doctors: "the path to recovery is through the aggravation". ;) Need to know more @Skyp3 watches whether it is a memory leak in your version of the server.

LDmitryL commented 7 years ago

@jmattsson In the process of testing this fix I noticed: I periodically page makes to the server 2 of the request. One query has variables the other is not. And here is where were variables in the query, a memory leak was minimal and sometimes negative ( returning memory more than it was before the request). I'm in the HTML page ordered the transfer of one parameter. Memory leak returned to values of about 200 bytes for each request. Apparently the server:on("receive",function(conn,request), not the entire transmit buffer clean at the exit.

devsaurus commented 7 years ago

@Skyp3 @LDmitryL I'm working on a port of net module for ESP32 from raw to netconn API at https://github.com/devsaurus/nodemcu-firmware/tree/esp32-netconn It's work-in-progress but already functional for TCP client and server (not UDP) and passes the test code example above. If possible, please give it a try and feed back your results here.

You might observe a memory leak for each connection, but heap should return within ~2 minutes. We can't decrease this timeout easily until espressif/esp-idf#783 is resolved. Also enable menuconfig item "Component config -> LWIP -> Enable SO_REUSEADDR option".

LDmitryL commented 7 years ago

@devsaurus There are a few questions to this realization: When transferring a large file (parts 1007 bytes) I called in a loop: status, error = pcall(function () conn:send(buf); end); If an error occurs, I just paused and repeated it. Everything worked. Now at this point I get some messages: would block and connection reset Now what you need to do in these cases? When to repeat and when to finish the results at all? Need comments to those mistakes. The operator conn:close() at the end of transmission messages is still not working correctly? In this case, the WWW browser shows error on the query. Without close() the request to the browser transmitted. Probably need to wait until the end of transmission of all the parts and then call close(). But how? After the transmission of the message occurs 3 or 4 events :on("sent"). And sometimes 10. What's the message running net_delete() and netconn_delete() executed when you close and disconnect the connection? A bonus from this implementation of a real acceleration (accelerating from 12 seconds to 4 seconds when a transaction) Here is the acceleration. Thank you for this work!!!

LDmitryL commented 7 years ago

@devsaurus 1-Wire module is not working on this fix. On the old firmware works. In menuconfig module is selected. Error messages and warnings no! Used a simple code to test search all devices

pin = 13; gpio.config( { gpio=pin, dir=gpio.IN, pull=gpio.PULL_UP }) ow.setup(pin); addr = ow.reset_search(pin); count = 0 repeat count = count + 1 addr = ow.search(pin) print("Sens adr=",addr) until (addr == nil) or (count > 5)

Sens adr= nil There is the possibility of independent testing?

LDmitryL commented 7 years ago

At some point in time during the connection:

Guru Meditation Error of type LoadProhibited occurred on core  0. Exception was unhandled.
Register dump:
PC      : 0x400d5b9e  PS      : 0x00060d30  A0      : 0x800d5c35  A1      : 0x3ffb9610  
A2      : 0x3ffbb7f0  A3      : 0x3ffcc818  A4      : 0x00000001  A5      : 0x3ffcace8  
A6      : 0x00000003  A7      : 0x00060023  A8      : 0x00000008  A9      : 0x00000000  
A10     : 0x3ffd70e8  A11     : 0x3ffb9610  A12     : 0x00000036  A13     : 0x3f404ffe  
A14     : 0x68680006  A15     : 0x3ffd1df0  SAR     : 0x0000001a  EXCCAUSE: 0x0000001c  
EXCVADDR: 0x00000028  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0xffffffff  
Backtrace: 0x400d5b9e:0x3ffb9610 0x400d5c32:0x3ffb9640 0x400efb9d:0x3ffb9660 0x400efd05:0x3ffb9680 0x400f0dfc:0x3ffb96b0 0x400d1582:0x3ffb96f0
Rebooting...
LDmitryL commented 7 years ago

Because of the not working module 1W had to go back on https://github.com/jmattsson/nodemcu-firmware/tree/esp32-net-accept-race . It is assembled on the ESP-IDF v2.0-rc1-803-g1e0710f1. A new patch is assembled on v3. Maybe 1W of for this not working?

LDmitryL commented 7 years ago

And one more comment to the new hotfix: If the previous fix browser got an answer, and after a few events :on("sent"), complete transfer of the buffer, advancing the event :on("disconnection"). The new fix it is not. The browser was supposed to close the connection due to the sent header in HTML page Connection: close. Even though I don't close connections server side. Previous fix worked correctly in this case. Some browser requests are not completed!

devsaurus commented 7 years ago

Thanks for the detailed results @LDmitryL!

1-wire: Sounds like a regression during the last IDF updates, I'll check here and follow up.

The would block error could be due to a send buffer overflow. Please avoid calling send() function in a loop, that won't work reliably. Use the sent callback like in this example:

function send_chunk(sock)
  local data = file.read()
  if data then
    sock:send(data, function (c) send_chunk(c) end)
  else
    file.close()
    print("sending done")
  end
end

file.open("file_to_send")
send_chunk(conn)

The running net_delete etc messages are some debugging printfs which I didn't remove so far. They show that closing the connection is sometimes delayed. I'm still investigating what is going on here.

devsaurus commented 7 years ago

@LDmitryL I've fixed the onewire problem on dev-esp32 and rebased my esp32-netconn branch onto it so that onewire should work again with the experimental code.

I'll need to understand the Guru Meditation Error next... Does it happen randomly or always after some time? Does your code do something special when this happens?

LDmitryL commented 7 years ago

It happened once (probably need a longer study). I just sent the connection request. And got a Guru Meditation Error. To repeat this error did not happen.

devsaurus commented 7 years ago

Ok, let's see if it happens again. I added some sanity checks that should prevent dereferencing NULL. In addition, the issue with delayed close from client side should be fixed now.

jmattsson commented 7 years ago

@devsaurus' rewrite of the net module has now been merged, so this issue should be properly fixed now. Give us a yell if it isn't! :)

marcelstoer commented 7 years ago

dev-esp32 But I strongly suspect that other branches are affected, too.

Has this ever been verified?

jmattsson commented 7 years ago

Has this ever been verified?

Not that I'm aware of right now. It could be that the single-core, run-to-completion model on the 8266 shields us from the issue, but I could equally well see that it doesn't, given we still post cross-task there. However, I don't think anyone has complained about it, so that would be a suggestion it's not a problem there.