nanoframework / Home

:house: The landing page for .NET nanoFramework repositories.
https://www.nanoframework.net

Wifi Access Point Mode - HttpListener hangs #1335

Open Alex-111 opened 1 year ago

Alex-111 commented 1 year ago

Target name(s)

ESP32-S3 DevkitC-1

Firmware version

latest - 1.8.1.370 ESP32-S3

Was working before? On which version?

No response

Device capabilities

No response

Description

When setting up a SoftAP the HttpListener sometimes does not accept any new requests.

How to reproduce

I started with the provided sample code "WifiAP" and would like to set up a simple Wifi Access Point with a very basic webserver. I tested by connecting my Android phone to the SSID and requesting http://192.168.4.1 in the browser. The first request seems to work...

But especially when I connect my smartphone to another SSID and then return to my nanoFramework AP, no requests are accepted anymore and the browser just hangs.

It seems that the socket listener simply does not return anymore.

Here is my sample code: https://github.com/Alex-111/WiFiAPTest/tree/master

Expected behaviour

I would expect HttpListener to always accept web requests, regardless of whether I connect my smartphone to another Wi-Fi network and later connect it to the SoftAP again.

Screenshots

No response

Additional information

No response

josesimoes commented 1 year ago

@Ellerbach wondering if this is somewhat related (or similar) with the fix you've made the other day on the webserver...

Ellerbach commented 1 year ago

I tested the code and it works as expected for me. First, I opened a browser and went to the 192.168.4.1 page. Then I connected to another SSID. Then I connected back to MySsid and went to the page again:

image

Ellerbach commented 1 year ago

I've repeated this multiple times with different procedures (closing the browser before leaving, leaving it open, refreshing, etc.) and it always worked as expected. So I'm closing this issue. This may be specific to the browser or the phone.

Alex-111 commented 1 year ago

@Ellerbach Thanks for testing!

Please could you tell me more about your setup: What firmware do you use? What device do you use? Maybe it is specific to a special device/firmware combination?

alberk8 commented 1 year ago

I have the same issue as @Alex-111 and I am using an Android phone.

Alex-111 commented 1 year ago

@josesimoes As more people have this issue, I think we should investigate a little more before closing it?

josesimoes commented 1 year ago

@Alex-111 : @Ellerbach owns this issue, up to him. 😉

Ellerbach commented 1 year ago

Firmware: ESP32_REV0-1.8.1.419
Device: ESP32 (a basic one)
Phone: iPhone

So let me reopen the issue, I'll try with other devices then.

Ellerbach commented 1 year ago

This time I tried with an ESP32-S3.
Firmware: ESP32_S3-1.8.1.375
Phone: iPhone

Still works as expected!

Ellerbach commented 1 year ago

Just tried with an Android phone (Samsung) and it also works as expected. I tried with the ESP32-S3. Same scenario: connecting to the SSID, confirming that I want to use the network without internet, going to 192.168.4.1, getting the page. Then connecting to another SSID, doing something, connecting back to MySsid, again confirming I want to use the network without internet, going to 192.168.4.1, and the page loads perfectly.

So I'm really not sure what's happening with both @alberk8 and @Alex-111 but I cannot reproduce your problem with ESP32, ESP32-S3, iPhone and Android!

alberk8 commented 1 year ago

Are you closing and opening the web page again?

To replicate

  1. Connect to nf AP
  2. Open browser to http://192.168.4.1 (web page loads)
  3. Change to another AP and wait for a few seconds
  4. Change back to nf AP.
  5. Go back to the page in step 2 and refresh. On Android I just swipe down.
  6. The page will be loading.........

Alex-111 commented 1 year ago

Are you closing and opening the web page again?

To replicate

  1. Connect to nf AP
  2. Open browser to http://192.168.4.1 (web page loads)
  3. Change to another AP and wait for a few seconds
  4. Change back to nf AP.
  5. Go back to the page in step 2 and refresh. On Android I just swipe down.
  6. The page will be loading.........

Take special care at step 5: sometimes the page appears as expected because of the browser cache, but you still see the loading indicator, i.e. the browser cannot get data... Also, no request is visible in the debug output anymore. Maybe it is also related to the hardware configuration. I think @alberk8 and I are using a device without PSRAM (ESP32-S3-DevkitC-1 in my case).

alberk8 commented 1 year ago

Additional context: if I wait long enough, around 5 minutes, there is an error. A new listener is created and then the page refreshes without issue. The same thing also happens when I run the app on ESP32 or ESP32_S3, with or without PSRAM.

listener.GetContext()
Get Context 1, this is next line after the _listener.GetContext()
    ++++ Exception System.Net.Sockets.SocketException - 0x00000000 (4) ++++
    ++++ Message:
    ++++ System.Net.InputNetworkStreamWrapper::Read_HTTP_Line [IP: 015a] ++++
    ++++ System.Net.HttpListenerRequest::ParseHTTPRequest [IP: 000d] ++++
    ++++ System.Net.HttpListenerContext::get_Request [IP: 000d] ++++
    ++++ WifiAP.WebServerSimple::RunServer [IP: 0031] ++++
Request:
Process Request Ends
    ++++ Exception System.Net.Sockets.SocketException - CLR_E_FAIL (4) ++++
    ++++ Message:
    ++++ System.Net.Sockets.NativeSocket::send [IP: 0000] ++++
    ++++ System.Net.Sockets.Socket::Send [IP: 0018] ++++
    ++++ System.Net.Sockets.NetworkStream::Write [IP: 0051] ++++
    ++++ System.Net.HttpListenerResponse::SendHeaders [IP: 003f] ++++
    ++++ System.Net.HttpListenerResponse::Close [IP: 0010] ++++
    ++++ WifiAP.WebServerSimple::RunServer [IP: 0031] ++++
System.Net.Sockets.SocketException: Exception was thrown: System.Net.Sockets.SocketException

Ellerbach commented 1 year ago

Are you closing and opening the web page again?

I did, with several variations:

All worked as expected! The ESP32 device I'm using does not have PSRAM, it's the very basic one; the ESP32-S3 is a DevKit-M. It works fine with Edge as the browser on both iPhone and Android! So I'm sorry, but I really can't reproduce this :-( That would make things much easier!

Alex-111 commented 1 year ago

Yes. It is very strange that it works without issues on your side, but I have exactly the same situation as @alberk8. So let's think again: what is the difference?

My setup: image

The packages I use:

packages\nanoFramework.Iot.Device.DhcpServer.1.2.300\lib\Iot.Device.DhcpServer.dll True
<Reference Include="mscorlib, Version=1.14.3.0, Culture=neutral, PublicKeyToken=c07d481e9758c731">
  <HintPath>packages\nanoFramework.CoreLibrary.1.14.2\lib\mscorlib.dll</HintPath>
  <Private>True</Private>
</Reference>
<Reference Include="nanoFramework.ResourceManager, Version=1.2.13.0, Culture=neutral, PublicKeyToken=c07d481e9758c731">
  <HintPath>packages\nanoFramework.ResourceManager.1.2.13\lib\nanoFramework.ResourceManager.dll</HintPath>
  <Private>True</Private>
</Reference>
<Reference Include="nanoFramework.Runtime.Events, Version=1.11.6.0, Culture=neutral, PublicKeyToken=c07d481e9758c731">
  <HintPath>packages\nanoFramework.Runtime.Events.1.11.6\lib\nanoFramework.Runtime.Events.dll</HintPath>
  <Private>True</Private>
</Reference>
<Reference Include="nanoFramework.Runtime.Native, Version=1.6.6.0, Culture=neutral, PublicKeyToken=c07d481e9758c731">
  <HintPath>packages\nanoFramework.Runtime.Native.1.6.6\lib\nanoFramework.Runtime.Native.dll</HintPath>
  <Private>True</Private>
</Reference>
<Reference Include="nanoFramework.System.Collections, Version=1.5.18.0, Culture=neutral, PublicKeyToken=c07d481e9758c731">
  <HintPath>packages\nanoFramework.System.Collections.1.5.18\lib\nanoFramework.System.Collections.dll</HintPath>
  <Private>True</Private>
</Reference>
<Reference Include="nanoFramework.System.Text, Version=1.2.37.0, Culture=neutral, PublicKeyToken=c07d481e9758c731">
  <HintPath>packages\nanoFramework.System.Text.1.2.37\lib\nanoFramework.System.Text.dll</HintPath>
  <Private>True</Private>
</Reference>
<Reference Include="System.Device.Gpio, Version=1.1.28.0, Culture=neutral, PublicKeyToken=c07d481e9758c731">
  <HintPath>packages\nanoFramework.System.Device.Gpio.1.1.28\lib\System.Device.Gpio.dll</HintPath>
  <Private>True</Private>
</Reference>
<Reference Include="System.Device.Wifi, Version=1.5.54.0, Culture=neutral, PublicKeyToken=c07d481e9758c731">
  <HintPath>packages\nanoFramework.System.Device.Wifi.1.5.54\lib\System.Device.Wifi.dll</HintPath>
  <Private>True</Private>
</Reference>
<Reference Include="System.IO.Streams, Version=1.1.38.0, Culture=neutral, PublicKeyToken=c07d481e9758c731">
  <HintPath>packages\nanoFramework.System.IO.Streams.1.1.38\lib\System.IO.Streams.dll</HintPath>
  <Private>True</Private>
</Reference>
<Reference Include="System.Net, Version=1.10.52.0, Culture=neutral, PublicKeyToken=c07d481e9758c731">
  <HintPath>packages\nanoFramework.System.Net.1.10.52\lib\System.Net.dll</HintPath>
  <Private>True</Private>
</Reference>
<Reference Include="System.Net.Http">
  <HintPath>packages\nanoFramework.System.Net.Http.Server.1.5.97\lib\System.Net.Http.dll</HintPath>
</Reference>
<Reference Include="System.Net.Sockets.TcpClient">
  <HintPath>packages\nanoframework.System.Net.Sockets.TcpClient.1.1.52\lib\System.Net.Sockets.TcpClient.dll</HintPath>
</Reference>
<Reference Include="System.Threading, Version=1.1.19.33722, Culture=neutral, PublicKeyToken=c07d481e9758c731">
  <HintPath>packages\nanoFramework.System.Threading.1.1.19\lib\System.Threading.dll</HintPath>
  <Private>True</Private>
</Reference>
<Reference Include="Windows.Storage">
  <HintPath>packages\nanoFramework.Windows.Storage.1.5.33\lib\Windows.Storage.dll</HintPath>
</Reference>
<Reference Include="Windows.Storage.Streams">
  <HintPath>packages\nanoFramework.Windows.Storage.Streams.1.14.24\lib\Windows.Storage.Streams.dll</HintPath>
</Reference>

Same situation in debugger or without debugger attached...

@Ellerbach Any idea what else we could check?

Alex-111 commented 1 year ago

@Ellerbach @alberk8 I've done some new tests and want to share my observations:

Unfortunately I still do not know what exactly causes the hang. But maybe some of you can investigate the native code. To me it really looks like the socket Accept does not return.

Any idea what happens in this socket code, if there are two requests in parallel? Is it ensured that no request is lost?

image

Ellerbach commented 1 year ago

Any idea what happens in this socket code, if there are two requests in parallel? Is it ensured that no request is lost?

The sample is written in a very simple way and is not meant to scale. Use the "real" WebServer nuget to handle multiple parallel requests at the same time. That comes at the cost of size. The sample shows how to set up the device in the typical case where a single phone connects a single time :-) And where you can retry by just rebooting the device.

Glad you figured out a way. A PR to improve the robustness of the sample is always welcome!

Alex-111 commented 1 year ago

@Ellerbach I'm aware of the drawbacks of this simple webserver, but regardless of which webserver I use, the issue stays....

I also tried your webserver nuget, but when looking at the code of the full featured webserver there is no difference. Both use the HttpListener, which in my opinion has the same problem in this case.... There is the same "_listener.GetContext()" which just does not return in that case... i.e. this has nothing to do with the webserver itself...

Ellerbach commented 1 year ago

I also tried your webserver nuget, but when looking at the code of the full featured webserver there is no difference. Both use the HttpListener, which in my opinion has the same problem in this case.... There is the same "_listener.GetContext()" which just does not return in that case... i.e. this has nothing to do with the webserver itself...

Let me look at this as well then. Note that on the ESP side there is also bad socket behavior that is related to Espressif and that we cannot change. Here is an example:

In this scenario, the behavior comes from how things are managed on the Espressif side, totally independent of anything on the nano side, unfortunately. So you'll see some side effects like this one that you cannot control. This is done differently on devices like the STM32.

Those devices are not meant to be highly scalable as web servers or socket servers, but rather to handle one connection, or at best a few.

Alex-111 commented 1 year ago

@Ellerbach thanks for your answer. This sounds really similar to the issue we have here. But isn't there a way to work around this? E.g. maybe it is possible to set a timeout for the blocking call, so that it does not block forever.

Imagine you have an IoT device which can be configured via SoftAP. If anybody connects and just goes away without closing the socket connection, we would be forced to reset the device. This is really not what we want...

Ellerbach commented 1 year ago

Imagine you have an IoT device which can be configured via SoftAP. If anybody connects and just goes away without closing the socket connection, we would be forced to reset the device. This is really not what we want...

You definitely can add a timeout, that's totally possible, and it will help in your scenario. Still, at a lower level, there are some things that can break. For example, I've been using an ESP based device flashed with WLED (I'm using it for notifications). If I use this device for the tests we're running here (I've tried ;-)), it gets fully blocked. Nothing I can do except reboot it. And it's native C, directly using the Espressif API. So again, those devices are far from perfect! Adding a watchdog, disposing everything through a timer, things like this are definitely good practice in all cases!
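As a rough illustration of the watchdog idea above, here is a minimal sketch of stall detection. The class `RequestWatchdog` and its members are hypothetical names, not part of nanoFramework or of the WifiAP sample; it assumes the server code calls `Begin` before handling a request and `End` afterwards, and that a periodic timer polls `IsStalled`.

```csharp
using System;

// Hypothetical helper (not a nanoFramework API): detects a request that has
// been in flight for longer than an allowed limit.
public class RequestWatchdog
{
    private readonly long _limitTicks;
    private readonly object _sync = new object();
    private bool _inFlight;
    private long _startTicks;

    public RequestWatchdog(TimeSpan limit)
    {
        _limitTicks = limit.Ticks;
    }

    // Call just before a request is processed.
    public void Begin(DateTime now)
    {
        lock (_sync)
        {
            _inFlight = true;
            _startTicks = now.Ticks;
        }
    }

    // Call once the response has been closed.
    public void End()
    {
        lock (_sync)
        {
            _inFlight = false;
        }
    }

    // True only while a request is in flight and has exceeded the limit.
    // An idle server is never reported as stalled.
    public bool IsStalled(DateTime now)
    {
        lock (_sync)
        {
            return _inFlight && (now.Ticks - _startTicks) > _limitTicks;
        }
    }
}
```

A timer callback could call `IsStalled(DateTime.UtcNow)` every few seconds and, when it returns true, dispose the listener or reboot the device. Which recovery action actually unblocks the socket is left open here; the sketch only covers detection.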

Alex-111 commented 1 year ago

@Ellerbach I updated my repro to try to stop the HttpListener on WIFI disconnection. Is this what you mean I should do on timeout? To dispose the HttpListener on some conditions? Or is there another timeout parameter I'm not aware of?

My sample Repro is working better with this new logic, but still there are some situations where it just blocks, even if I dispose the HttpListener and create it again after a WIFI-client connects....

If this is really the best we can get, then I would have expected a little more reliability... Not sure if this is something which could go beyond a hobby project in that case?

Another thought: Couldn't we open a ticket at Espressif, if this is a known issue?

Ellerbach commented 1 year ago

Yes, you basically have to play with all this. You can also add a big try { } catch { } in the Main function as a global recovery mechanism. If you want, you can also periodically restart the webserver. Things like this.

Another thought: Couldn't we open a ticket at Espressif, if this is a known issue?

I'm sure one is open among the 1K+ issues ;-) https://github.com/espressif/esp-idf/issues There are 57 open for sockets alone, and some seem very similar to the problem I describe.
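The "big try/catch with a global mechanism" suggestion could look roughly like the following. This is only a sketch: `Supervisor`, `ServerLoop`, and the parameter names are hypothetical, and the delegate passed in is assumed to create the listener, serve requests, and throw when something goes wrong.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical delegate type for "create the listener and serve requests".
public delegate void ServerLoop();

public static class Supervisor
{
    // Runs the server loop and restarts it after an exception, up to
    // maxRestarts times. Returns the number of restarts performed.
    public static int Run(ServerLoop serverLoop, int maxRestarts, int delayMs)
    {
        int restarts = 0;
        while (true)
        {
            try
            {
                serverLoop();
                return restarts; // the loop ended on its own
            }
            catch (Exception ex)
            {
                Debug.WriteLine("Server loop failed: " + ex.Message);
                if (restarts >= maxRestarts)
                {
                    return restarts; // give up; the caller may reboot the device
                }

                restarts++;
                Thread.Sleep(delayMs); // short pause before recreating the server
            }
        }
    }
}
```

`Main` would call something like `Supervisor.Run(StartAndServe, 5, 1000)`, where `StartAndServe` is the application's own method. Note the limit of this approach: it recovers from exceptions only, so a call that blocks without throwing is not covered.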

networkfusion commented 8 months ago

IDF has been updated since the last comments. Is this still blocked?

Ellerbach commented 5 months ago

As it's been 3 months since the last feedback on this issue, I'm closing it. If the problem persists, feel free to reopen it.

Alex-111 commented 3 months ago

@Ellerbach @AdrianSoundy It could not be tested because of https://github.com/nanoframework/Home/issues/1488. Now, with the latest firmware, it still stops responding after a few requests in my tests with ESP_S3, using the WifiAP project from the samples. It seems ticket 1488 is still not 100% fixed, so we cannot debug code with Visual Studio at the moment. I will test more when debugging works again.

Ellerbach commented 2 months ago

So, reopening the issue. Thanks for providing updates.

Alex-111 commented 2 months ago

@Ellerbach

now #1493 is fixed and I did some further tests with my S3 and the WIFIAP sample code. When it hangs it always blocks at this line and does not return from writing to the stream. To make it block I just have to refresh the webpage (with "pull to refresh") from my Android phone about 2 or 3 times. After this it completely hangs and it has to be rebooted:

image

Any ideas why this could happen? It feels like a deadlock.

Edit: I left the debugger running and found that after some minutes, maybe 10 or 15, the blocking code (writing to the stream) returns with:

    ++++ Exception System.Net.Sockets.SocketException - CLR_E_FAIL (4) ++++
    ++++ Message:
    ++++ System.Net.Sockets.NativeSocket::send [IP: 0000] ++++
    ++++ System.Net.Sockets.Socket::Send [IP: 0018] ++++
    ++++ System.Net.Sockets.NetworkStream::Write [IP: 0051] ++++
    ++++ System.Net.OutputNetworkStreamWrapper::Write [IP: 0022] ++++
    ++++ WifiAP.WebServer::OutPutByteResponse [IP: 001d] ++++
    ++++ WifiAP.WebServer::ProcessRequest [IP: 0070] ++++
    ++++ WifiAP.WebServer::RunServer [IP: 003b] ++++
Exception thrown: 'System.Net.Sockets.SocketException' in System.Net.dll
An unhandled exception of type 'System.Net.Sockets.SocketException' occurred in System.Net.dll

Ellerbach commented 2 months ago

It definitely requires some investigations. And will require to instrument for debug the web server. If you are willing to, here is what I have in mind:

Alex-111 commented 2 months ago

@Ellerbach Meanwhile I had a look at the code and it seems to block here:

image

From my understanding this is not directly related to the webserver, but to the HttpListener.

response is of type HttpListenerResponse, and in this line it writes directly to the stream, which seems to be a NetworkStream -> Socket behind the scenes. So I fear we are here already on the native side?

Ellerbach commented 2 months ago

So I fear we are here already on the native side?

Check first on the WebServer side; there may be a way to prevent this from happening, for example if the stream is not properly disposed or something like that. Then, yes, it's about following the rabbit hole the same way through the HTTP stack and then the native code.

Alex-111 commented 2 months ago

@Ellerbach I just tried to debug the managed side and pulled System.Net and System.Net.Http. I could get it to compile after upgrading all nuget packages, and I could also deploy it via Visual Studio. But the debugger does not attach anymore. There is also no error in the Visual Studio logs.

On the serial line I see the following output. Seems that native code doesn't start anymore:

ESP-ROM:esp32s3-20210327
Build:Mar 27 2021
rst:0x1 (POWERON),boot:0x8 (SPI_FAST_FLASH_BOOT)
SPIWP:0xee
mode:DIO, clock div:1
load:0x3fce3818,len:0x1380
load:0x403c9700,len:0x4
load:0x403c9704,len:0xba4
load:0x403cc700,len:0x2c5c
SHA-256 comparison failed:
Calculated: 4020fa8290bd1c9845aee04dd4720555b4e4e5abf4f130e917b7f0c9a86e863e
Expected: ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
Attempting to boot anyway...
entry 0x403c98f0

I (45) boot: .NET nanoFramework 2nd stage bootloader ESP-IDF v5.1.3
I (45) boot: build Jun 10 2024 08:56:51
I (45) boot: chip revision: v0.1
I (49) boot.esp32s3: Boot SPI Speed : 80MHz
I (54) boot.esp32s3: SPI Mode : DIO
I (59) boot.esp32s3: SPI Flash Size : 8MB
I (63) boot: Enabling RNG early entropy source...
I (69) boot: Partition Table:
I (72) boot: ## Label Usage Type ST Offset Length
I (80) boot: 0 nvs WiFi data 01 02 00009000 00006000
I (87) boot: 1 phy_init RF data 01 01 0000f000 00001000
I (95) boot: 2 factory factory app 00 00 00010000 001a0000
I (102) boot: 3 deploy Unknown data 01 84 001b0000 002e0000
I (110) boot: 4 config Unknown data 01 82 00490000 00200000
I (117) boot: End of partition table
I (121) esp_image: segment 0: paddr=00010020 vaddr=3c0d0020 size=24f78h (151416) map
I (147) esp_image: segment 1: paddr=00034fa0 vaddr=3fc99e00 size=03ea8h ( 16040) load
I (149) esp_image: segment 2: paddr=00038e50 vaddr=40374000 size=071c8h ( 29128) load
I (157) esp_image: segment 3: paddr=00040020 vaddr=42000020 size=ce614h (845332) map
I (256) esp_image: segment 4: paddr=0010e63c vaddr=4037b1c8 size=0ebech ( 60396) load
I (265) esp_image: segment 5: paddr=0011d230 vaddr=600fe000 size=00064h ( 100) load
I (275) boot: Loaded app from partition at offset 0x10000
I (275) boot: Disabling RNG early entropy source...

Any ideas? Is this the right way to debug the managed framework code?

alberk8 commented 2 months ago

@Alex-111 , You should be able to debug as usual via VS when the app is deployed. It is easier (faster) to get support if you go to nF Discord server.

Ellerbach commented 2 months ago

@Alex-111 all libs should be all up to date as we do have automations for that. So not sure what's happening!

AdrianSoundy commented 2 months ago

The WiFiAP sample lacks some error handling which shows up when you quickly refresh the page.

There should be error handling in webserver.cs in ProcessRequest(): a try/catch around the main switch and another try/catch around the response.Close();

Maybe response.Close(); should do its own exception handling internally to make sure the socket handle is closed.

From my testing it eliminates the exceptions causing a problem and the hangs from uncaught exceptions. You will always get exceptions when refreshing pages as the socket can be closed by browser when writing a response.

Maybe some more testing can be done with this change and the sample updated.

Blocking on the write can mean the browser is no longer reading from the socket but the socket is still open. These sorts of things time out eventually. You need a web server that can process multiple requests. For WiFiAP to handle that, ProcessRequest() would need to run on a separate worker thread.
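The two suggestions above (guarding the request handling and `response.Close()` separately, and moving `ProcessRequest()` to a worker thread) might be sketched as follows. This is not the sample's actual code: `GuardedServer`, `Step`, and `TryStep` are hypothetical names, the HTML body is a placeholder, and the listener is assumed to be created and started elsewhere.

```csharp
using System;
using System.Diagnostics;
using System.Net;
using System.Text;
using System.Threading;

// Hypothetical delegate for one guarded unit of work.
public delegate void Step();

public static class GuardedServer
{
    // Runs a step and reports failure instead of letting the exception escape.
    public static bool TryStep(Step step, string what)
    {
        try
        {
            step();
            return true;
        }
        catch (Exception ex)
        {
            Debug.WriteLine(what + " failed: " + ex.Message);
            return false;
        }
    }

    // Accept loop: each context is handled on its own worker thread, so a
    // response that blocks on write does not stop new requests being accepted.
    public static void RunServer(HttpListener listener)
    {
        while (listener.IsListening)
        {
            HttpListenerContext context = listener.GetContext();
            if (context == null)
            {
                continue;
            }

            new Thread(() => ProcessRequest(context)).Start();
        }
    }

    public static bool ProcessRequest(HttpListenerContext context)
    {
        // First guard: building and writing the response.
        bool written = TryStep(() =>
        {
            byte[] body = Encoding.UTF8.GetBytes("<html><body>OK</body></html>");
            context.Response.ContentType = "text/html";
            context.Response.ContentLength64 = body.Length;
            context.Response.OutputStream.Write(body, 0, body.Length);
        }, "Write");

        // Second guard: always attempt Close, even after a failed write.
        bool closed = TryStep(() => context.Response.Close(), "Close");
        return written && closed;
    }
}
```

The worker thread keeps the accept loop responsive, but each blocked write still holds a thread and a socket until it times out, so on a small device the number of concurrent workers would probably need a cap.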

Alex-111 commented 2 months ago

WebServer.cs.txt @AdrianSoundy Thanks for the hint. I tried them all:

Try Catch does not catch any exceptions in my case. It still is just hanging in ...Outputstream.Write.

I tried to go deeper into the nanoframework libraries, but after referencing System.Net and System.Net.Http as source code the debugger does not attach anymore. It just fails after deploying....

EDIT: I also tried to execute response.Close() in another thread, when the ...Outputstream.Write hangs. But the "hanging" is not released.... This very hacky test is attached as file in my comment -> see webserver.cs.txt