warmcat / libwebsockets

canonical libwebsockets.org networking library
https://libwebsockets.org
Other
4.68k stars 1.47k forks source link

Encountering ERR_CONNECTION_Closed to serve bigger file over https #3183

Closed Somritag closed 1 week ago

Somritag commented 3 weeks ago

Hello Andy, I am referring libwebsocket minimal example for https server, trying to open leaf.jpg (appx 7 mb) For SSL i am using Mbedtls

I see its always getting time out to serve this file or any file of large size, while the smaller files get served. we always hit here : / 1: something requested a callback when it was OK to write /

if ((pollfd->revents & LWS_POLLOUT) &&
    lwsi_state_can_handle_POLLOUT(wsi) &&
    lws_handle_POLLOUT_event(wsi, pollfd)) {
    if (lwsi_state(wsi) == LRS_RETURNED_CLOSE)
        lwsi_set_state(wsi, LRS_FLUSHING_BEFORE_CLOSE);
    /* the write failed... it's had it */
    wsi->socket_is_permanently_unusable = 1;

    **return LWS_HPI_RET_PLEASE_CLOSE_ME;**
}

And the __lws_sul_service_ripe() -> function pt->inside_lws_service = 1; hit->cb(hit); // here we are hitting lws_sul_wsitimeout_cb() pt->inside_lws_service = 0;

Attaching the log: image

lws-team commented 3 weeks ago

... could you perhaps spend more than two words in the title to describe your problem?

Somritag commented 3 weeks ago

Updated my issue, that was a hasty mistake.

In summary Https-> server (tried with minimal http server tls example code with a bigger file to load)-> Getting time out every time while serving Http->server (tried with minimal http server example code with a bigger file to load)-> Does not get time out and files get served successfully Post opening a WebSocket -> trying to load bigger file over https-> Server-> Able to serve the files

Is there any limitations in case of only https server?

lws-team commented 3 weeks ago

Is stuff transferring over tls? Eg, if you look with tcpdump, how far does it get?

What do the logs say in the failing case? What is the platform?

Somritag commented 3 weeks ago

Log: https-fail.txt

After some fragments it times out -> The Time is not fixed mostly around

We want to host a webserver on our stm32H5 controller Using libwebsockets for that purpose FreeRtos + LWIP + MbedTLS

Here we are unable to serve a single file. image

lws-team commented 3 weeks ago

Does this help?

https://github.com/warmcat/libwebsockets/pull/3165

Somritag commented 3 weeks ago

Hi Andy this did not fix my issue, I am attaching the log

https_serve_fail_log.txt

lws-team commented 3 weeks ago

It doesn't seem to show why it wants to close when it does. Without a timestamp on the logs, it makes it impossible to understand what is happening close in time, or after some delay.

closing [wsisrv|1|adopted|h1|h2]: parent [null wsi]
_realloc: size 49: lws_metrics_hist_bump_ (free heap 560831)
 [mux|0|h2_sid1_(wsisrv|1|adopted)|default|GET|/], his parent [wsisrv|1|adopted|h1|h2]: siblings:
closing [mux|0|h2_sid1_(wsisrv|1|adopted)|default|GET|/]: parent [wsisrv|1|adopted|h1|h2]

For a timeout there should be a TIMEOUT log before, for other problems they also should log. The question is how did we end up at closing [wsisrv|1|adopted|h1|h2]: parent [null wsi]

Somritag commented 3 weeks ago

https_serve_fail_leaf.txt https_file_size_10mb_error.txt

Attached the logs with probable time stamp : Instance 1 : closing [mux|0|h2sid1(wsisrv|1|adopted)|default|GET|/]: parent [wsisrv|1|adopted|h1|h2] disentangled from sibling [null wsi] [1970/01/01 00:00:32:0000] I: lws_close_free_wsi: shutdown conn: [mux|0|h2sid1(wsisrv|1|adopted)|default|GET|/] (sk -1, state 0x15) real just_kill_connection: sockfd -1 fd 5 events 517 -> 517 lwsi_set_state 0x20000015 -> 0x20000020 cce=1 [1970/01/01 00:00:33:0000] I: rops_destroy_role_h2: [mux|0|h2sid1(wsisrv|1|adopted)|default|GET|/]: ah det due to close [1970/01/01 00:00:33:0000] I: lws_header_table_detach: [mux|0|h2sid1(wsisrv|1|adopted)|default|GET|/]: ah 20053de0 (tsi=0, count = 2) [1970/01/01 00:00:33:0000] D: __lws_header_table_detach: [mux|0|h2sid1(wsisrv|1|adopted)|default|GET|/]: ah held 1720785130s, role/state 0x20000000 0x20, [1970/01/01 00:00:33:0000] I: lws_header_table_detach: nobody usable waiting [1970/01/01 00:00:32:0000] I: _lws_destroy_ah: freed ah 20053de0 : pool length 1 [1970/01/01 00:00:32:0000] I: lws_header_table_detach: [mux|0|h2sid1(wsisrv|1|adopted)|default|GET|/]: ah 20053de0 (tsi=0, count = 1) vh default: count_bound_wsi 2 tsi fds count 2 fd 6 events 517 -> 517 TIMEDOUT WAITING 15, dhdr 0, ah 20052628, wl 0

Instance 2: lws_serve_http_file_fragment: sending 4087 15 secs, reason 10 [1970/01/01 00:00:38:0000] D: lws_h2_frame_write: [mux|2|h2sid1(wsisrv|5|adopted)|default|GET|/] (eff [wsisrv|5|adopted|h1|h2]). typ 0, fl 0x0, sid=1, len=4087, txcr=5890930, nwsi->txcr=15328114 ssl_capable_write (4096) says 4096 [1970/01/01 00:00:38:0000] I: lws_h2_tx_cr_get: [mux|2|h2sid1(wsisrv|5|adopted)|default|GET|/]: own tx credit 5886843: nwsi credit 15324027 [1970/01/01 00:00:38:0000] D: lws_serve_http_file_fragment: sending 4087 15 secs, reason 10 [1970/01/01 00:00:38:0000] D: lws_h2_frame_write: [mux|2|h2sid1(wsisrv|5|adopted)|default|GET|/] (eff [wsisrv|5|adopted|h1|h2]). typ 0, fl 0x0, sid=1, len=4087, txcr=5886843, nwsi->txcr=15324027 ssl_capable_write (4096) says 4096 [1970/01/01 00:00:38:0000] I: lws_h2_tx_cr_get: [mux|2|h2sid1(wsisrv|5|adopted)|default|GET|/]: own tx credit 5882756: nwsi credit 15319940 [1970/01/01 00:00:38:0000] D: lws_serve_http_file_fragment: sending 4087 15 secs, reason 10 [1970/01/01 00:00:38:0000] D: lws_h2_frame_write: [mux|2|h2sid1(wsisrv|5|adopted)|default|GET|/] (eff [wsisrv|5|adopted|h1|h2]). typ 0, fl 0x0, sid=1, len=4087, txcr=5882756, nwsi->txcr=15319940 ssl_capable_write (4096) says 4096 [1970/01/01 00:00:38:0000] I: lws_h2_tx_cr_get: [mux|2|h2sid1(wsisrv|5|adopted)|default|GET|/]: own tx credit 5878669: nwsi credit 15315853 [1970/01/01 00:00:38:0000] D: lws_serve_http_file_fragment: sending 4087 15 secs, reason 10 [1970/01/01 00:00:38:0000] D: lws_h2_frame_write: [mux|2|h2sid1(wsisrv|5|adopted)|default|GET|/] (eff [wsisrv|5|adopted|h1|h2]). typ 0, fl 0x0, sid=1, len=4087, txcr=5878669, nwsi->txcr=15315853 ssl_capable_write (4096) says 4096 sid 1, pending writable sid 0, pending writable fd 6 events 517 -> 519 [1970/01/01 00:00:38:0000] D: lws_serve_http_file_fragment says 0 [1970/01/01 00:00:38:0000] D: already pending writable sid 1, pending writable sid 0, pending writable fd 6 events 519 -> 519 fd 6 events 519 -> 519 TIMEDOUT WAITING 15, dhdr 0, ah 20052660, wl 0

lws-team commented 3 weeks ago

Not sure why this log is different and mentions TIMEDOUT, it seems to be sending up to the end. Every time it sends, it resets the timeout for 15s.

Time seems to go backwards 33 -> 32s on the https_serve_fail_leaf.txt is this because multiple things are pasted in the log?

If the time isn't stable, timeouts won't act right. But not sure that's what happened (ntp happened partway through?)

It says "lws_ssl_capable_write failed: 5" near the end, this seems to imply it was the client side that timed out and closed the tls connection, the server would close the tls connection as part of its general close handling later. It didn't stop because of a server side timeout. So I don't know what the timeout is at the end, the client's timeout?

Somritag commented 3 weeks ago

https_file_serve_sucess_case.txt

Hi Andy , Attached a log where this time it served the file, it happens once in a while , one difference with failed case

TIMEDOUT WAITING 15, dhdr 0, ah 20052630, wl 0

happened in between (line no 3143) but

after closing it reconnects caller: timeout real just_kill_connection A: (sockfd 6) [1970/01/01 00:00:45:0000] I:

[1970/01/01 00:00:45:0000] I: closing [wsisrv|0|adopted|h1|h2]: parent [null wsi] closing child caller: mux child recurse [1970/01/01 00:00:45:0000] D: _realloc: size 49: lws_metrics_histbump (free heap 567895) real just_kill_connection A: (sockfd -1) [1970/01/01 00:00:45:0000] I:

[1970/01/01 00:00:45:0000] I: closing [mux|0|h2sid3(wsisrv|0|adopted)|default|GET|/]: parent [wsisrv|0|adopted|h1|h2] disentangled from sibling [null wsi] real just_kill_connection: sockfd -1 fd 5 events 1 -> 517 lwsi_set_state 0x20000015 -> 0x20000020 cce=1 [1970/01/01 00:00:45:0000] I: rops_destroy_role_h2: [mux|0|h2sid3(wsisrv|0|adopted)|default|GET|/]: ah det due to close [1970/01/01 00:00:45:0000] I: lws_header_table_detach: [mux|0|h2sid3(wsisrv|0|adopted)|default|GET|/]: ah 20053de8 (tsi=0, count = 2) [1970/01/01 00:00:45:0000] D: __lws_header_table_detach: [mux|0|h2sid3(wsisrv|0|adopted)|default|GET|/]: ah held 19s, role/state 0x20000000 0x20, [1970/01/01 00:00:45:0000] I: lws_header_table_detach: nobody usable waiting [1970/01/01 00:00:45:0000] I: _lws_destroy_ah: freed ah 20053de8 : pool length 1 [1970/01/01 00:00:45:0000] I: lws_header_table_detach: [mux|0|h2sid3(wsisrv|0|adopted)|default|GET|/]: ah 20053de8 (tsi=0, count = 1) vh default: count_bound_wsi 2 tsi fds count 2 [1970/01/01 00:00:45:0000] D: SSL_shutdown=1 for fd 6 real just_kill_connection: sockfd 6 fd 5 events 517 -> 517 lwsi_set_state 0x20000119 -> 0x20000020 cce=1 [1970/01/01 00:00:45:0000] I: lws_tls_restrict_return: 1 -> 0 [1970/01/01 00:00:45:0000] N: lws_gate_accepts: on = 0 [1970/01/01 00:00:45:0000] I: rops_destroy_role_h2: [wsisrv|0|adopted|h1|h2]: ah det due to close [1970/01/01 00:00:45:0000] I: lws_header_table_detach: [wsisrv|0|adopted|h1|h2]: ah 20052630 (tsi=0, count = 1) [1970/01/01 00:00:45:0000] D: __lws_header_table_detach: [wsisrv|0|adopted|h1|h2]: ah held 19s, role/state 0x20000000 0x20, [1970/01/01 00:00:45:0000] I: lws_header_table_detach: nobody usable waiting [1970/01/01 00:00:45:0000] I: _lws_destroy_ah: freed ah 20052630 : pool length 0 [1970/01/01 00:00:45:0000] I: lws_header_table_detach: [wsisrv|0|adopted|h1|h2]: ah 20052630 (tsi=0, count = 0) vh default: count_bound_wsi 1 tsi fds count 1 [1970/01/01 00:00:45:0000] D: accepted new conn port 51111 on fd=6 1 29 [1970/01/01 00:00:45:0000] D: _realloc: size 1192: __lws_wsi_create_with_role (free heap 566752)...

And continues ..

But in other failed cases once it losses connection does not revive.

This try was in mozilla though i dont think so its browser dependent .

Somritag commented 3 weeks ago

One more observation defining info.protocols with lws_callback_http vs while we define "http-only" while defining mount shows different behavior. http-only by default points to lws_callback_http_dummy() i suppose does this play role here?

image

lws-team commented 3 weeks ago

The mystery retry can be the browser deciding to have another go because from their pov, it failed partway through (or timed out). It's worth looking at Ctrl-Shift-C on firefox and the network tab to see what it thinks happens on each stream.

All the protocol callbacks should call through to the dummy callback for stuff they didn't handle themselves. I dunno what "different behaviour" means.

Somritag commented 2 weeks ago

image

I see when we try https-> client is issuing a GOAWAY after a FIN set, and then a RST is happening. And No retry (close connection)

But post opening a socket querying the same same page leads to a success

image

Is there any difference the way the https request getting served while socket is open vs socket is not open ?

At times I see retry automatically happens, is there a way to retry in such scenarios?

Somritag commented 1 week ago

Hi Andy, took some time. Indeed, it was a timing out due to improperly setting the lws_now_usecs() and lws_now_secs().

HTTPS server file querry is stable now.

Thanks!