Open davehorton opened 3 months ago
This problem is reproducible against a few different servers:
But:
Any tips on how to troubleshoot this further? I am sure I am missing something obvious, but I can't figure out from the logs exactly what lws doesn't like that is causing it to drop the connection.
still looking for help on this one....
If I set a breakpoint where the 'drop protocol' is triggered and examine the wsi, this is what it shows, in case it provides any insight to someone who can help:
Thread 8 "freeswitch" hit Breakpoint 4, lws_service_fd_tsi (context=0x7f3230000b70, pollfd=0x7f324cd83020, tsi=0) at /usr/local/src/libwebsockets/lib/core-net/service.c:777
777 lwsl_wsi_debug(wsi, "Close and handled");
(gdb) p wsi
$1 = (struct lws *) 0x7f32300e3ca0
(gdb) p *wsi
$2 = {a = {context = 0x7f3230000b70, vhost = 0x7f3230005ad0, protocol = 0x7f3230005e50, opaque_user_data = 0x0}, http = {new_wsi_list = 0x0, pending_return_headers = 0x0, pending_return_headers_len = 0,
prh_content_length = 0, ah = 0x0, ah_wait_list = 0x0, writeable_len = 0, filepos = 0, filelen = 0, fop_fd = 0x0, multipart_boundary = '\000' <repeats 15 times>, response_code = 0,
request_version = HTTP_VERSION_1_0, conn_type = HTTP_CONNECTION_CLOSE, tx_content_length = 0, tx_content_remain = 0, rx_content_length = 0, rx_content_remain = 0, deferred_transaction_completed = 0,
content_length_explicitly_zero = 0, content_length_given = 0, did_stream_close = 0, multipart = 0, cgi_transaction_complete = 0, multipart_issue_boundary = 0}, h2 = {h2n = 0x0, pending_status_body = 0x0,
h2_state = 0 '\000', END_STREAM = 0 '\000', END_HEADERS = 0 '\000', send_END_STREAM = 0 '\000', long_poll = 0 '\000', initialized = 0 '\000'}, ws = 0x7f32300e4150, mux = {parent_wsi = 0x0, child_list = 0x0,
sibling_list = 0x0, my_sid = 0, child_count = 0, highest_sid = 0, requested_POLLOUT = 0 '\000'}, txc = {tx_cr = 0, peer_tx_cr_est = 0, manual_initial_tx_credit = 0, skint = 0 '\000', manual = 0 '\000'}, lc = {
gutag = "[wsicli|1|WS/h1/default/jambonz-apps.drachtio.org]", '\000' <repeats 13 times>, list = {prev = 0x0, next = 0x0, owner = 0x7f3230001078}, us_creation = 699009227197, log_cx = 0x7f324d5e2f00 <log_cx>},
sul_timeout = {list = {prev = 0x0, next = 0x0, owner = 0x0}, us = 699019233460, cb = 0x7f324d592abd <lws_sul_wsitimeout_cb>, latency_us = 0}, sul_hrtimer = {list = {prev = 0x0, next = 0x0, owner = 0x0}, us = 0,
cb = 0x0, latency_us = 0}, sul_validity = {list = {prev = 0x7f32300f9cd0, next = 0x7f3230000e98, owner = 0x7f3230000e38}, us = 764544286604, cb = 0x7f324d592f96 <lws_validity_cb>, latency_us = 0},
sul_connect_timeout = {list = {prev = 0x0, next = 0x0, owner = 0x0}, us = 0, cb = 0x7f324d597aa8 <lws_client_conn_wait_timeout>, latency_us = 0}, dll_buflist = {prev = 0x0, next = 0x0, owner = 0x0},
same_vh_protocol = {prev = 0x0, next = 0x0, owner = 0x7f3230005ed0}, vh_awaiting_socket = {prev = 0x0, next = 0x0, owner = 0x0}, listen_list = {prev = 0x0, next = 0x0, owner = 0x0}, dll_cli_active_conns = {
prev = 0x0, next = 0x0, owner = 0x0}, dll2_cli_txn_queue = {prev = 0x0, next = 0x0, owner = 0x0}, dll2_cli_txn_queue_owner = {tail = 0x0, head = 0x0, count = 0}, speculative_list = {prev = 0x0, next = 0x0,
owner = 0x0}, speculative_connect_owner = {tail = 0x0, head = 0x0, count = 0}, dns_sorted_list = {tail = 0x0, head = 0x0, count = 0}, sa46_local = {sa4 = {sin_family = 2, sin_port = 32970, sin_addr = {
s_addr = 2265584812}, sin_zero = "\000\000\000\000\000\000\000"}}, sa46_peer = {sa4 = {sin_family = 2, sin_port = 47873, sin_addr = {s_addr = 2944239107}, sin_zero = "\000\000\000\000\000\000\000"}},
parent = 0x0, child_list = 0x0, sibling_list = 0x0, role_ops = 0x7f324d5e1ae0 <role_ops_ws>, seq = 0x0, retry_policy = 0x7f3230000f90, log_cx = 0x0, udp = 0x0, stash = 0x7f32300f1960,
cli_hostname_copy = 0x7f3228002e20 "jambonz-apps.drachtio.org", conmon = {peer46 = {sa4 = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}}, protocol_specific = {
http = {response = 0}}, dns_results_copy = 0x0, ciu_dns = 694, ciu_sockconn = 861, ciu_tls = 4520, ciu_txn_resp = 53166, pcol = LWSCONMON_PCOL_NONE, dns_disposition = LWSCONMON_DNS_OK},
conmon_datum = 699009233426, user_space = 0x7f3228029010, opaque_parent_data = 0x0, buflist = 0x0, buflist_out = 0x0, tls = {ssl = 0x7f32300e5e00, client_bio = 0x7f32300139d0, dll_pending_tls = {prev = 0x0,
next = 0x0, owner = 0x7f3230000f00}, err_helper = '\000' <repeats 31 times>, use_ssl = 1, redirect_to_https = 0}, alpn = '\000' <repeats 23 times>, desc = {sockfd = 94, filefd = 94}, wsistate = 268435737,
wsistate_pre_close = 0, position_in_fds_table = 2, chunk_remaining = 0, flags = 1, cache_secs = 0, bugcatcher = 0, hdr_parsing_completed = 1, mux_substream = 0, upgraded_to_http2 = 0, mux_stream_immortal = 0,
h2_stream_carries_ws = 0, h2_stream_carries_sse = 0, h2_acked_settings = 0, seen_nonpseudoheader = 0, listener = 0, pf_packet = 0, do_broadcast = 0, user_space_externally_allocated = 0,
socket_is_permanently_unusable = 0, rxflow_change_to = 1, conn_stat_done = 0, cache_reuse = 0, cache_revalidate = 0, cache_intermediaries = 0, favoured_pollin = 0, sending_chunked = 0, interpreting = 0,
already_did_cce = 0, told_user_closed = 0, told_event_loop_closed = 0, waiting_to_send_close_frame = 0, close_needs_ack = 0, ipv6 = 0, parent_pending_cb_on_writable = 0, cgi_stdout_zero_length = 0,
seen_zero_length_recv = 0, rxflow_will_be_applied = 0, event_pipe = 0, handling_404 = 0, protocol_bind_balance = 1, unix_skt = 0, close_when_buffered_out_drained = 0, h1_ws_proxied = 0, proxied_ws_parent = 0,
do_bind = 0, validity_hup = 1, skip_fallback = 0, file_desc = 0, conn_validity_wakesuspend = 0, dns_reachability = 0, could_have_pending = 0, outer_will_close = 0, shadow = 0, tls_borrowed = 1,
tls_borrowed_hs = 0, tls_read_wanted_write = 0, do_ws = 1, chunked = 0, client_rx_avail = 0, client_http_body_pending = 0, transaction_from_pipeline_queue = 0, keepalive_active = 1, keepalive_rejected = 0,
redirected_to_get = 0, client_pipeline = 0, client_h2_alpn = 0, client_mux_substream = 0, client_mux_migrated = 0, client_subsequent_mime_part = 0, client_no_follow_redirect = 0,
client_suppress_CONNECTION_ERROR = 0, tls_session_reused = 0, perf_done = 0, close_is_redirect = 0, client_mux_substream_was = 0, ocport = 443, c_port = 443, conn_port = 443, retry = 0, keep_warm_secs = 5,
lws_rx_parse_state = 20 '\024', rx_frame_type = 0 '\000', pending_timeout = 4 '\004', tsi = 0 '\000', protocol_interpret_idx = 0 '\000', redirects = 0 '\000', rxflow_bitmap = 0 '\000',
bound_vhost_index = 0 '\000', lsp_channel = 0 '\000', chunk_parser = 0 '\000', addrinfo_idx = 0 '\000', sys_tls_client_cert = 0 '\000', c_pri = 0 '\000', af = 0 '\000', reason_bf = 0 '\000', peer_route_uidx = 1,
immortal_substream_count = 0 '\000', handling_pollout = 0 '\000', leave_pollout_active = 0 '\000'}
A bit more info: it is bombing out in ops-ws.c at line 1233:
	if (pending) {
		if (lws_is_ws_with_ext(wsi))
			pending = pending > wsi->ws->rx_ubuf_alloc ?
				wsi->ws->rx_ubuf_alloc : pending;
		else
			pending = pending > wsi->a.context->pt_serv_buf_size ?
				wsi->a.context->pt_serv_buf_size : pending;

		if (--sanity)
			goto read;
		else
			/*
			 * Something has gone wrong, we are spinning...
			 * let's bail on this connection
			 */
			return LWS_HPI_RET_PLEASE_CLOSE_ME;
	}
I am not sure exactly what "we are spinning" means -- any help here?
Even if my code does nothing but log a message on each incoming binary frame (without even processing them), I still get this event/error for the specific server scenarios I mentioned above (and not for the others).
OK, I am getting further in understanding this problem. First of all, I mischaracterized the problem above -- it has entirely to do with the use of SSL over a websocket (wss). I experience it every time when using wss as an lws client, and never when using unencrypted TCP, i.e. ws.
From looking at the code, I can see the problem comes here in the rops_handle_POLLIN_ws function. Having done one read operation in response to the poll event, we call this function to see if there is more data to read.
As we can see here, this always returns 0 in the case of an unencrypted websocket connection (TCP). Therefore with TCP we always do one read per poll, and everything works fine.
However, in the case of wss, if SSL_pending returns a non-zero number of pending bytes, then lws loops and does another read. After the second read it checks again, and if there is still more data to read it reads again, and so on, until it reaches a limit of 10 reads within the same poll event. At that point it returns LWS_HPI_RET_PLEASE_CLOSE_ME, which delivers the LWS_CALLBACK_WS_CLIENT_DROP_PROTOCOL event and closes the socket.
The implication is that in a case like mine, with a server returning a large number of small frames very rapidly (again, this is a streaming audio application), this limit is easily reached and the websocket is closed.
I tested by modifying the code in two different ways, each of which eliminated the problem:
Either of those changes seemed to fix my problem -- the full audio stream is heard and the websocket is not closed.
I don't necessarily consider either of those options a good solution, though, without a better understanding of why the current code is written this way. The assumption that all pending data can be read within 10 tries seems perilous to me, given that the server's sending rate may be quite high. Perhaps I am missing something? Surely others have used lws as a client against high-performance streaming servers -- is there some option I am not setting, or some issue in my code, that surfaces this problem in my scenario?
I'd really love some thoughts from the lws team here..
It's like this because on some platforms, loss of the network connection to the peer manifests as endless 0-length receives.
On other platforms, a 0-length TLS receive is a legal event unrelated to loss of network connection.
So, in an attempt to deal with both, there is a sanity check on how often it is spinning, recovering from what would otherwise be an endless loop by dropping the connection.
It sounds like you could take the approach of resetting sanity to 10 (or whatever) on seeing a positive amount of received data, and solve it that way... do you want to try that and propose a patch?
OK, interesting! I did log the length received each time, and I did see an occasional zero-length read, but infrequently -- not often enough to be the cause, I don't think. I will do some more investigating.
I'd be interested in testing this as well as we have an application connecting to multiple servers all running libwebsocket.
We are seeing something similar, with unexplained LWS_CALLBACK_WS_CLIENT_DROP_PROTOCOL happening sporadically.
We have tweaked the timeout poll intervals, and that anecdotally seems to have mitigated, but not eliminated, the issue.
We too are sending many small messages over wss.
I've been using lws for years now, and just today I am having a strange problem connecting as a websocket client to a remote server. My code, as client, is streaming audio to that server and that server is also streaming audio to me. Roughly 2 seconds after connecting to this server I consistently see a LWS_CALLBACK_WS_CLIENT_DROP_PROTOCOL and the connection is dropped from my side.
I am using lws 4.3.3, and I have a full debug log here.
The last bit of the log leading up to receiving the LWS_CALLBACK_WS_CLIENT_DROP_PROTOCOL looks like this
I've connected with the same application to other servers, and I am trying to gain some insight into why this one fails and lws closes the connection. I'd appreciate any thoughts on what I might be doing wrong... Thanks!