Open davehorton opened 3 months ago
This problem is reproducible against a few different servers:
But:
Any tips on how to troubleshoot this further? I am sure I am missing something obvious, but I can't figure out from the logs exactly what lws doesn't like that is causing it to drop the connection.
still looking for help on this one....
If I set a breakpoint where the 'drop protocol' is triggered and examine the wsi, this is what it shows, in case it provides any insight to someone who can help:
Thread 8 "freeswitch" hit Breakpoint 4, lws_service_fd_tsi (context=0x7f3230000b70, pollfd=0x7f324cd83020, tsi=0) at /usr/local/src/libwebsockets/lib/core-net/service.c:777
777 lwsl_wsi_debug(wsi, "Close and handled");
(gdb) p wsi
$1 = (struct lws *) 0x7f32300e3ca0
(gdb) p *wsi
$2 = {a = {context = 0x7f3230000b70, vhost = 0x7f3230005ad0, protocol = 0x7f3230005e50, opaque_user_data = 0x0}, http = {new_wsi_list = 0x0, pending_return_headers = 0x0, pending_return_headers_len = 0,
prh_content_length = 0, ah = 0x0, ah_wait_list = 0x0, writeable_len = 0, filepos = 0, filelen = 0, fop_fd = 0x0, multipart_boundary = '\000' <repeats 15 times>, response_code = 0,
request_version = HTTP_VERSION_1_0, conn_type = HTTP_CONNECTION_CLOSE, tx_content_length = 0, tx_content_remain = 0, rx_content_length = 0, rx_content_remain = 0, deferred_transaction_completed = 0,
content_length_explicitly_zero = 0, content_length_given = 0, did_stream_close = 0, multipart = 0, cgi_transaction_complete = 0, multipart_issue_boundary = 0}, h2 = {h2n = 0x0, pending_status_body = 0x0,
h2_state = 0 '\000', END_STREAM = 0 '\000', END_HEADERS = 0 '\000', send_END_STREAM = 0 '\000', long_poll = 0 '\000', initialized = 0 '\000'}, ws = 0x7f32300e4150, mux = {parent_wsi = 0x0, child_list = 0x0,
sibling_list = 0x0, my_sid = 0, child_count = 0, highest_sid = 0, requested_POLLOUT = 0 '\000'}, txc = {tx_cr = 0, peer_tx_cr_est = 0, manual_initial_tx_credit = 0, skint = 0 '\000', manual = 0 '\000'}, lc = {
gutag = "[wsicli|1|WS/h1/default/jambonz-apps.drachtio.org]", '\000' <repeats 13 times>, list = {prev = 0x0, next = 0x0, owner = 0x7f3230001078}, us_creation = 699009227197, log_cx = 0x7f324d5e2f00 <log_cx>},
sul_timeout = {list = {prev = 0x0, next = 0x0, owner = 0x0}, us = 699019233460, cb = 0x7f324d592abd <lws_sul_wsitimeout_cb>, latency_us = 0}, sul_hrtimer = {list = {prev = 0x0, next = 0x0, owner = 0x0}, us = 0,
cb = 0x0, latency_us = 0}, sul_validity = {list = {prev = 0x7f32300f9cd0, next = 0x7f3230000e98, owner = 0x7f3230000e38}, us = 764544286604, cb = 0x7f324d592f96 <lws_validity_cb>, latency_us = 0},
sul_connect_timeout = {list = {prev = 0x0, next = 0x0, owner = 0x0}, us = 0, cb = 0x7f324d597aa8 <lws_client_conn_wait_timeout>, latency_us = 0}, dll_buflist = {prev = 0x0, next = 0x0, owner = 0x0},
same_vh_protocol = {prev = 0x0, next = 0x0, owner = 0x7f3230005ed0}, vh_awaiting_socket = {prev = 0x0, next = 0x0, owner = 0x0}, listen_list = {prev = 0x0, next = 0x0, owner = 0x0}, dll_cli_active_conns = {
prev = 0x0, next = 0x0, owner = 0x0}, dll2_cli_txn_queue = {prev = 0x0, next = 0x0, owner = 0x0}, dll2_cli_txn_queue_owner = {tail = 0x0, head = 0x0, count = 0}, speculative_list = {prev = 0x0, next = 0x0,
owner = 0x0}, speculative_connect_owner = {tail = 0x0, head = 0x0, count = 0}, dns_sorted_list = {tail = 0x0, head = 0x0, count = 0}, sa46_local = {sa4 = {sin_family = 2, sin_port = 32970, sin_addr = {
s_addr = 2265584812}, sin_zero = "\000\000\000\000\000\000\000"}}, sa46_peer = {sa4 = {sin_family = 2, sin_port = 47873, sin_addr = {s_addr = 2944239107}, sin_zero = "\000\000\000\000\000\000\000"}},
parent = 0x0, child_list = 0x0, sibling_list = 0x0, role_ops = 0x7f324d5e1ae0 <role_ops_ws>, seq = 0x0, retry_policy = 0x7f3230000f90, log_cx = 0x0, udp = 0x0, stash = 0x7f32300f1960,
cli_hostname_copy = 0x7f3228002e20 "jambonz-apps.drachtio.org", conmon = {peer46 = {sa4 = {sin_family = 0, sin_port = 0, sin_addr = {s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}}, protocol_specific = {
http = {response = 0}}, dns_results_copy = 0x0, ciu_dns = 694, ciu_sockconn = 861, ciu_tls = 4520, ciu_txn_resp = 53166, pcol = LWSCONMON_PCOL_NONE, dns_disposition = LWSCONMON_DNS_OK},
conmon_datum = 699009233426, user_space = 0x7f3228029010, opaque_parent_data = 0x0, buflist = 0x0, buflist_out = 0x0, tls = {ssl = 0x7f32300e5e00, client_bio = 0x7f32300139d0, dll_pending_tls = {prev = 0x0,
next = 0x0, owner = 0x7f3230000f00}, err_helper = '\000' <repeats 31 times>, use_ssl = 1, redirect_to_https = 0}, alpn = '\000' <repeats 23 times>, desc = {sockfd = 94, filefd = 94}, wsistate = 268435737,
wsistate_pre_close = 0, position_in_fds_table = 2, chunk_remaining = 0, flags = 1, cache_secs = 0, bugcatcher = 0, hdr_parsing_completed = 1, mux_substream = 0, upgraded_to_http2 = 0, mux_stream_immortal = 0,
h2_stream_carries_ws = 0, h2_stream_carries_sse = 0, h2_acked_settings = 0, seen_nonpseudoheader = 0, listener = 0, pf_packet = 0, do_broadcast = 0, user_space_externally_allocated = 0,
socket_is_permanently_unusable = 0, rxflow_change_to = 1, conn_stat_done = 0, cache_reuse = 0, cache_revalidate = 0, cache_intermediaries = 0, favoured_pollin = 0, sending_chunked = 0, interpreting = 0,
already_did_cce = 0, told_user_closed = 0, told_event_loop_closed = 0, waiting_to_send_close_frame = 0, close_needs_ack = 0, ipv6 = 0, parent_pending_cb_on_writable = 0, cgi_stdout_zero_length = 0,
seen_zero_length_recv = 0, rxflow_will_be_applied = 0, event_pipe = 0, handling_404 = 0, protocol_bind_balance = 1, unix_skt = 0, close_when_buffered_out_drained = 0, h1_ws_proxied = 0, proxied_ws_parent = 0,
do_bind = 0, validity_hup = 1, skip_fallback = 0, file_desc = 0, conn_validity_wakesuspend = 0, dns_reachability = 0, could_have_pending = 0, outer_will_close = 0, shadow = 0, tls_borrowed = 1,
tls_borrowed_hs = 0, tls_read_wanted_write = 0, do_ws = 1, chunked = 0, client_rx_avail = 0, client_http_body_pending = 0, transaction_from_pipeline_queue = 0, keepalive_active = 1, keepalive_rejected = 0,
redirected_to_get = 0, client_pipeline = 0, client_h2_alpn = 0, client_mux_substream = 0, client_mux_migrated = 0, client_subsequent_mime_part = 0, client_no_follow_redirect = 0,
client_suppress_CONNECTION_ERROR = 0, tls_session_reused = 0, perf_done = 0, close_is_redirect = 0, client_mux_substream_was = 0, ocport = 443, c_port = 443, conn_port = 443, retry = 0, keep_warm_secs = 5,
lws_rx_parse_state = 20 '\024', rx_frame_type = 0 '\000', pending_timeout = 4 '\004', tsi = 0 '\000', protocol_interpret_idx = 0 '\000', redirects = 0 '\000', rxflow_bitmap = 0 '\000',
bound_vhost_index = 0 '\000', lsp_channel = 0 '\000', chunk_parser = 0 '\000', addrinfo_idx = 0 '\000', sys_tls_client_cert = 0 '\000', c_pri = 0 '\000', af = 0 '\000', reason_bf = 0 '\000', peer_route_uidx = 1,
immortal_substream_count = 0 '\000', handling_pollout = 0 '\000', leave_pollout_active = 0 '\000'}
A bit more info: it is bombing out in ops-ws.c at line 1233:
	if (pending) {
		if (lws_is_ws_with_ext(wsi))
			pending = pending > wsi->ws->rx_ubuf_alloc ?
				wsi->ws->rx_ubuf_alloc : pending;
		else
			pending = pending > wsi->a.context->pt_serv_buf_size ?
				wsi->a.context->pt_serv_buf_size : pending;

		if (--sanity)
			goto read;
		else
			/*
			 * Something has gone wrong, we are spinning...
			 * let's bail on this connection
			 */
			return LWS_HPI_RET_PLEASE_CLOSE_ME;
	}
I am not sure exactly what "we are spinning" means -- any help here?
Even if my code does nothing but log a message on each incoming binary frame (without even processing them), I still get this event/error for the specific server scenarios I mentioned above (and not for the others).
OK, I am getting further in understanding this problem. First of all, I mischaracterized the problem above -- it has entirely to do with the use of SSL over a websocket (wss). I experience it every time when using wss as an lws client, and never when using unencrypted TCP, i.e. ws.
From looking at the code, I can see the problem comes here in the rops_handle_POLLIN_ws function. Having done one read operation in response to the poll event, we call this function to see if there is more data to read.
As we can see here, this always returns 0 in the case of an unencrypted websocket connection (TCP). Therefore with TCP we always do one read per poll, and everything works fine.
However, in the case of wss, if SSL_pending returns a non-zero number of pending bytes, then lws loops and does another read. After the second read it checks again, and if there is still more data to read it reads again, and so on, until it reaches a limit of 10 reads within the same poll event. At that point it returns LWS_HPI_RET_PLEASE_CLOSE_ME, which delivers the LWS_CALLBACK_WS_CLIENT_DROP_PROTOCOL event and closes the socket.
The implication is that in a case like mine, with a server returning a large number of small frames very rapidly (again, this is a streaming audio application), this limit is easily reached and the websocket is closed.
I tested by modifying the code in two different ways, each of which eliminated the problem:
Either of those changes seemed to fix my problem -- the full audio stream is heard and the websocket is not closed.
I don't necessarily consider either of those options a good solution, though, without a better understanding of why the current code is written this way. The assumption that all pending data can be read within 10 tries seems perilous to me, given that the server's sending rate may be quite high. Perhaps I am missing something? Surely others have used lws as a client against high-performance streaming servers -- is there some option I am not setting, or some issue in my code, that surfaces this problem in my scenario?
I'd really love some thoughts from the lws team here..
It's like this because on some platforms, loss of the network connection to the peer manifests as endless 0-length receives.
On other platforms, a 0-length TLS receive is a legal event unrelated to loss of network connection.
So, in an attempt to deal with both, there is a sanity check on how often it is spinning, recovering from what would otherwise be an endless loop by dropping the connection.
It sounds like you could take the approach of resetting sanity to 10 (or whatever) on seeing a positive amount of received data, and solve it that way... do you want to try that and propose a patch?
OK, interesting! I did log the length received each time, and I did see an occasional zero-length read, but infrequently -- not often enough to be the cause, I don't think. I will do some more investigating.
I'd be interested in testing this as well as we have an application connecting to multiple servers all running libwebsocket.
We are seeing something similar, with unexplained LWS_CALLBACK_WS_CLIENT_DROP_PROTOCOL happening sporadically.
We have tweaked the timeout poll intervals, and that anecdotally seems to have mitigated, but not eliminated, the issue.
We too are sending many small messages over wss.
I've been using lws for years now, and just today I am having a strange problem connecting as a websocket client to a remote server. My code, as client, is streaming audio to that server and that server is also streaming audio to me. Roughly 2 seconds after connecting to this server I consistently see a LWS_CALLBACK_WS_CLIENT_DROP_PROTOCOL and the connection is dropped from my side.
I am using lws 4.3.3, and I have a full debug log here.
The last bit of the log leading up to receiving the LWS_CALLBACK_WS_CLIENT_DROP_PROTOCOL looks like this
I've connected with the same application to other servers, and I am trying to gain some insight into why this one fails and lws closes the connection. I'd appreciate any thoughts on what I might be doing wrong... Thanks!