icing / mod_h2

HTTP/2 module for Apache httpd
https://icing.github.io/mod_h2/
Apache License 2.0
256 stars 41 forks source link

infinite loop on windows graceful restart with proxy websocket #277

Open nono303 opened 6 months ago

nono303 commented 6 months ago

Hi @icing I have a bug with mod_h2 and "proxyfied" websockets when restarting the service (graceful I suppose) on Windows This problem reminds me of https://bz.apache.org/bugzilla/show_bug.cgi?id=65180 but is not fully identical.

httpd 2.4.58 self compiled vs17 x64 mod_h2 2.0.26

ProtocolsHonorOrder On Protocols h2 h2c http/1.1 acme-tls/1 H2Upgrade On H2Direct On H2Push On H2EarlyHints On H2ModernTLSOnly On H2WebSockets Off # as unsupported on Windows H2WindowSize 1460000000 H2StreamMaxMemSize 1460000000 H2MinWorkers 8 H2MaxWorkers 100 H2MaxSessionStreams 100 H2MaxDataFrameLen 0

  1. I have 3 vhosts configured (grafana, gitlab and mattermost) like this (with different ports...)

    ProxyPass / http://127.0.0.1:7000/ upgrade=websocket nocanon retry=0 keepalive=On
    ProxyPassReverse / http://127.0.0.1:7000/

    Note: I did the test described below by deactivating nocanon retry=0 keepalive=On properties one by one but that doesn't change anything.

  2. I start httpd and launch my browser tabs which requires a websocket connection

    • ✅ httpd fork is ~100% idle
    • F12 101 Switching Protocols for all and I see the exchanges in the console
      wss://gitlab.mydomain.com/-/cable
      wss://mattermost.mydomain.com/api/v4/websocket?connection_id=&sequence_number=0
      wss://grafana.mydomain.com/api/live/ws  
  3. I restart httpd (httpd -k restart) while still having my 3 tabs open (websocket connected)

    • ⚠️ the newly created httpd fork is a ~25% sys and stay as it
  4. I attach debugger and find an infinite loop on the 3 threads on this stack:

    ntdll!NtDeviceIoControlFile+0x14
    MSWSOCK!MSAFD_WSPPoll+0x30e
    MSWSOCK!WSPIoctl+0xed11()
    WS2_32!WSAIoctl+0x1c7()
    WS2_32!WSAPoll+0x22f
    libapr_1!apr_winapi_WSAPoll+0x1b(pollfd * fdArray = 0x1778f464738) [C:\sdk\src\apr\include\arch\win32\apr_arch_misc.h @ 477]
    libapr_1!impl_pollset_poll+0xc5(apr_pollset_t * pollset = 0x1778f4646c0, __int64 timeout = 0, int * num = 0x134bfffb78 : 0, apr_pollfd_t * * descriptors = 0x134bfffb60, apr_pollfd_t * * descriptors = 0x134bfffb60) [C:\sdk\src\apr\poll\unix\poll.c @ 263]
    mod_http2!mplx_pollset_poll+0x136(h2_mplx * m = 0x1778f44c7d8, __int64 timeout = 0, void * on_ctx = 0x1778f44c160, void (__cdecl*)(void *,h2_stream *) on_stream_input = 0x44894c20244c894c : 0x44894c20244c894c, void (__cdecl*)(void *,h2_stream *) on_stream_output =    0x44894c20244c894c : 0x44894c20244c894c) [C:\sdk\src\mod_h2\mod_http2\h2_mplx.c @ 1210]
    mod_http2!h2_mplx_c1_poll+0x40(h2_mplx * m = 0x1778f44c7d8) [C:\sdk\src\mod_h2\mod_http2\h2_mplx.c @ 637]
    mod_http2!h2_session_process+0xce3(h2_session * session = 0x1778f44c160, int async = 610044232) [C:\sdk\src\mod_h2\mod_http2\h2_session.c @ 1911]
    mod_http2!h2_c1_run+0xaf(conn_rec * c = 0x1778f5004d8) [C:\sdk\src\mod_h2\mod_http2\h2_c1.c @ 132]
    mod_http2!h2_c1_hook_process_connection+0x4e3(conn_rec * c = 0x1778f5004d8) [C:\sdk\src\mod_h2\mod_http2\h2_c1.c @ 279]
    libhttpd!ap_run_process_connection+0x35(conn_rec * c = 0x1778f5004d8) [C:\sdk\src\httpd\server\connection.c @ 42]
    libhttpd!ap_process_connection+0x57(void * csd = 0x1778f5002c0) [C:\sdk\src\httpd\server\connection.c @ 217]
    libhttpd!worker_main+0x3a8(void * thread_num_val = 0xc5) [C:\sdk\src\httpd\server\mpm\winnt\child.c @ 846]
    KERNEL32!BaseThreadInitThunk+0x1d()
    ntdll!RtlUserThreadStart+0x28

    ...and to be clear about it...

    diff --git "a/mod_http2/h2_mplx.c" "b/mod_http2/h2_mplx.c"
    index 9f9ce80..1823d5c 100644
    --- "a/mod_http2/h2_mplx.c"
    +++ "b/mod_http2/h2_mplx.c"
    @@ -1207,6 +1207,7 @@ static apr_status_t mplx_pollset_poll(h2_mplx *m, apr_interval_time_t timeout,
    
             H2_MPLX_LEAVE(m);
             rv = apr_pollset_poll(m->pollset, timeout >= 0? timeout : -1, &nresults, &results);
    +           ap_log_cerror(APLOG_MARK, APLOG_ERR, rv, m->c1, APLOGNO(10310) H2_MPLX_MSG(m, "###### "));
             H2_MPLX_ENTER_ALWAYS(m);
             if (APR_STATUS_IS_EINTR(rv) && m->shutdown) {
                 if (!m->aborted) {

    indeed, log show an infinite loop after graceful restart..

    So

  5. did I configure something wrong? (If you need other information about my configuration, I will give it to you)

  6. if we need to investigate further, tell me what to do

  7. ⚠️ by starting (httpd -k start) with only changing to H2Upgrade Off, I directly fall into the infinite loop

icing commented 6 months ago

I am a bit confused. Does the browser make WebSocket connections over HTTP/2? It should not, if the server does not support it. Is this another request that produces the infinite loop? Strange.

Can you produce a log with LogLevel http2:trace2 on such a situation?

nono303 commented 6 months ago

I am a bit confused

same for me cause my big picture understanding is that in this situation, h2_switch.c would have to decline and forward to mod_proxy_wstunnel....

Does the browser make WebSocket connections over HTTP/2

no, over HTTP/1.1 2024-04-02 17:52:06.551 proxy-server "GET /grafana/api/live/ws HTTP/1.1" 101 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0" sI:2676 sO:->0:-:- sT:3657 tI:- tO:24711096 ka:0-

Can you produce a log with LogLevel http2:trace2 on such a situation?

just have this relevant: 2024-04-02 17:52:06.551648 grafana.mydomain.com 127.0.0.1:50694 ZgwppqjK1zgVxTggMqo8iAAAAMM http2:debug h2_switch.c(92) pid:13140 tid:4048 ka:0 [AH03085: upgrade without HTTP2-Settings declined] full trace after a graceful restart: http2-trace2.txt

nono303 commented 6 months ago

Hi @icing, Continuing digging on it…

To be clear:

But as you said, I’m not sure that the infinite loop is originated by a gracefully restarted websocket cnx! As server-status scoreboard doesn’t show tid in mpm_winnt mode (only pid of main process) and as I don’t see issue in scoreboard nor broken cnx in browser network call stack, I suspect to have some old connection (from previous httpd forked process before graceful restart) to be stuck

As it’s difficult to work on Windows with tracing on a “kill & up process fork” and as I’m not fluent with APR, would it be possible to give a tmp patch (better than mine... easy ;) to trace which request are stuck in the infinite loop? The idea is to quickly have maximum of information (tid, request uri rtime, etc.) when it's stuck in do while greater than n time (like 1000 or whatever) and exit.

icing commented 5 months ago

If mod_http2 is indeed polling in a loop, as your stacktrace indicates, something weird is going on. The trace is a call from h2_session.c:1911 where the session is in state H2_SESSION_ST_BUSY. When the poll returns, it calls h2_c1_read(session). That call should either get data from the client or move the session to another state.

I think we really need a log from such a case with LogLevel http2:debug to see what state changes the HTTP/2 session does.

nono303 commented 4 months ago

Hi @icing,I'm back on it Yesterday I put the http2:debug logs and reactivated the configuration Protocols h2 http/1.1 acme-tls/1 Surprise! No infinite loop has been triggered in the last 12 hours (and the usage /throughput is globally the same)

Originally, just had to wait few second after a graceful to see 1 to 4 thread in infinite loop stae regarding cpu usage

Changes since last month:

I'll keep you posted...

icing commented 4 months ago

Thanks for the update. This is a mysterious one indeed.

nono303 commented 4 months ago

Hi @icing, I think I've caught the loop Use case:

  1. start httpd
  2. launch browser tabs for mattermost, grafana, webmail & mainsite (4 different vhost proxing different backend expect mainsite)
    [AH03079: h2_session(10292-0,INIT,0): started on mattermost:443]
    [AH03079: h2_session(10292-1,INIT,0): started on webmail:443]
    [AH03079: h2_session(10292-2,INIT,0): started on grafana:443]
    [AH03079: h2_session(10292-3,INIT,0): started on mainsite:443]
    [AH03079: h2_session(10292-5,INIT,0): started on mattermost:443]
    [AH03079: h2_session(10292-6,INIT,0): started on mattermost:443]
    [AH03079: h2_session(10292-7,INIT,0): started on grafana:443]
    [AH03079: h2_session(10292-8,INIT,0): started on mainsite:443]
    • all are well exchanging in h2
    • mattermost & mrafana have a "101" upgrade websocket http1./1 active

      mattermost & mrafana proxy config

      RequestHeader set X-Forwarded-Proto "https"
      RequestHeader set X-Forwarded-Ssl "on"
      ProxyPass / http://backend/ upgrade=websocket retry=0 keepalive=On timeout=10
      ProxyPassReverse / http://backend /
  3. few seconds later, a worker (Windows thread) is 100% cpu for ~1 minute
  4. waiting it to stop looping
  5. kill browser
  6. wait 5 minutes
    • Server-status : http/1.1 websocket workers still in "W" Sending Reply
    • Netstat : cnx httpd <> mattermost still in close wait
      [httpd.exe] TCP    httpd.ip:59709     mattermost.ip:8065    CLOSE_WAIT
      [httpd.exe] TCP    httpd.ip:59765     mattermost.ip:8065    CLOSE_WAIT
      [httpd.exe] TCP    httpd.ip:59795     mattermost.ip:8065    CLOSE_WAIT
      [httpd.exe] TCP    httpd.ip:59798     mattermost.ip:8065    CLOSE_WAIT
      [httpd.exe] TCP    httpd.ip:59799     mattermost.ip:8065    CLOSE_WAIT
      [httpd.exe] TCP    httpd.ip:59802     mattermost.ip:8065    CLOSE_WAIT

      Here is the full 2 minutes httpd-error.log Hard to read for me but... We see the 4 websocket connection declined [AH03085: upgrade without HTTP2-Settings declined] So hypothesis that I don't control... Is there any chance that a pointer of these declined connections is still referenced in the h2_session and produce a "busy" zombie I say this after reading the lines below which correspond almost to the moment when the infinite loop ended

      2024-05-28 08:02:22.967261 - client_ip:60500 - http2:debug h2_session.c(1393) pid:10292 tid:3784 ka:0 [AH03078: h2_session(10292-5,BUSY,0): transit [BUSY] -- input exhausted, no streams --> [IDLE]]
      2024-05-28 08:02:23.429743 - client_ip:60500 - http2:debug h2_session.c(1393) pid:10292 tid:3784 ka:0 [AH03078: h2_session(10292-5,BUSY,0): transit [BUSY] -- input exhausted, no streams --> [IDLE]]
      2024-05-28 08:02:23.451635 - client_ip:60500 - http2:debug h2_session.c(1393) pid:10292 tid:3784 ka:0 [AH03078: h2_session(10292-5,BUSY,0): transit [BUSY] -- input exhausted, no streams --> [IDLE]]
      2024-05-28 08:02:23.527443 - client_ip:60500 - http2:debug h2_session.c(1393) pid:10292 tid:3784 ka:0 [AH03078: h2_session(10292-5,BUSY,0): transit [BUSY] -- input exhausted, no streams --> [IDLE]]

      That's the context. If you need more traces, information... just tell me