icing / mod_h2

HTTP/2 module for Apache httpd
https://icing.github.io/mod_h2/
Apache License 2.0

randomly failing requests with http2 and server sent events #231

Closed daum3ns closed 1 year ago

daum3ns commented 2 years ago

Hello,

We are facing strange behavior where some HTTP/2 requests fail with the following error:

2022-04-26 17:23:41.384625 http2:debug 172.22.1.196:55080 AH03466: h2_stream(270-9,HALF_CLOSED_REMOTE): no response, RST_STREAM

Unfortunately it seems to be completely random when and which requests fail. We can provoke it by clicking around in the web application, but we were not able to identify a reliable way to reproduce it. Even exactly identical requests from the client sometimes work and sometimes fail, which makes it very difficult to compare "good" cases with "bad" cases.

httpd is version 2.4.53 with mod_http2 v1.15.26 (Apache is acting as a reverse proxy in front of a backend application server).

What we can tell so far is:

We have tested different H2 directives (increasing workers, window size and timeouts), but none of them seem to have an impact on the issue.

To track the issue down we have enabled tracing in the module:

2022-04-26 17:23:41.384572 http2:trace1 172.22.1.196:55080 AH03078: h2_session(270,BUSY,10): transit [WAIT] -- stream change --> [BUSY]
2022-04-26 17:23:41.384578 http2:trace1 172.22.1.196:55080 h2_session(270): read, NONBLOCK_READ, mode=0, readbytes=327680
2022-04-26 17:23:41.384596 http2:trace1 172.22.1.196:55080 h2_session(270): read
2022-04-26 17:23:41.384600 http2:trace2 172.22.1.196:55080 h2_mplx(270): dispatch events
2022-04-26 17:23:41.384604 http2:trace2 172.22.1.196:55080 h2_stream(270-9,HALF_CLOSED_REMOTE): on_resume
2022-04-26 17:23:41.384608 http2:trace2 172.22.1.196:55080 h2_stream(270-9,HALF_CLOSED_REMOTE): out-buffer(add_buffered_data())
2022-04-26 17:23:41.384611 http2:trace2 172.22.1.196:55080 h2_stream(270-9,HALF_CLOSED_REMOTE): out-buffer(pre())
2022-04-26 17:23:41.384613 http2:trace2 172.22.1.196:55080 beam(270-9,output,closed=1,aborted=1,empty=1,buf=0): pre read output
2022-04-26 17:23:41.384617 http2:trace2 172.22.1.196:55080 h2_stream(270-9,HALF_CLOSED_REMOTE): out-buffer(post())
2022-04-26 17:23:41.384619 http2:trace2 172.22.1.196:55080 beam(270-9,output,closed=1,aborted=1,empty=1,buf=0): post read output
2022-04-26 17:23:41.384622 http2:trace2 172.22.1.196:55080 h2_stream(270-9,HALF_CLOSED_REMOTE): prepared len=0, eos=0
2022-04-26 17:23:41.384625 http2:debug 172.22.1.196:55080 AH03466: h2_stream(270-9,HALF_CLOSED_REMOTE): no response, RST_STREAM
2022-04-26 17:23:41.384627 http2:trace1 172.22.1.196:55080 h2_stream(270-9,HALF_CLOSED_REMOTE): reset, error=1
2022-04-26 17:23:41.384630 http2:trace2 172.22.1.196:55080 h2_stream(270-9,HALF_CLOSED_REMOTE): dispatch event 2
2022-04-26 17:23:41.384633 http2:trace1 172.22.1.196:55080 h2_stream(270-9,HALF_CLOSED_REMOTE): transit to [CLOSED]
2022-04-26 17:23:41.384636 http2:trace1 172.22.1.196:55080 h2_stream(270-9,CLOSED): closing input
2022-04-26 17:23:41.384639 http2:trace2 172.22.1.196:55080 h2_stream(270-9,CLOSED): entered state
2022-04-26 17:23:41.384643 http2:debug 172.22.1.196:55080 AH03068: h2_session(270,BUSY,9): sent FRAME[RST_STREAM[length=4, flags=0, stream=9]], frames=18/13 (r/s)
2022-04-26 17:23:41.384648 http2:trace1 172.22.1.196:55080 h2_stream(270-9,CLOSED): send frame 3, eos=0
2022-04-26 17:23:41.384651 http2:debug 172.22.1.196:55080 AH03065: h2_stream(270-9,CLOSED): closing with err=1 protocol error
2022-04-26 17:23:41.384654 http2:trace1 172.22.1.196:55080 h2_stream(270-9,CLOSED): reset, error=1
2022-04-26 17:23:41.384656 http2:trace2 172.22.1.196:55080 h2_stream(270-9,CLOSED): dispatch event 2
2022-04-26 17:23:41.384659 http2:trace2 172.22.1.196:55080 h2_stream(270-9,CLOSED): non-state event 2
2022-04-26 17:23:41.384664 http2:trace2 172.22.1.196:55080 h2_session(270)-out: h2eos heap[13] flush 
2022-04-26 17:23:41.384669 http2:trace2 172.22.1.196:55080 h2_stream(270-9,CLOSED): dispatch event 3
2022-04-26 17:23:41.384672 http2:trace1 172.22.1.196:55080 h2_stream(270-9,CLOSED): transit to [CLEANUP]
2022-04-26 17:23:41.384675 http2:trace2 172.22.1.196:55080 h2_stream(270-9,CLEANUP): entered state
2022-04-26 17:23:41.384677 http2:trace2 172.22.1.196:55080 h2_stream(270-9,CLEANUP): cleanup

What I found from this:

The log line

2022-04-26 17:23:41.384613 http2:trace2 172.22.1.196:55080 beam(270-9,output,closed=1,aborted=1,empty=1,buf=0): pre read output

originates from the function h2_stream_out_prepare here, which means that by the time h2_beam_receive (the next line in the linked code) is called, the aborted=1 flag is already set. Looking at the code there, I think this makes h2_beam_receive return the status APR_ECONNABORTED. As far as I understand, h2_stream_out_prepare then returns this status to its caller, on_stream_resume in h2_session.c (the caller is visible in the trace log as "on_resume"). The condition there is then false, because otherwise we would see more output in the trace log (output that originates from the add_buffered_data function in h2_stream.c).

So I think all of this means that the next two conditions are true (see the minimal model sketched after this excerpt):

    else if (status != APR_EAGAIN) {
        /* we have DATA to send */
        if (!stream->has_response) {
            /* but no response */
1. the outer check (status != APR_EAGAIN), because of the returned APR_ECONNABORTED;
2. the inner check (!stream->has_response), because h2_beam_receive aborted immediately, so of course there is no response.
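
To make my reading of this path concrete, here is a minimal, self-contained toy model of the two checks. It is NOT the actual module source; the types, names and status values are simplified stand-ins for h2_beam_receive and the branch in on_stream_resume:

    /* Toy model only, NOT mod_http2 source: simplified stand-ins for
     * h2_beam_receive() and the branch in on_stream_resume(). */
    #include <stdio.h>

    enum status { OK = 0, TRY_AGAIN, ABORTED };   /* stand-ins for APR codes */

    typedef struct {
        int aborted;        /* beam(...,aborted=1,...) in the trace */
        int has_response;   /* response headers already available?  */
    } stream_model;

    /* stand-in for h2_beam_receive(): an aborted beam fails immediately */
    static enum status beam_receive(const stream_model *s)
    {
        return s->aborted ? ABORTED : OK;
    }

    /* stand-in for the branch above: any status other than "try again"
     * combined with a missing response ends in RST_STREAM (AH03466) */
    static const char *resume_outcome(const stream_model *s)
    {
        enum status st = beam_receive(s);
        if (st != TRY_AGAIN && !s->has_response) {
            return "no response, RST_STREAM";
        }
        return "continue";
    }

    int main(void)
    {
        stream_model bad  = { 1, 0 };   /* like stream 270-9 (aborted=1)       */
        stream_model good = { 0, 1 };   /* like stream 270-13 (good case below) */
        printf("bad:  %s\n", resume_outcome(&bad));
        printf("good: %s\n", resume_outcome(&good));
        return 0;
    }

In this model only the stream with the already-aborted beam ends up in the RST_STREAM branch, which matches what the two traces show.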

So for me the main question is: why is aborted=1 set so early? And what does "we have DATA to send, but no response" mean here?

Maybe the condition status != APR_EAGAIN is wrong, or maybe an APR_ECONNABORTED status at this point should never be able to happen in the first place?

For completeness, here is a trace log of a "good" case. Comparing the two, I can't really see many differences (except the aborted=1 flag):

2022-04-26 17:23:41.457199 http2:trace1 172.22.1.196:55080 AH03078: h2_session(270,BUSY,7): transit [WAIT] -- stream change --> [BUSY]
2022-04-26 17:23:41.457206 http2:trace1 172.22.1.196:55080 h2_session(270): read, NONBLOCK_READ, mode=0, readbytes=229376
2022-04-26 17:23:41.457221 http2:trace1 172.22.1.196:55080 h2_session(270): read
2022-04-26 17:23:41.457224 http2:trace2 172.22.1.196:55080 h2_mplx(270): dispatch events
2022-04-26 17:23:41.457228 http2:trace2 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): on_resume
2022-04-26 17:23:41.457232 http2:trace2 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): out-buffer(add_buffered_data())
2022-04-26 17:23:41.457235 http2:trace2 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): out-buffer(pre())
2022-04-26 17:23:41.457238 http2:trace2 172.22.1.196:55080 beam(270-13,output,closed=0,aborted=0,empty=0,buf=866): pre read output
2022-04-26 17:23:41.457246 http2:trace2 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): out-buffer(post(H2HEADERS BEAM[343] FLUSH))
2022-04-26 17:23:41.457249 http2:trace2 172.22.1.196:55080 beam(270-13,output,closed=0,aborted=0,empty=1,buf=0): post read output
2022-04-26 17:23:41.457252 http2:trace2 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): out-buffer(add_buffered_data(H2HEADERS BEAM[343] FLUSH))
2022-04-26 17:23:41.457255 http2:trace1 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): prep, -> response 200
2022-04-26 17:23:41.457257 http2:trace1 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): prepare, no data
2022-04-26 17:23:41.457260 http2:trace2 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): prepared len=0, eos=0
2022-04-26 17:23:41.457262 http2:trace2 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): on_headers
2022-04-26 17:23:41.457265 http2:debug 172.22.1.196:55080 AH03073: h2_stream(270-13,HALF_CLOSED_REMOTE): submit response 200, REMOTE_WINDOW_SIZE=6291456
2022-04-26 17:23:41.457279 http2:trace2 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): out-buffer(add_buffered_data(BEAM[343] FLUSH))
2022-04-26 17:23:41.457282 http2:trace2 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): out-buffer(pre(BEAM[343]))
2022-04-26 17:23:41.457285 http2:trace2 172.22.1.196:55080 beam(270-13,output,closed=0,aborted=0,empty=1,buf=0): pre read output
2022-04-26 17:23:41.457288 http2:trace2 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): out-buffer(post(BEAM[343]))
2022-04-26 17:23:41.457290 http2:trace2 172.22.1.196:55080 beam(270-13,output,closed=0,aborted=0,empty=1,buf=0): post read output
2022-04-26 17:23:41.457293 http2:trace1 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): prepare, len=343 eos=0
2022-04-26 17:23:41.457296 http2:trace2 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): prepared len=343, eos=0
2022-04-26 17:23:41.457298 http2:debug 172.22.1.196:55080 AH02936: h2_stream(270-13,HALF_CLOSED_REMOTE): resumed
2022-04-26 17:23:41.457309 http2:debug 172.22.1.196:55080 AH03068: h2_session(270,BUSY,7): sent FRAME[HEADERS[length=33, hend=1, stream=13, eos=0]], frames=18/19 (r/s)
2022-04-26 17:23:41.457312 http2:trace1 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): send frame 1, eos=0
2022-04-26 17:23:41.457318 http2:trace2 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): out-buffer(add_buffered_data(BEAM[343]))
2022-04-26 17:23:41.457323 http2:trace2 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): out-buffer(pre(BEAM[343]))
2022-04-26 17:23:41.457326 http2:trace2 172.22.1.196:55080 beam(270-13,output,closed=0,aborted=0,empty=1,buf=0): pre read output
2022-04-26 17:23:41.457329 http2:trace2 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): out-buffer(post(BEAM[343]))
2022-04-26 17:23:41.457331 http2:trace2 172.22.1.196:55080 beam(270-13,output,closed=0,aborted=0,empty=1,buf=0): post read output
2022-04-26 17:23:41.457334 http2:trace1 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): prepare, len=343 eos=0
2022-04-26 17:23:41.457336 http2:trace2 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): prepared no_copy, len=343, eos=0
2022-04-26 17:23:41.457340 http2:trace2 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): send_data_cb for 343 bytes
2022-04-26 17:23:41.457342 http2:trace2 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): read_to, len=343 eos=0
2022-04-26 17:23:41.457346 http2:debug 172.22.1.196:55080 AH03068: h2_session(270,BUSY,7): sent FRAME[DATA[length=343, flags=0, stream=13, padlen=0]], frames=18/20 (r/s)
2022-04-26 17:23:41.457348 http2:trace1 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): send frame 0, eos=0
2022-04-26 17:23:41.457352 http2:trace2 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): out-buffer(add_buffered_data())
2022-04-26 17:23:41.457354 http2:trace2 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): out-buffer(pre())
2022-04-26 17:23:41.457357 http2:trace2 172.22.1.196:55080 beam(270-13,output,closed=0,aborted=0,empty=1,buf=0): pre read output
2022-04-26 17:23:41.457360 http2:trace2 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): out-buffer(post())
2022-04-26 17:23:41.457362 http2:trace2 172.22.1.196:55080 beam(270-13,output,closed=0,aborted=0,empty=1,buf=0): post read output
2022-04-26 17:23:41.457364 http2:trace1 172.22.1.196:55080 h2_stream(270-13,HALF_CLOSED_REMOTE): prepare, no data
2022-04-26 17:23:41.457367 http2:debug 172.22.1.196:55080 AH03071: h2_stream(270-13,HALF_CLOSED_REMOTE): suspending
2022-04-26 17:23:41.457371 http2:trace2 172.22.1.196:55080 h2_session(270)-out: heap[394] flush 
2022-04-26 17:23:41.457403 http2:trace2 172.22.1.196:55080 bb_dump(270-13): h2_task send_out(EOS)
2022-04-26 17:23:41.457408 http2:trace2 172.22.1.196:55080 beam(270-13,output,closed=0,aborted=0,empty=1,buf=0): send_out(before)
2022-04-26 17:23:41.457413 http2:trace2 172.22.1.196:55080 beam(270-13,output,closed=1,aborted=0,empty=0,buf=0): send_out(after)
2022-04-26 17:23:41.457416 http2:trace2 172.22.1.196:55080 h2_task(270-13): send_out done
2022-04-26 17:23:41.457418 http2:trace2 172.22.1.196:55080 h2_secondary_out(270-13): buffered=1

(the stream then continues)

Any ideas about what the issue could be here, or what we can do to track it down further, would be helpful. Maybe you can also read more out of these trace logs with your detailed knowledge of the module.

The fact that it is not cleanly reproducible, and that everything works fine without server-sent events, makes me think of a race condition or something similar happening when the backend sends these events. On the other hand, everything works with HTTP/1.1, which is a clear indicator that mod_http2 is causing the error.

Also, it's very strange that the server-sent events somehow have an impact on other requests made by the client, because the request that actually fails is not an SSE response but a "normal" HTTP/2 request from the client.

The trace log is huge; I tried to paste only the necessary parts here, but let me know if anything is missing.

Thanks a lot for your help!

icing commented 2 years ago

Thanks for the analysis and logs. Indeed, the question seems to be why the stream output beam has been aborted before any response was produced.

Do you see any mood update entries in your logs at trace1 level? An h2 connection might tear down streams if it thinks the client is not behaving well and is consuming too many workers. That would cause the output beam to become aborted.
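
Very roughly, the mechanism is this (a sketch with made-up numbers, not the actual h2_mplx code): the connection's worker limit gets lowered, and streams that are already running above the new limit get torn down, which is what aborts their output beams.

    /* Sketch only (illustrative values, not mod_http2 internals): lowering
     * the per-connection worker limit and tearing down streams above it. */
    #include <stdio.h>

    int main(void)
    {
        int limit_active = 8;   /* current per-connection worker limit   */
        int tasks_active = 6;   /* streams currently assigned to workers */

        limit_active /= 2;                    /* "mood update, decreasing worker limit" */
        while (tasks_active > limit_active) { /* streams above the new limit are        */
            tasks_active--;                   /* unscheduled; their output beams        */
        }                                     /* then show up as aborted=1              */
        printf("limit=%d, active=%d\n", limit_active, tasks_active);
        return 0;
    }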

daum3ns commented 2 years ago

Hello, yes, I can see such log lines (270-9 failed, 270-11 succeeded):

2022-04-26 17:23:41.259032 http2:trace1 172.22.1.196:55080 h2_mplx(270): mood update, decreasing worker limit to 4
2022-04-26 17:23:41.259037 http2:trace2 172.22.1.196:55080 h2_mplx(270-11): unschedule, resetting task for redo later
2022-04-26 17:23:41.259040 http2:trace2 172.22.1.196:55080 h2_mplx(270-9): unschedule, resetting task for redo later

I also see log lines where the worker limit is increased, not decreased:

2022-04-26 17:23:41.457494 http2:trace1 172.22.1.196:55080 h2_mplx(270): mood update, increasing worker limit to 8
2022-04-26 17:23:41.961517 http2:trace1 172.22.1.196:55080 h2_mplx(270): mood update, increasing worker limit to 16
2022-04-26 17:23:42.301421 http2:trace1 172.22.1.196:55080 h2_mplx(270): mood update, increasing worker limit to 32
2022-04-26 17:23:42.446656 http2:trace1 172.22.1.196:55080 h2_mplx(270): mood update, increasing worker limit to 37

All these lines appear in the window between reading/receiving the requests and the moment, about 0.2 seconds later, when they are being "processed".

However, would this mean that we need to increase H2MaxWorkers or H2MaxSessionStreams? I'm sure we already tried increasing H2MaxWorkers and it didn't help.

icing commented 2 years ago

Thanks for digging these up!

However, would this mean that we need to increase H2MaxWorkers or H2MaxSessionStreams? I'm sure we already tried increasing H2MaxWorkers and it didn't help.

No, that does not really change things. I assume you can build the module from source. If you remove lines 1009-1011 in h2_mplx.c, I hope you will see the problems go away. These are the lines in m_be_annoyed():

    if (m->tasks_active > m->limit_active) {
        status = m_unschedule_slow_tasks(m);
    }
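
To illustrate why removing that block should help (again just a toy model, not the module code): when it fires, a task that has not produced its response yet has its output beam aborted, and on the session side that stream can only end in the "no response, RST_STREAM" branch you traced; with the block gone, the task keeps its worker and answers normally.

    /* Toy model only, not mod_http2 code: effect of the unschedule block on
     * a task that has not produced a response yet. */
    #include <stdio.h>

    typedef struct {
        int has_response;   /* backend response already produced?     */
        int beam_aborted;   /* set when the task is unscheduled/reset */
    } task_model;

    /* models the block above: unscheduling aborts the task's output beam */
    static void be_annoyed(task_model *t, int unschedule_slow_tasks)
    {
        if (unschedule_slow_tasks) {
            t->beam_aborted = 1;
        }
    }

    static const char *stream_outcome(task_model *t)
    {
        if (t->beam_aborted && !t->has_response) {
            return "no response, RST_STREAM";   /* AH03466 */
        }
        t->has_response = 1;                    /* task finishes normally */
        return "response 200";
    }

    int main(void)
    {
        task_model current = { 0, 0 };   /* block present (1.15.26)  */
        task_model patched = { 0, 0 };   /* lines 1009-1011 removed  */
        be_annoyed(&current, 1);
        be_annoyed(&patched, 0);
        printf("current: %s\n", stream_outcome(&current));
        printf("patched: %s\n", stream_outcome(&patched));
        return 0;
    }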
daum3ns commented 2 years ago

Hi,

It works! We have tried this change and I can confirm that the problem is no longer reproducible.

Will this become a permanent change, or is it a fix for our setup only? Thanks very much!

icing commented 2 years ago

I will add this to the server and backport for the next release. Thanks for verifying!