Closed: daum3ns closed this issue 1 year ago
Thanks for the analysis and logs. Indeed, the question seems to be why the stream output beam has been aborted before any response was produced.
Do you see any "mood update" entries in your logs at trace1 level? An h2 connection might tear down streams if it thinks the client is not behaving well and is consuming too many workers. That would cause the output beam to become aborted.
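A self-contained sketch of that idea follows (the struct, the halving rule and the printout are illustrative stand-ins I made up; only the tasks/limit comparison and the name m_unschedule_slow_tasks() correspond to the actual h2_mplx.c code quoted further down in this thread):

#include <stdio.h>

/* Illustrative stand-in for the h2_mplx bookkeeping; not the real struct. */
typedef struct {
    int limit_active;   /* how many workers this connection may use right now */
    int tasks_active;   /* how many streams currently occupy a worker         */
} mplx_sketch;

/* In mod_http2 this resets slow streams "for redo later", which aborts
 * their output beam in the meantime; here it only reports the effect. */
static void m_unschedule_slow_tasks(mplx_sketch *m)
{
    printf("unscheduling %d task(s), limit is now %d\n",
           m->tasks_active - m->limit_active, m->limit_active);
}

/* Assumed shape of a "mood update": the worker limit is reduced, and any
 * tasks above the new limit are unscheduled. */
static void mood_update(mplx_sketch *m)
{
    if (m->limit_active > 1)
        m->limit_active /= 2;               /* "decreasing worker limit to N" */
    if (m->tasks_active > m->limit_active)  /* the check quoted later below   */
        m_unschedule_slow_tasks(m);
}

int main(void)
{
    mplx_sketch m = { .limit_active = 8, .tasks_active = 6 };
    mood_update(&m);   /* limit drops to 4, two tasks get unscheduled */
    return 0;
}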
Hello, yes, I can see such log lines (270-9 failed, 270-11 succeeded):
2022-04-26 17:23:41.259032 http2:trace1 172.22.1.196:55080 h2_mplx(270): mood update, decreasing worker limit to 4
2022-04-26 17:23:41.259037 http2:trace2 172.22.1.196:55080 h2_mplx(270-11): unschedule, resetting task for redo later
2022-04-26 17:23:41.259040 http2:trace2 172.22.1.196:55080 h2_mplx(270-9): unschedule, resetting task for redo later
I also see log lines where the worker limit is increasing, not decreasing:
2022-04-26 17:23:41.457494 http2:trace1 172.22.1.196:55080 h2_mplx(270): mood update, increasing worker limit to 8
2022-04-26 17:23:41.961517 http2:trace1 172.22.1.196:55080 h2_mplx(270): mood update, increasing worker limit to 16
2022-04-26 17:23:42.301421 http2:trace1 172.22.1.196:55080 h2_mplx(270): mood update, increasing worker limit to 32
2022-04-26 17:23:42.446656 http2:trace1 172.22.1.196:55080 h2_mplx(270): mood update, increasing worker limit to 37
All these lines appear between the request being read/received and, about 0.2 seconds later, the point where the requests are being "processed".
However, would this mean that we need to increase H2MaxWorkers or H2MaxSessionStreams? I'm sure we already tried increasing H2MaxWorkers and it didn't help...
Thanks for digging these up!
However, would this mean that we need to increase H2MaxWorkers or H2MaxSessionStreams? I'm sure we already tried increasing H2MaxWorkers and it didn't help...
No, that does not really change things. I assume you can build the module from source. If you remove lines 1009-1011 in h2_mplx.c, I hope you will see the problems go away. These are the lines in m_be_annoyed():
if (m->tasks_active > m->limit_active) {
    status = m_unschedule_slow_tasks(m);
}
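Presumably, with these lines removed a mood update still lowers the connection's worker limit, but streams that already hold a worker are no longer unscheduled and reset, so their output beams are not aborted while a response is still being produced.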
Hi,
it works! We have tried this change and I can confirm that the problem is no longer reproducible.
Will this become a permanent change, or is this a fix for our setup only? Thanks very much!
I will add this to the server and backport for the next release. Thanks for verifying!
Hello,
we face a strange behavior where some HTTP/2 requests fail with the error
Unfortunately, it seems to be completely random when and which requests fail. We can force it by clicking around in the web application, but we were not able to identify a clear way to reproduce it. Even exactly identical requests from the client sometimes work and sometimes fail. This makes it very difficult to compare "good" cases with "bad" cases...
httpd is version 2.4.53 with mod_http2 v1.15.26 (Apache is acting as a reverse proxy to a backend application server).
What we can tell so far is:
We have tested different H2 directives, like increasing workers, window size, and timeouts, but none of these seem to have an impact on the issue...
To track the issue down, we have enabled tracing in the module:
What I found from this is:
The log line
originates from the function h2_stream_out_prepare here, which means that at the time the function h2_beam_receive (the next line in the linked code) is called, the aborted=1 flag is already set. Looking at the code there, I think this results in the function returning the status APR_ECONNABORTED. As far as I understand it, this also means that h2_stream_out_prepare returns this status to its caller, on_stream_resume in h2_session.c (the caller is visible from the "on_resume" trace log entries). This condition is then false, because otherwise we would see more output in the trace log (originating from the add_buffered_data function in h2_stream.c).
So I think all of that means that the next two conditions are true:
So for me, the main question is: why is aborted=1 set so early? What does "we have data to send, but no response" mean?
Maybe the condition status != APR_EAGAIN is wrong, or maybe the case that the status at this point is APR_ECONNABORTED should never happen in the first place? For completeness, here is a trace log of a "good" case; from the comparison I can't really see many differences (except the aborted=1 flag...)
(the stream then continues)
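To summarize my reading of that path, here is a heavily simplified, self-contained paraphrase; the real functions are h2_beam_receive() in h2_bucket_beam.c, h2_stream_out_prepare() in h2_stream.c and on_stream_resume() in h2_session.c, and the types, constants and branching below are stand-ins for illustration, not the actual mod_http2 code:

#include <stdio.h>

/* Placeholder status codes; the real values come from APR. */
enum { SKETCH_OK = 0, SKETCH_EAGAIN = 1, SKETCH_ECONNABORTED = 2 };

/* Stand-in for the output bucket beam of one stream. */
typedef struct {
    int aborted;    /* the aborted=1 flag from the trace log */
    int has_data;   /* buffered data waiting to be sent      */
} beam_sketch;

/* h2_beam_receive(): if the beam is already aborted, the reader gets
 * ECONNABORTED instead of data. */
static int beam_receive(const beam_sketch *beam)
{
    if (beam->aborted)
        return SKETCH_ECONNABORTED;
    return beam->has_data ? SKETCH_OK : SKETCH_EAGAIN;
}

/* on_stream_resume() via h2_stream_out_prepare(): only EAGAIN means
 * "try again later"; an aborted beam with data buffered but no response
 * headers yet ends up in the "we have data to send, but no response"
 * situation seen in the log. */
static void on_stream_resume_sketch(const beam_sketch *beam)
{
    int status = beam_receive(beam);
    if (status == SKETCH_OK)
        printf("send buffered data\n");
    else if (status != SKETCH_EAGAIN)
        printf("stream aborted before any response was produced (status=%d)\n",
               status);
}

int main(void)
{
    beam_sketch beam = { .aborted = 1, .has_data = 1 };
    on_stream_resume_sketch(&beam);
    return 0;
}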
Any ideas about what could be the issue here, or what we can do to track it down further, would be helpful. Also, maybe you can read something more out of these trace logs with your detailed knowledge of the module.
The fact that it is not clearly reproducible, and that everything works fine without server-sent events, somehow makes me think of a race condition or something similar happening when the backend sends these events. On the other hand, it works with HTTP/1.1, so this is a clear indicator that mod_http2 is causing the error.
Also, it's very strange that the server-sent events somehow have an impact on other requests made by the client, because the actual request causing the error is not an SSE response; it's a "normal" HTTP/2 request from the client...
The trace log is huge; I tried to paste only the necessary parts here, but let me know if you are missing something.
Thanks a lot for your help!