mojolicious / mojo

:sparkles: Mojolicious - Perl real-time web framework
https://mojolicious.org
Artistic License 2.0
2.66k stars 577 forks source link

Prevent potentially misleading error message about "no heartbeat" #2180

Open okurz opened 4 months ago

okurz commented 4 months ago

In the context of github.com/os-autoinst/openQA we observed error messages coming from Mojo::Server::Prefork when otherwise there is no harmful effect observed. As the underlying prefork module should only inform about a potentially erroneous situation but can not know about the effect on any component that uses the prefork module the log message instead should not claim an error but rather only inform. Hence this commit reduces the log level from error to info accordingly.

Related progress issue: https://progress.opensuse.org/issues/138536

guest20 commented 4 months ago

Is there a reason it's not possible for y'all to extended the timeout it with the attribute "heartbeat_timeout" in Mojo::Server::Prefork since your application requires it?

okurz commented 4 months ago

Is there a reason it's not possible for y'all to extended the timeout it with the attribute "heartbeat_timeout" in Mojo::Server::Prefork since your application requires it?

yes, we already use 900s. As far as I understood the evaluation in https://progress.opensuse.org/issues/106759 and https://progress.opensuse.org/issues/138536 regarding how this is used within the context of github.com/os-autoinst/openQA/ we will run into the timeout regardless. Probably something in the implementation is sub-optimal but anyway I still think the message with a log level of "error" is too strict

kraih commented 4 months ago

Why info over warn?

okurz commented 4 months ago

Why info over warn?

I would say same reasoning because I think this component can not know if the condition is harmful to the caller but if you guys prefer "warn" that would also help us.

guest20 commented 4 months ago

Am I misunderstanding what the log message means?

Something blocked the worker for more than the heartbeat timeout and the supervisor has decided to kill that worker, causing anybody with a pending request to that worker to never receive any more bytes in response... Is that the situation this message warns about?

kraih commented 3 months ago

Am I misunderstanding what the log message means?

Something blocked the worker for more than the heartbeat timeout and the supervisor has decided to kill that worker, causing anybody with a pending request to that worker to never receive any more bytes in response... Is that the situation this message warns about?

Yes, it's also in the FAQ. https://docs.mojolicious.org/Mojolicious/Guides/FAQ#What-does-Worker-31842-has-no-heartbeat-50-seconds-restarting-mean

guest20 commented 3 months ago

@kraih Cool, though I don't see how "I'm about to TERM a blocking worker" could ever be part of normal/safe operation?

Is there any way for the dying worker to log the requests it was stuck on when the TERM arrived? The OpenQA ticket looks like it would be a much shorter investigation if that info were in the logs...

Also, would it be worth suggesting moving the blocking code into a Mojo::IOLoop->subprocess or a Minion entry to allow the requests/heartbeats to continue while doing the long-running work?

kraih commented 3 months ago

Is there any way for the dying worker to log the requests it was stuck on when the TERM arrived? The OpenQA ticket looks like it would be a much shorter investigation if that info were in the logs...

You're welcome to try and come up with a proposal for that.