Open jonasraoni opened 11 months ago
Code changes will require restarting supervisor, which will also require care on production environments (there must be a way to gracefully restart and avoid interrupting running jobs)
The solution for this is not to restart the supervisor but to restart the workers. Supervisor is a long-running process that monitors whether the workers are running properly. So basically one needs to run the `php artisan queue:restart` command, which gracefully stops each worker (even when there are multiple workers) after it completes its current job, and Supervisor will then restart the worker(s).
Currently we do not have this and `artisan` is not available, but it should not be too difficult to bring an equivalent into our system and the jobs command.
Basically, the point is that supervisor should not be restarted but rather the workers (gracefully), as abruptly restarting supervisor will kill the worker(s) immediately, even in the middle of processing a job, which may cause unexpected behaviour/outcomes.
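For reference, here is a minimal sketch of the signalling mechanism that a graceful restart command can rely on (roughly what Laravel's `queue:restart` does, which a PKP equivalent could mirror). The cache key, function names and loop are illustrative, not PKP's or Laravel's actual code:

```php
<?php
// Illustrative only: roughly how a "restart" signal lets workers finish the
// current job before exiting. Names and structure are simplified.

// The restart command just records a timestamp in a cache shared by all workers.
function signalRestart($cache): void
{
    $cache->forever('queue:restart', time());
}

// Each worker remembers the timestamp it started with and, after finishing
// every job, checks whether a newer one has been written.
function runWorker($cache, $queue): void
{
    $startedWith = $cache->get('queue:restart');

    while (true) {
        if ($job = $queue->pop()) {
            $job->handle(); // the job in progress is always allowed to finish
        } else {
            sleep(3); // nothing queued; wait before polling again
        }

        if ($cache->get('queue:restart') !== $startedWith) {
            exit(0); // Supervisor notices the exit and spawns a fresh worker
        }
    }
}
```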
Yep! Assume supervisor means worker(s) :)
So, I think the solution for now is to force the `max-jobs` setting to always be 1, which will quit the worker after each job, and to update the documentation to state that it only works together with a supervisor-like application.
@touhidurabir @asmecher check if you agree (failed jobs also trigger the quit properly)
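To illustrate the proposal, a stripped-down worker loop under a forced `max-jobs` of 1; this is a sketch of the intended behaviour, not the actual Laravel/PKP worker code:

```php
<?php
// Sketch of a worker that exits after processing a fixed number of jobs.
// With $maxJobs = 1 every job gets a brand-new PHP process (fresh statics,
// empty caches), at the cost of re-booting the application each time.
function runWorker($queue, int $maxJobs = 1): void
{
    $processed = 0;

    while ($processed < $maxJobs) {
        $job = $queue->pop();

        if ($job === null) {
            sleep(3); // nothing queued yet; keep waiting
            continue;
        }

        try {
            $job->handle();
        } catch (\Throwable $e) {
            // a failed job still counts, so the worker still quits properly
        }

        $processed++;
    }

    exit(0); // Supervisor restarts the process for the next job
}
```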
@jonasraoni If I understand correctly, the purposes of this change are:
To handle the adaptation of code changes, I think a better way is to introduce a command similar to `php artisan queue:restart` into our `jobs` command, such as `jobs restart`. It will instruct the worker daemon to gracefully quit (that is, after completing the current job), and supervisor will eventually restart it. I plan to work on this this week as it should be a very minor enhancement. See the comment at https://github.com/pkp/pkp-lib/issues/9591#issuecomment-1921187521.
Now, to handle the issues with static data and caches: this is more of an application-architecture-level requirement than a worker-specific issue. If we have problems that arise due to static data or caches during job execution via the worker daemon, that means we need to address them at the application level. Otherwise we will only introduce a patch-up rather than addressing the real problem, and that problem will eventually grow and get out of hand. I am more interested to know for which jobs such issues arise when running via the worker daemon, and whether we can fix them at the application level. Restarting also reduces the worker's efficiency by booting the application on each job execution, which in a sense defeats one of the core purposes of a worker daemon. I leave the final say to @asmecher.
The graceful restart/shutdown is required, but I'm not going to cover it here.
My objective is solely to address the static data/cache, to avoid unexpected issues for users.
Basically, I don't trust that our application is safe to run as a daemon, like a Java/C# HTTP server. Nobody coded things with this expectation in mind, we've got a couple of caches spread across the application, and there are the other issues that I've mentioned in the description.
As an example, the `Site` object is cached; if the user updates any setting, the change won't be seen by the worker.
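As a contrived illustration of that failure mode (the class and cache below are hypothetical, not PKP's actual `Site` loading code):

```php
<?php
// Hypothetical example of why a static cache goes stale in a daemonized
// worker; the names below do not mirror PKP's real code.
class SiteCache
{
    private static ?array $site = null;

    public static function get(callable $loadFromDb): array
    {
        // The first job populates the static; every later job handled by the
        // same PHP process reuses it, even if the database row has changed.
        return self::$site ??= $loadFromDb();
    }
}

// Job 1 (worker just started): reads ['contactEmail' => 'old@example.com'].
// An admin then changes the contact email through the web UI.
// Job 2 (same worker process): still sees 'old@example.com', because the
// static only resets when the process exits.
```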
For now, I think it's safer to disable the worker/restart it after each job (which is what the linked PR does), and accept the performance penalty, until we ensure things are ok (another issue can be created for that so it isn't forgotten; note that I just commented out the code instead of removing it). Getting back to the previous behavior won't require changes on the user side.
@jonasraoni, on the other hand, we have very few jobs and developers are already forced to code cautiously, e.g. by specifying a router explicitly when building a URL using `$dispatcher->url`.
I think longer-term we'll need to clean up our use of static caching in favour of e.g. object caching, which we'll need for performance reasons anyway; in the meantime, it'll be necessary for the dev working on jobs to be careful about assumptions. We will also require this for invitations -- the handling code will need to be supplied a minimal set of database IDs, and any assumptions about relationships will need to be checked when handling the invitation to ensure that the underlying conditions haven't changed in the meantime.
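A rough sketch of that pattern, with hypothetical class and repository names (not the actual invitations API):

```php
<?php
// Sketch of the "carry only database IDs, re-check everything when handling"
// pattern; the classes and repositories here are hypothetical.
class HandleInvitationJob
{
    public function __construct(
        protected int $invitationId,
        protected int $contextId,
        protected int $userId
    ) {
        // Only scalar IDs are serialized into the job payload; no live
        // objects or cached state travel with the job.
    }

    public function handle($invitationRepo, $userRepo): void
    {
        // Re-fetch and re-validate, because the underlying rows may have
        // changed between dispatch and execution.
        $invitation = $invitationRepo->get($this->invitationId);
        $user = $userRepo->get($this->userId);

        if (!$invitation || !$user || $invitation->getContextId() !== $this->contextId) {
            return; // the assumptions no longer hold; skip the work
        }

        // ...proceed with handling the invitation
    }
}
```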
I'm not worried about the `$site` and (to a lesser extent) `$context` objects. Changes to their settings will occur very rarely.
My preference is to disable the long-running process until we do a review.
My worst fear is cached data from journal A being shared with journal B; this might lead to irreversible problems.
Would we not already see problems like that with scheduled tasks, either those that piggy-back off requests using shutdown functions or those using cron? Those both tend to iterate through all contexts looking for work to do, so would be prey to the same issue. I'm not sure it's a new risk.
@asmecher I've just reviewed our usage of static variables.
There are a few places that could become memory hogs in a long-running process, for example https://github.com/pkp/pkp-lib/blob/f2f55822885a2143ed78154946e48f42afd3eab9/classes/search/SubmissionSearchDAO.php#L82
About this one, I've inspected a large installation, and loading all keywords would amount to about 2MB of characters (as it's UTF-8, maybe a bit more). I think it's ok to ignore for now, but an LFU cache would be a better structure here.
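For example, a size-bounded LFU-style cache could replace the unbounded static array; this is only a sketch of the idea, not a proposed PKP API:

```php
<?php
// Minimal sketch of a size-bounded, least-frequently-used cache that could
// stand in for an unbounded static keyword cache. Illustrative only.
class BoundedLfuCache
{
    /** @var array<string, mixed> */
    private array $values = [];
    /** @var array<string, int> */
    private array $hits = [];

    public function __construct(private int $capacity = 1000)
    {
    }

    public function get(string $key, callable $load): mixed
    {
        if (!array_key_exists($key, $this->values)) {
            if (count($this->values) >= $this->capacity) {
                // Evict the least frequently used entry to keep memory bounded.
                asort($this->hits);
                $leastUsed = array_key_first($this->hits);
                unset($this->values[$leastUsed], $this->hits[$leastUsed]);
            }
            $this->values[$key] = $load();
            $this->hits[$key] = 0;
        }

        $this->hits[$key]++;

        return $this->values[$key];
    }
}
```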
Some things kept in the `Registry` class might cache `Context`s internally, like the `TemplateManager`, `Request`, etc., but I didn't find jobs using them directly (hopefully not indirectly either...).
So, there are some existing issues to solve (in another issue):
- Cached global objects (e.g. the `Site`) should incur a worker restart when updated.

I'll close the PRs and leave this issue open to investigate further.
Describe the bug
With the introduction of jobs in the 3.4 version of the software, we've also recently started promoting the usage of `supervisor` as a way to execute the jobs instead of `cron`.

AFAICS when using `supervisor`, a PHP process is kept in memory, which leads to some issues:
- Static data (e.g. `function () { static $cache; }`) might pose a poisoning risk (e.g. unrelated cached data from a previous job reused by another)
- Code changes will require restarting `supervisor`, which will also require care on production environments (there must be a way to gracefully restart and avoid interrupting running jobs)

Solution
On the long run, we can address everything on a code level. On the short run, we might inspect whether it's possible to run only a single job and restart the process, or to stop promoting `supervisor` over `cron`.

To Reproduce
- Create a job that stores a static value (e.g. `file_put_contents('out.txt', Class::$variable ??= uniqid());`); notice the class must be "loadable". A full job sketch is included at the end of this description.
- Run the job more than once and notice that the `uniqid()` value is persisted across the calls.

Related

What application are you using?
OJS 3.4
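For concreteness, the reproduction steps above could use a minimal job along these lines (the class name is hypothetical; adapt it to the actual jobs API):

```php
<?php
// Hypothetical job demonstrating that statics survive between executions
// when the worker runs as a daemon. The class name is illustrative only.
class StaticLeakDemoJob
{
    public static ?string $marker = null;

    public function handle(): void
    {
        // Under cron (new process per run) this appends a different id each
        // time; under a daemonized worker the same id repeats, because the
        // static keeps its value until the PHP process exits.
        file_put_contents(
            'out.txt',
            (self::$marker ??= uniqid()) . PHP_EOL,
            FILE_APPEND
        );
    }
}
```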