pkp / pkp-lib

The library used by PKP's applications OJS, OMP and OPS, open source software for scholarly publishing.
https://pkp.sfu.ca
GNU General Public License v3.0

Inspect effects of using supervisor to handle jobs #9591

Open · jonasraoni opened this issue 10 months ago

jonasraoni commented 10 months ago

Describe the bug

With the introduction of jobs in the 3.4 version of the applications, we've also recently started promoting the usage of supervisor, instead of cron, as a way to execute the jobs.

AFAICS, when using supervisor, a PHP process is kept in memory, which leads to some issues:

  - Static class variables and in-process caches persist across job executions (see "To Reproduce" below).
  - Code changes will require restarting supervisor, which will also require care on production environments (there must be a way to gracefully restart and avoid interrupting running jobs).
  - Long-lived static caches can keep growing and become memory hogs.

Solution

In the long run, we can address everything at the code level. In the short run, we might inspect whether it's possible to run only a single job and then restart the process, or stop promoting supervisor over cron.

To Reproduce

  1. Dispatch at least two jobs that make use of a static class variable (e.g. file_put_contents('out.txt', SomeClass::$variable ??= uniqid());); note that the class must be "loadable".
  2. See that the same uniqid() value is persisted across the calls.
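
For illustration, a minimal job along these lines (class and property names are made up) makes the persistence visible:

```php
<?php
// Hypothetical job demonstrating the problem: under a long-running
// worker, the static property survives between jobs, so every
// execution appends the same uniqid() value instead of a new one.
class StaticStateDemoJob
{
    public static ?string $variable = null;

    public function handle(): void
    {
        file_put_contents(
            'out.txt',
            (self::$variable ??= uniqid()) . PHP_EOL,
            FILE_APPEND
        );
    }
}
```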

Related

What application are you using? OJS 3.4

touhidurabir commented 9 months ago

> Code changes will require restarting supervisor, which will also require care on production environments (there must be a way to gracefully restart and avoid interrupting running jobs)

The solution for this is not to restart supervisor but to restart the workers. Supervisor is a long-running process that monitors whether the workers are running properly. So basically one needs to run the php artisan queue:restart command, which gracefully kills each worker (even when there are multiple workers) after it completes its current job, and Supervisor will then restart the worker(s).

Currently we do not have this, and artisan is not available, but it should not be too difficult to bring it into our system and the jobs command.

Basically, the point is that supervisor should not be restarted, but rather the workers (gracefully), as abruptly restarting supervisor will kill the worker(s) immediately, even in the middle of processing a job, which may cause unexpected behaviour/outcomes.
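
For context, queue:restart never talks to supervisor at all; it broadcasts a timestamp through the cache that each worker checks between jobs. A simplified sketch of the worker side (condensed from how Laravel's Illuminate\Queue\Worker behaves; this is not pkp-lib code):

```php
<?php
// Simplified, illustrative sketch of the worker-side half of
// queue:restart; requires a bootstrapped Laravel cache to run.
use Illuminate\Support\Facades\Cache;

// The worker remembers the restart timestamp it saw at boot...
$bootTimestamp = Cache::get('illuminate:queue:restart');

while (true) {
    // ...pop and process a single job here...

    // ...and re-reads the timestamp after every job: if queue:restart
    // stored a newer value in the meantime, the worker exits cleanly
    // and Supervisor immediately starts a replacement process.
    if (Cache::get('illuminate:queue:restart') !== $bootTimestamp) {
        exit(0);
    }
}
```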

jonasraoni commented 9 months ago

Yep! Assume supervisor means worker(s) :)

So, I think the solution for now is to force max-jobs to always be 1, which will quit the worker after each job, and to update the documentation to state that it only works together with a supervisor-like application.
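
A rough sketch of that approach, assuming the worker is driven through Laravel's worker options (how pkp-lib actually wires this up may differ):

```php
<?php
// Hedged sketch: forcing the worker to exit after a single job via
// Laravel's WorkerOptions (the maxJobs property exists in Laravel 8+).
use Illuminate\Queue\WorkerOptions;

$options = new WorkerOptions();
$options->maxJobs = 1; // quit the worker after one processed job

// $worker->daemon('database', 'queue', $options);
// Under supervisor, the exiting process is restarted right away, so
// each job runs in a fresh PHP process with clean static state.
```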

jonasraoni commented 9 months ago

@touhidurabir @asmecher check if you agree (failed jobs also trigger the quit properly)

touhidurabir commented 9 months ago

@jonasraoni If I understand correctly, the purposes of this change are:

  1. to allow adapting to code changes without restarting supervisor.
  2. to handle the issues with static variables and caches.

To handle the adaptation of changes, I think a better way is to introduce a command similar to php artisan queue:restart in our jobs command, such as jobs restart. It will instruct the worker daemon to gracefully quit (that is, after completing the current job), and supervisor will eventually restart it. I plan to work on this this week, as it should be a very minor enhancement; see the comment at https://github.com/pkp/pkp-lib/issues/9591#issuecomment-1921187521.
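
As a rough sketch, such a command would only need to broadcast the restart signal (the class and method here are hypothetical, not existing pkp-lib code):

```php
<?php
// Hypothetical shape of a "jobs restart" subcommand; names are
// illustrative only and requires a bootstrapped Laravel cache.
use Illuminate\Support\Facades\Cache;

class JobsCommand
{
    /**
     * Signal every running worker to exit after its current job, the
     * same way artisan queue:restart does; supervisor then restarts
     * the worker processes.
     */
    public function restart(): void
    {
        Cache::forever('illuminate:queue:restart', time());
        echo "Workers will restart after finishing their current job.\n";
    }
}
```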

Now, to handle the issues with static data and caches: this is more of an application-architecture-level requirement than a worker-specific issue. If we have problems that arise from static data or caches during job execution via the worker daemon, that means we need to address them at the application level. Otherwise we will only introduce a patch-up rather than addressing the real problem, and that problem will eventually grow and get out of hand. I am more interested to know for which jobs such issues arise when running via the worker daemon, and whether we can fix them at the application level. Booting the application on each job execution also reduces the worker's efficiency, which in a sense defeats one of the core purposes of a worker daemon. I leave the final say to @asmecher.

jonasraoni commented 9 months ago
  1. The graceful restart/shutdown is required, but I'm not going to cover it here.

  2. My objective is solely to address the static data/cache, to avoid unexpected issues for users.

Basically, I don't trust that our application is safe to run as a daemon, like a Java/C# HTTP server. Nobody coded things with this expectation in mind, we've got a couple of caches spread across the application, and there are the other issues that I've mentioned in the description. As an example, the Site object is cached; if the user updates any setting, the change won't be seen by the worker.
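
A toy illustration of the concern (not pkp-lib code): a static cache filled once in the worker process never reflects updates made by separate web-request processes:

```php
<?php
// Toy example of a per-process static cache going stale in a
// long-running worker; names are made up.
class SiteCache
{
    private static ?array $site = null;

    public static function get(): array
    {
        // The first call memoizes the settings for the lifetime of
        // this PHP process; updates made by other processes (e.g. a
        // web request saving new site settings) are never seen here.
        return self::$site ??= ['title' => 'Loaded at ' . date('H:i:s')];
    }
}

// Job 1 and job 1000 in the same worker get the same memoized copy:
var_dump(SiteCache::get() === SiteCache::get()); // bool(true)
```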

For now, I think it's safer to disable the worker/restart it after each job (which is what the linked PR does) and accept the performance penalty until we ensure things are OK (another issue can be created for that, so it isn't forgotten; note that I just commented out the code instead of removing it). Getting back to the previous behavior won't require changes on the user side.

asmecher commented 9 months ago

@jonasraoni, on the other hand, we have very few jobs and developers are already forced to code cautiously, e.g. by specifying a router explicitly when building a URL using $dispatcher->url.

I think longer-term we'll need to clean up our use of static caching in favour of e.g. object caching, which we'll need for performance reasons anyway; in the meantime, it'll be necessary for the dev working on jobs to be careful about assumptions. We will also require this for invitations -- the handling code will need to be supplied a minimal set of database IDs, and any assumptions about relationships will need to be checked when handling the invitation to ensure that the underlying conditions haven't changed in the meantime.
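
A hedged sketch of that pattern (the class and lookups are hypothetical, not pkp-lib APIs): the job carries only scalar database IDs and re-validates everything when it runs:

```php
<?php
// Hypothetical job illustrating the "minimal set of database IDs"
// pattern described above; the lookups are stubbed for brevity.
class HandleInvitationJob
{
    public function __construct(
        protected int $contextId,
        protected int $userId
    ) {
        // Only scalar IDs are stored; full objects are never
        // serialized into the queue, since the underlying rows may
        // change between dispatch and execution.
    }

    public function handle(): void
    {
        // Re-fetch fresh records and re-check assumptions at run time.
        $context = $this->findContext($this->contextId); // hypothetical lookup
        $user = $this->findUser($this->userId);          // hypothetical lookup

        if ($context === null || $user === null) {
            return; // Underlying conditions changed; bail out gracefully.
        }
        // ...process the invitation using the fresh data...
    }

    private function findContext(int $id): ?object { return null; /* stub */ }
    private function findUser(int $id): ?object { return null; /* stub */ }
}
```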

I'm not worried about the $site and (to a lesser extent) $context objects. Changes to their settings will occur very rarely.

jonasraoni commented 9 months ago

My preference is to disable the long running process until we do a review.

My worst fear is cached data from journal A being shared with journal B; this might lead to irreversible problems.

asmecher commented 9 months ago

Would we not already see problems like that with scheduled tasks, either those that piggy-back off requests using shutdown functions or those using cron? Those both tend to iterate through all contexts looking for work to do, so would be prey to the same issue. I'm not sure it's a new risk.

jonasraoni commented 8 months ago

@asmecher I've just reviewed our usage of static variables.

There are a few places that could become memory hogs in a long-running process, for example https://github.com/pkp/pkp-lib/blob/f2f55822885a2143ed78154946e48f42afd3eab9/classes/search/SubmissionSearchDAO.php#L82

About this one, I've inspected a large installation, and the number of characters needed to load all keywords would be about 2MB (as it's UTF-8, maybe a bit more). I think it's OK to ignore for now, but an LFU cache would be a better structure here.
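
For the record, a minimal LFU-style cache could look like this (an illustrative sketch, not a proposal for pkp-lib's actual implementation):

```php
<?php
// Minimal LFU cache sketch: bounded size, evicts the least frequently
// used entry when full. Illustrative only.
class LfuCache
{
    /** @var array<string, mixed> */
    private array $values = [];
    /** @var array<string, int> */
    private array $hits = [];

    public function __construct(private int $capacity = 10000)
    {
    }

    public function get(string $key): mixed
    {
        if (!array_key_exists($key, $this->values)) {
            return null;
        }
        $this->hits[$key]++;
        return $this->values[$key];
    }

    public function put(string $key, mixed $value): void
    {
        if (!array_key_exists($key, $this->values)
            && count($this->values) >= $this->capacity) {
            // Evict the entry with the fewest hits.
            asort($this->hits);
            $leastUsed = array_key_first($this->hits);
            unset($this->values[$leastUsed], $this->hits[$leastUsed]);
        }
        $this->values[$key] = $value;
        $this->hits[$key] ??= 1;
    }
}
```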

Some things kept in the Registry class might cache Contexts internally, like the TemplateManager, Request, etc., but I didn't find jobs using them directly (hopefully not indirectly either...).

So, there are some existing issues to solve (in another issue):

I'll close the PRs and leave this issue open to investigate further.