procrastinate-org / procrastinate

PostgreSQL-based Task Queue for Python
https://procrastinate.readthedocs.io/
MIT License

Run sync task in its own subprocess #1161

Open medihack opened 4 weeks ago

medihack commented 4 weeks ago

Currently, a synchronous task runs in its own thread (since v2.13.0 / PR #1160). As discussed in #1156, we should evaluate whether we want to run a synchronous task in its own subprocess.

Advantages:

Disadvantages:

ewjoachim commented 4 weeks ago

I have to admit I have limited multiprocessing experience (it just happens to never have been on my radar).

From what I know, Pipes and Queues might be what Python gives us to communicate between processes. Objects put in queues are pickled. Pipes let us transfer text payloads.
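As a rough illustration of the pipe-based approach, here is a minimal sketch of running a sync function in a child process and getting its result back. `run_in_subprocess` and `_child` are hypothetical names, not Procrastinate APIs, and the `fork` start method is assumed only to keep the sketch short (POSIX only). Note that `Connection.send()` pickles its argument just like a queue does; `send_bytes()` is the variant for raw payloads.

```python
import multiprocessing as mp

# Assumption: "fork" keeps this sketch short (POSIX only). With "spawn",
# the calling code has to live under `if __name__ == "__main__":`.
ctx = mp.get_context("fork")


def _child(conn, func, args):
    """Run func in the child and send back ("ok", result) or ("error", repr)."""
    try:
        conn.send(("ok", func(*args)))  # send() pickles the tuple
    except Exception as exc:
        conn.send(("error", repr(exc)))
    finally:
        conn.close()


def run_in_subprocess(func, *args):
    """Hypothetical helper: run func(*args) in its own process."""
    # duplex=False returns (receive-only end, send-only end)
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_child, args=(child_conn, func, args))
    proc.start()
    child_conn.close()  # the parent only reads
    try:
        return parent_conn.recv()
    finally:
        proc.join()
        parent_conn.close()


def add(a, b):
    return a + b
```

Calling `run_in_subprocess(add, 2, 3)` returns a `(status, payload)` tuple, so the parent can tell a result from a failure. The result has to be picklable, which is one of the constraints a subprocess model would put on tasks.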

Another complex point (but linked to the JobManager point) is the psycopg pool: does multiprocessing imply that each process will open its own pool? That might be a little overkill, though I don't know how it will play out. In particular, we might not need a connection at all unless we use task.defer from within the task. We could hack something to use a special connector in the task process that sends postgres queries through the pipe, to be handled by the parent.

medihack commented 3 weeks ago

From what I know, Pipes and Queues might be what Python gives us to communicate between processes. Objects put in queues are pickled. Pipes let us transfer text payloads.

Yes, and events (for something like the abort request).
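To make the event idea concrete, here is a minimal sketch of a cooperative abort using `multiprocessing.Event`. The names `long_task` and `run_and_abort` are made up for illustration, and the `fork` start method is again an assumption to keep the example short.

```python
import multiprocessing as mp
import time

# Assumption: "fork" start method (POSIX only), to keep the sketch short.
ctx = mp.get_context("fork")


def long_task(abort_event, result_queue):
    """Hypothetical task that checks for an abort request between steps."""
    for step in range(1000):
        if abort_event.is_set():
            result_queue.put(("aborted", step))
            return
        time.sleep(0.01)  # stands in for one unit of real work
    result_queue.put(("done", 1000))


def run_and_abort(after_seconds):
    """Start the task, request an abort after a delay, return its report."""
    abort_event = ctx.Event()
    result_queue = ctx.Queue()
    proc = ctx.Process(target=long_task, args=(abort_event, result_queue))
    proc.start()
    time.sleep(after_seconds)
    abort_event.set()  # the abort request crosses the process boundary
    result = result_queue.get(timeout=5)  # read before join to avoid a deadlock
    proc.join(timeout=5)
    return result
```

This is still cooperative: the task has to check the event. Unlike a thread, though, a child process that never checks can be stopped from the parent with `Process.terminate()`.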

Another complex point (but linked to the JobManager point) is the psycopg pool: does multiprocessing imply that each process will open its own pool? That might be a little overkill, though I don't know how it will play out. In particular, we might not need a connection at all unless we use task.defer from within the task. We could hack something to use a special connector in the task process that sends postgres queries through the pipe, to be handled by the parent.

Yes, if the database is queried directly, each process would use its own connection pool. I find it difficult to judge whether this would be a problem in a real-life application or just a theoretical one. Having a special connector and doing something like RPC between the processes sounds like a cool idea. Unfortunately, the same problem exists for database connections outside of Procrastinate, for example when users run Django model queries. Those connections would be opened in the subprocess anyway.
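A minimal sketch of what such an RPC-style connector could look like, assuming a duplex pipe between parent and child. `PipeConnector`, `run_with_proxy`, and `fake_db` are hypothetical names for illustration only; nothing here is Procrastinate's or psycopg's actual API, and the parent's pool is replaced by a plain function.

```python
import multiprocessing as mp

# Assumption: "fork" start method (POSIX only), to keep the sketch short.
ctx = mp.get_context("fork")


class PipeConnector:
    """Hypothetical child-side connector: forwards queries to the parent."""

    def __init__(self, conn):
        self._conn = conn

    def execute(self, query, **params):
        self._conn.send(("query", query, params))
        return self._conn.recv()  # block until the parent answers


def _child(conn, task):
    try:
        conn.send(("done", task(PipeConnector(conn))))
    finally:
        conn.close()


def run_with_proxy(task, handle_query):
    """Run task in a child; answer its queries with handle_query in the parent."""
    parent_conn, child_conn = ctx.Pipe()  # duplex by default
    proc = ctx.Process(target=_child, args=(child_conn, task))
    proc.start()
    child_conn.close()
    try:
        while True:
            message = parent_conn.recv()  # EOFError if the child dies early
            if message[0] == "done":
                return message[1]
            _, query, params = message
            parent_conn.send(handle_query(query, params))
    finally:
        proc.join()
        parent_conn.close()


def fake_db(query, params):
    # Stands in for the parent's real connection pool.
    return {"query": query, "params": params, "job_id": 42}


def task(connector):
    row = connector.execute("defer_job", task_name="send_email")
    return row["job_id"]
```

With these definitions, `run_with_proxy(task, fake_db)` returns the `job_id` produced on the parent side. As noted above, this only covers queries that go through the connector; ORM connections opened by user code would still live in the child.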

ewjoachim commented 6 days ago

I'm a bit wary of changing the model just like this. I wonder if we should maybe add options (we don't have to pick them all):

As usual, the annoying part is trying to guess what it's going to be like for folks who use just async, just sync, a mix of both, with or without Django, etc.

medihack commented 6 days ago

Or, make it configurable, as Huey does with worker types. We could have a sync_type option on the worker (or the task itself). But there is so much stuff already in the v3 release that we should postpone this feature for a later release. We can add it as an experimental feature with a minor release (and keep the default sync tasks threaded), and when we are sure it's stable, we can still switch to subprocesses as the default.
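A minimal sketch of how such an option could be resolved and dispatched, assuming a task-level value overrides the worker-level one and that threads stay the default. `SyncType`, `resolve_sync_type`, and `run_sync` are hypothetical names; `sync_type` is only the option name proposed above, not an existing Procrastinate setting. Error handling is left out.

```python
import enum
import multiprocessing as mp
import os
import threading


class SyncType(enum.Enum):
    THREAD = "thread"
    PROCESS = "process"


def resolve_sync_type(worker_default=SyncType.THREAD, task_override=None):
    """Assumed precedence: the task setting wins over the worker setting."""
    return task_override if task_override is not None else worker_default


def run_sync(func, sync_type):
    """Run func() in a thread or in a child process and return its result."""
    if sync_type is SyncType.THREAD:
        box = []
        thread = threading.Thread(target=lambda: box.append(func()))
        thread.start()
        thread.join()
        return box[0]
    # Assumption: "fork" start method (POSIX only), to keep the sketch short.
    ctx = mp.get_context("fork")
    recv_end, send_end = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=lambda: send_end.send(func()))
    proc.start()
    send_end.close()  # the parent only reads
    try:
        return recv_end.recv()
    finally:
        proc.join()
        recv_end.close()
```

For example, `run_sync(os.getpid, SyncType.THREAD)` returns the worker's own PID, while `run_sync(os.getpid, SyncType.PROCESS)` returns a different one, which shows where the task actually ran.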

ewjoachim commented 6 days ago

We can add it as an experimental feature

Yes, that was my point: I'm comfortable with adding it, and I'm comfortable with it having better support for some features (such as aborting) than the standard way, but I'm not (yet) comfortable with removing the way we do things now.