Run sync task in its own subprocess

medihack commented 3 months ago

Currently, a synchronous task runs in its own thread (since v2.13.0 / PR #1160). As discussed in #1156, we should evaluate whether we want to run a synchronous task in its own subprocess.

Advantages:

Good for CPU-intensive work (for I/O intensive work, we already have async tasks)
Shutdown and timeout handling is easier as subprocesses can be killed
Tasks are more isolated (no threading issues)

Disadvantages:

Complex implementation
- Communication between job and worker (signals)
- Calling the job_manager (database context, connector, ....)
Problematic Windows support (no forking of processes on Windows)

ewjoachim commented 3 months ago

I have to admit I have limited multiprocessing experience (it just happens to never have been on my radar).

From what I know, Pipes and Queues might be what Python gives us to communicate between processes. Object put in queues are pickled. pipes let us transfer text payloads.

Another complex point (but linked to the JobManager point) is the psycopg pool: does multiprocesing imply that each process will open its own pool ? That might be a little overkill, though I don't know how it will play. Especially: we might not need a connection except if we use task.defer from within the task. We could hack something to use a special connector in the task process that sends postgres queries in the pipe, to be handled by the parent.

medihack commented 3 months ago

From what I know, Pipes and Queues might be what Python gives us to communicate between processes. Object put in queues are pickled. pipes let us transfer text payloads.

Yes, and events (for something like the abort request).

Another complex point (but linked to the JobManager point) is the psycopg pool: does multiprocesing imply that each process will open its own pool ? That might be a little overkill, though I don't know how it will play. Especially: we might not need a connection except if we use task.defer from within the task. We could hack something to use a special connector in the task process that sends postgres queries in the pipe, to be handled by the parent.

Yes, if the database is queried directly, each process would use its own connection pool. I find it difficult to judge whether this could really be a problem in a real-life application or just a theoretical problem. Having a special connector and doing something like RPC between the processes sounds like a cool idea. Unfortunately, the same problem exists for database connections besides Procrastinate, when, for example, users do Django model queries. Those connections would take place in the subprocess anyway.

ewjoachim commented 2 months ago

I'm a bit wary of changing the model just like this. I wonder if we should maybe add options (we don't have to pick them all):

Legacy runner that runs async tasks & sync tasks via async to sync (breaking change could be that you need to set your worker to legacy explicitly to get the previous behaviour, or legacy could be the default when nothing is specified with a DeprecationWarning)
Subprocess runner

As usual the annoying part is to try and guess how it's going to be like for folks who use just async, just sync, a mix of both, with or without Django etc.

medihack commented 2 months ago

Or, make it configurable, as Huey does with worker types. We could have a sync_type option on the worker (or the task itself). But there is so much stuff already in the v3 release that we should postpone this feature for a later release. We can add it as an experimental feature with a minor release (and keep the default sync tasks threaded), and when we are sure it's stable, we can still switch to subprocesses as the default.

ewjoachim commented 2 months ago

We can add it as an experimental feature

Yes that was my point: I'm comfortable with adding it, I'm comfortable with it having better support over some features (such as aborting) than the standard way but I'm not (yet) comfortable with removing the way we do things now.

procrastinate-org / procrastinate

Run sync task in its own subprocess #1161