procrastinate-org / procrastinate

PostgreSQL-based Task Queue for Python
https://procrastinate.readthedocs.io/
MIT License
840 stars 52 forks

Run sync task in its own thread #1156

Closed · medihack closed this 4 weeks ago

medihack commented 1 month ago

Closes #1124


github-actions[bot] commented 1 month ago

Coverage report

[Coverage table: procrastinate/utils.py and project total; per-file figures not preserved]

This report was generated by python-coverage-comment-action

medihack commented 1 month ago

Running the sync task in its own thread ensures that it is better isolated. One test (with nested asyncio) looks a bit contrived, but it fails (even without any exception) when the sync task runs in the main thread (thread_sensitive=True). I tried to find any negative consequences (especially for Django), but since Django creates db connections per thread, I don't see any. However, I am not absolutely sure.
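For illustration, here is a minimal stdlib sketch of the idea (procrastinate itself wraps sync tasks with asgiref's sync_to_async; asyncio.to_thread stands in for thread_sensitive=False here): the sync task runs on a dedicated worker thread, so the event loop thread is never blocked.

```python
import asyncio
import threading


def sync_task() -> str:
    # A blocking sync task; returns the name of the thread it runs on.
    return threading.current_thread().name


async def main() -> str:
    # asyncio.to_thread offloads the call to a worker thread -- the same
    # isolation that thread_sensitive=False gives with asgiref's
    # sync_to_async: the main (event loop) thread stays free.
    return await asyncio.to_thread(sync_task)


if __name__ == "__main__":
    worker = asyncio.run(main())
    # The task did not run on the main thread.
    assert worker != threading.main_thread().name
```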

ewjoachim commented 4 weeks ago

The discussions on Discord are prompting me to wonder: if we're considering changing this, should we consider running sync tasks in subprocesses? It would avoid threading issues and would make it much easier to kill a task or send it a signal, the same way we do with async. But I don't want this decision to be made lightly and regretted later.

medihack commented 4 weeks ago

When doing IO-heavy stuff, an async task is preferred over a threaded sync task. (If the task depends on a non-async third-party library, then the chances are high that there is an async alternative.) On the other hand, if the task is compute-heavy, then running it in its own process is the way to go.
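A minimal stdlib sketch of the subprocess option for compute-heavy tasks (an illustration, not procrastinate's actual worker code): the sync task runs in a ProcessPoolExecutor, so it does not contend for the GIL.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def cpu_heavy(n: int) -> int:
    # Compute-bound work: holds the GIL, so a thread would not help.
    return sum(i * i for i in range(n))


async def main() -> int:
    loop = asyncio.get_running_loop()
    # run_in_executor ships the call to a separate process; the event
    # loop stays responsive, and the process could be killed if needed.
    with ProcessPoolExecutor(max_workers=1) as pool:
        return await loop.run_in_executor(pool, cpu_heavy, 1_000)


if __name__ == "__main__":
    print(asyncio.run(main()))
```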

As mentioned, we could also make this configurable: as an option on the worker, for example, as huey does it (though huey chose threaded as the default), or as an option on the task itself (something like @app.task(sync_threaded=True)). But even then, the question remains what the default should be (in my opinion, for the above-mentioned reason, a separate process).

The downside of making it configurable is that the worker gets even more complex, and we still can't always terminate the task.
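A hypothetical sketch of the per-task option mentioned above (the task decorator, sync_threaded flag, and TASKS registry are invented names for illustration, not an existing procrastinate API):

```python
# Registry the worker could consult to decide how to run each sync task.
TASKS = {}


def task(sync_threaded=False):
    # Hypothetical decorator: records whether this sync task should run
    # in a thread (True) or in a subprocess (False, the proposed default).
    def wrapper(func):
        TASKS[func.__name__] = {"func": func, "sync_threaded": sync_threaded}
        return func
    return wrapper


@task(sync_threaded=True)
def export_report():
    # IO-bound sync task: a thread is a reasonable choice here.
    return "done"
```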

ewjoachim commented 4 weeks ago

If we did:

async def -> async (good for I/O & high level of parallelism, bad for CPU)
def -> subprocess (low level of parallelism, not specifically bad for I/O, good for CPU)

That wouldn't be that far from what we have today, would it?
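That mapping could be sketched roughly like this (an illustration of the proposed dispatch rule, not actual procrastinate code):

```python
import asyncio
import inspect
from concurrent.futures import ProcessPoolExecutor


async def dispatch(func, *args):
    # Proposed rule: coroutine functions run directly on the event loop;
    # plain sync functions are shipped off to a subprocess.
    if inspect.iscoroutinefunction(func):
        return await func(*args)
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor(max_workers=1) as pool:
        return await loop.run_in_executor(pool, func, *args)
```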

onlyann commented 4 weeks ago

It sounds reasonable to remove the thread sensitivity when wrapping sync tasks in a Thread, given the current limitations. At least it will help sync tasks that are IO-bound.

Maybe this change could be merged while the conversation of introducing sub-processes for CPU bound tasks carries on?

medihack commented 4 weeks ago

If we did:

async def -> async (good for I/O & high level of parallelism, bad for CPU)
def -> subprocess (low level of parallelism, not specifically bad for I/O, good for CPU)

Yes, this is what I had in mind (when not making it configurable). We would then run the sync task in a ProcessPoolExecutor. I also looked around a bit at how other distributed (Python) task queues handle this:

So, the majority seem to use a separate process.

As you already mentioned, another advantage of using a separate process is that it can be killed. That way, we could even introduce a final shutdown timeout that forces the worker to stop entirely (see discussion on Discord). I don't know if this is even possible when only using threads. Another feature that a separate process would make possible is to (optionally) specify the maximum time a task can take.
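The timeout idea could look roughly like this (a stdlib sketch, not a proposed API): join the subprocess with a deadline and terminate it if it overruns, something no thread allows.

```python
import multiprocessing
import time


def slow_task():
    # Simulates a sync task that overruns its time budget.
    time.sleep(30)


def run_with_timeout(target, timeout: float) -> str:
    # Only a subprocess can be force-killed like this; Python offers no
    # safe way to terminate a thread from the outside.
    proc = multiprocessing.Process(target=target)
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.terminate()  # delivers SIGTERM on POSIX
        proc.join()
        return "killed"
    return "finished"
```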

Maybe this change could be merged while the conversation of introducing sub-processes for CPU bound tasks carries on?

Either way, we said it is a breaking change (is it?!) targeted at v3. So, it won't be in a near-term 2.x release. Or should we merge this one into the main branch and work on a "sub-process" variant in the v3 branch?

onlyann commented 4 weeks ago

Either way, we said it is a breaking change (is it?!) targeted at v3. So, it won't be in a near-term 2.x release. Or should we merge this one into the main branch and work on a "sub-process" variant in the v3 branch?

A merge into main is fine if someone needs it prior to the v3 release. It could be argued it is not a breaking change, given there is no mention in the docs that sync tasks block each other, and I can hardly imagine anyone using this library with long-running sync tasks because of it.

That said, my comment was just about merging it into v3. Maybe it won't ever make it to the release if everything is a sub-process but we are better off iterating without thread sensitivity until then.

medihack commented 4 weeks ago

A merge into main is fine if someone needs it prior to the v3 release. It could be argued it is not a breaking change, given there is no mention in the docs that sync tasks block each other, and I can hardly imagine anyone using this library with long-running sync tasks because of it.

Yes, I think so, too. One can even argue that the current implementation is a bug, as concurrent sync tasks are never processed in parallel (but consecutively in the main thread). @ewjoachim Would it be ok if we merge this into the main branch, make a minor release, and open another issue / PR with a subprocess implementation?

ewjoachim commented 4 weeks ago

@ewjoachim Would it be ok if we merge this into the main branch, make a minor release, and open another issue / PR with a subprocess implementation?

Yep. Will try to review asap!

medihack commented 4 weeks ago

Yep. Will try to review asap!

Sure, no hurry. I just wanted to know if it's ok if I switch the target branch. (There are conflicts when switching the branch to main. I will put up another PR against main later, and then we can decide which to merge.)