sagemath / sage

Main repository of SageMath
https://www.sagemath.org
Other
1.36k stars 462 forks source link

PARI is not thread-safe from above #28800

Open embray opened 4 years ago

embray commented 4 years ago

This is a follow-up to #26608. That ticket specifically discussed the issue of multi-threaded PARI causing Sage's docbuild to break. That problem is worked around in #28356, so I decided to close #26608.

However, the general problem remains, which is that PARI is not thread-safe from above, meaning that while threads created and managed by the PARI library itself work fine, threads created in a multi-system environment (like Sage) which happen to use PARI (specifically PARI built with multi-threading support) it will segfault.

This has been discussed in #26608 as well as this discussion on the OpenDreamKit project as well as related in-person conversations for which I unfortunately lack notes.

With #26608 resolved this is fortunately not an immediate problem for Sage, though it would be very easy for someone thinking they can just carelessly use threads (e.g. from the Python level) in their own code and experience similar crashes.

This is also not a problem just of PARI; getting this kind of multi-system multi-level parallelism right is hard, and should require cooperation, and specific guidelines to follow. Although I have not yet hit any other specific examples I have no doubt that the exist; for example I would not be surprised if HPC-GAP has similar problems.

Upstream: Reported upstream. Developers deny it's a bug.

CC: @antonio-rojas @jdemeyer @kiwifb @dimpase @saraedum @embray @timokau

Component: interfaces

Keywords: pari threading parallelism gap

Issue created by migration from https://trac.sagemath.org/ticket/28800

embray commented 4 years ago

Description changed:

dimpase commented 4 years ago
comment:3

Macaulay2 does have these problems with Pari, AFAIK. They are thinking of removing Pari from their dependencies all together due to this.

embray commented 4 years ago
comment:4

As I think I mentioned in the original discussion, the issue could be mitigated in PARI somewhat, with a few strategically-placed checks to ensure that important thread-local variables have been initialized, and re-initialize them as needed (using more-or-less existing code for re-initializing PARI in its own, self-managed threads).

embray commented 4 years ago
comment:5

An alternative approach (although one that would still be made easier with some internal refactoring of PARI*) would be like Python's PyGILState_Ensure(). This places some onus on users of PARI in multi-threaded code to make sure PARI's interpreter state is properly initialized before using it in a new thread, which is not an unfair thing to ask users to do.

* I should clarify what I mean by this. When multi-threading was added to PARI, a number of global variables were simply converted directly to thread-local variables (sometimes it's not clear to me if all of them need to be thread-local; I don't know). By contrast, to use CPython again as an example (since I know it well), all variables that need to be thread specific (e.g. in PARI this would include things like the main stack pointer) are collected into a single PyThreadState struct, which makes it much easier to manage. Each thread has its own threadstate stored in a thread-local variable, so just one variable instead of a whole bunch (meaning only one call to the TSS APIs to get/set it). A similar reorganization of PARI's thread-specific variables would be helpful.

dimpase commented 4 years ago
comment:6

one way around this difficulty is to use spawn, not fork, in multiprocessing, something that is available in Python 3.7 and later. One just calls

multiprocessing.set_start_method('spawn')

somewhere early enough.

This alone is not enough, one needs to rework various things due to spawn pickling the environment.

embray commented 4 years ago
comment:7

Yeah, setting set_start_method('spawn') globally would wreak havoc, though might be useful in some careful cases. I think this specific issue would be better addressed with improvements to how PARI manages its thread-local state.