tbenthompson / cppimport

Import C++ files directly from Python!
MIT License
1.18k stars 67 forks source link

Multiprocessing file lock hangs on certain filesystems #90

Closed vbharadwaj-bk closed 4 months ago

vbharadwaj-bk commented 1 year ago

I recently noticed that cppimport version 22.7.12+ hangs during the build process (our computing center NERSC recently reconfigured their filesystem). The culprit turned out to be a file lock in introduced in #67 and merged in #71 , which enables multiple processes to use a cppimport-compiled module without stepping on each other's toes. The lock is on line https://github.com/tbenthompson/cppimport/blob/main/cppimport/importer.py#L49.

Unfortunately, our new filesystem configuration does not support file locking as it uses a particular data virtualization service (DVS) - see here for documentation of this issue: https://docs.nersc.gov/development/languages/python/parallel-python/#parallel-io-with-h5py. I solved my problem by rolling back to version 22.5.11, right before the file lock was introduced.

Other filesystems may suffer from similar restrictions. Is it worth adding a setting for cppimport that enables / disables the file lock for multiprocessing build support? For what it's worth, I'm using cppimport in code called from multiple processes, and I protect the build process manually outside the package. I really like the fix introduced by the lock and think it would address the need on most filesystems - but it would be nice to turn it off.

If there is interest in adding the setting, I can write a fix and propose a PR. Feedback solicited!

tbenthompson commented 1 year ago

Is it worth adding a setting for cppimport that enables / disables the file lock for multiprocessing build support?

Yes!

Thanks for raising this issue. It makes complete sense.

Thoughts:

vbharadwaj-bk commented 1 year ago

Sounds great! I'll put a PR together that, at a minimum, adds a setting 'use_file_lock' with a default of True, but when False runs the build without the lock. Regarding the comments about automatic detection and exception raising:

On our filesystem, the nature of the error is that filelock silently fails to acquire a lock on any process. The failure to acquire a lock is indistinguishable (for all but one process) from the correctly-working case where a single process acquires the lock successfully and the rest are waiting for the build to finish, at least until the timeout period expires. It would be easy to solve this if the processes could communicate and realize that none of them are acquiring the lock, but I'm not sure that is reasonable. As a result, here are the communication-free possibilities that I can think of for automatic detection / exception raising regarding the lock:

  1. Automatic detection: Every process comes up with a random, unique (with high-probability) lock name and tries acquiring their own lock independently. Enough failures in a reasonable time (say, 20 seconds) raises an exception on each process that file locking is not supported. The downside here is significant extra complexity: one lock per process will have to be created, and something like uuid would be needed to assign unique lock names.

  2. No automatic detection: The processes wait until the timeout period and raise the existing timeout error message with an added caveat of the form above "your filesystem might not support file locking. if that's the case, please set cppimport.settings['use_file_lock'] = False and ensure only one process compiles the module at a time.". The downside here is that the processes must all wait until the lock timeout period expires (10 minutes, when I usually Ctrl+C after a minute) to find the issue. On the upside, I think a section in the README about file locking support would be enough to point people to the correct setting to modify; that's the first place I looked, and this issue should be rare in practice.

I'm leaning towards the second simpler solution that just adds information to the lock timeout error and adds a README section about file locking. It would immediately make the latest version of the package usable (for the correct choice of global settings) on these filesystems. Additional PRs could implement the automatic locking detection on top of this.

tbenthompson commented 1 year ago

Sounds great! I'll put a PR together that, at a minimum, adds a setting 'use_file_lock' with a default of True, but when False runs the build without the lock

Sounds great!

I'm leaning towards the second simpler solution that just adds information to the lock timeout error and adds a README section about file locking. It would immediately make the latest version of the package usable (for the correct choice of global settings) on these filesystems. Additional PRs could implement the automatic locking detection on top of this.

Seems very reasonable!

tbenthompson commented 1 year ago

Thanks!!

drew-parsons commented 5 months ago

I think this is the issue which has now started failing debian CI tests at https://ci.debian.net/packages/c/cppimport/ e.g. https://ci.debian.net/packages/c/cppimport/unstable/amd64/46623673/ https://ci.debian.net/packages/c/cppimport/unstable/arm64/44718580/

The PR will be very welcome.

vbharadwaj-bk commented 5 months ago

Apologies for letting this sleep so long - I've opened a PR with an option to disable the file lock and updated the tests. The file lock remains on by default, so the functionality of any working code is unchanged. All tests except for the multiprocessing test now disable the file lock. The multiprocessing test is disabled by default, but can be run with the invocation pytest --multiprocessing to test the file lock functionality.

Feedback very welcome, some of these choices were opinionated.

drew-parsons commented 5 months ago

The patch looks good to me (but I didn't test it yet patched into my code). Disabling the multiprocessing test by default would keep CI testing stable.

One question for the --multiprocessing flag: is there a (simple) way to check if filelock is supported on a given system? e.g. any entry recorded under /proc/fs/ or another utility tool? /proc/fs/ext4/dm-0/options, for instance, reports dioread_nolock, but I figure that's not relevant (the multiprocessing test passes on this system). If there is a simple check, then I'd use it to activate --multiprocessing in the debian CI scripts.

vbharadwaj-bk commented 5 months ago

@drew-parsons Not sure if there's a universal way to check across systems - but at least on linux, this was helpful using flock. This snippet does the job:

touch testfile
flock ./testfile true && echo ok || echo nok

I tested this snippet on a DVS filesystem that does not support locking (result below):

vbharadw> flock ./testfile true && echo ok || echo nok
flock: ./testfile: Unknown error 524
nok

And on a filesystem that does support locking:

vbharadw> flock ./testfile true && echo ok || echo nok
ok
drew-parsons commented 5 months ago

Easy enough to use in a CI script, thanks!

vbharadwaj-bk commented 4 months ago

Merged- thanks @tbenthompson for the review (and overall maintenance of this package) and @drew-parsons for the nudge!