seung-lab / cloud-files

Threaded Python and CLI client library for AWS S3, Google Cloud Storage (GCS), in-memory, and the local filesystem.
BSD 3-Clause "New" or "Revised" License
36 stars 8 forks source link

Use fasteners to lock local files #91

Closed ranlu closed 1 year ago

ranlu commented 1 year ago

This mainly helps when using multiple cloud-volume instance with local cache, without locking the cache maybe read when incomplete. Use fastener to put a shared lock when reading files and an exclusive lock when writing files. The patch creates a xxx.lock file when cloud-files try to access file xxx, in the same folder. If it does not have write permit, fallback to the no lock path, though I am not sure if it make sense to still try put files in that case. I did not try cleanup the files, since it is tricky to find out when all the locks are resolved. An alternative approach is to create lock files somewhere else, maybe some folder under CLOUD_FILES_DIR Only tested the patch under linux.

william-silversmith commented 1 year ago

This is definitely very useful. Introducing the locks in the same folder like this though as default behavior is likely not backwards compatible though as some people will be depending on operations like listing files not being dirty. Others, like @supersergiy are depending on reads being as fast as possible. Creating new lock files can also create resource contention if the number of lock files are very large or can be problematic if the directory is read-only (I do see you have a try-catch for PermissionError though, so I guess it's only a performance consideration?)

I think several of these issues can can be pretty easily resolved by putting the locks somewhere in .cloudfiles (maybe .cloudfiles/locks?). The performance issue can be solved with a flag to enable or disable this behavior.

ranlu commented 1 year ago

Thanks for the suggestion @william-silversmith, I added an environment variable CLOUD_FILES_LOCK_DIR to store the lock in a unified location. It also serves as a global switch to turn on locking for local files: if it is unset, cloud-files will not try to lock the files before accessing them. I think adding a option of the interface will be more invasive, as we need to expose the option in all the packages using cloud-files as a backend as well. To avoid name collision, I convert the abspath of the target file to the lock filename by replace '/' with '=+', hopefully few people name their files with those characters.

william-silversmith commented 1 year ago

That approach is reasonable to me. Generally, I like to provide a non-environment variable interface as well that would override the environment variable on an instance basis. Would that be possible?

Maybe locking=None (default, no locking, defer to env variable), locking=True (disregard env variable and lock in the default location of CLOUD_FILES_DIR/locks), locking=False (disregard env variable, and don't lock)? Often programs will have differing requirements depending on the phase of the computation.

ranlu commented 1 year ago

I added the locking option you suggested. I changed the behavior when locking=True a little bit: The env CLOUD_FILES_LOCK_DIR will be used for the locks and only in the case it is unset, CF fall back to CLOUD_FILES_DIR/locks. Otherwise the user lose the ability to use a customized location when they explicitly want locking, which I feel is a rather weird behavior.

ranlu commented 1 year ago

@william-silversmith, what is your suggestion to fix this test, should I add dummy locking options to S3, or a **kwargs to match the other interfaces? The latter feels like belonging to another PR.

william-silversmith commented 1 year ago

Another potential way to handle that is allow True to use the default (your idea makes sense) and if it's a string, use that as the location.

On Tue, Jul 4, 2023, 2:39 PM ranlu @.***> wrote:

I added the locking option you suggested. I changed the behavior when locking=True a little bit: The env CLOUD_FILES_LOCK_DIR will be used for the locks and only in the case it is unset, CF fall back to CLOUD_FILES_DIR/locks. Otherwise the user lose the ability to use a customized location when they explicitly want locking, which I feel is a rather weird behavior.

— Reply to this email directly, view it on GitHub https://github.com/seung-lab/cloud-files/pull/91#issuecomment-1620614688, or unsubscribe https://github.com/notifications/unsubscribe-auth/AATGQSN7OT2BOXKKEN6SH3TXORPPXANCNFSM6AAAAAAZ5E33MY . You are receiving this because you were mentioned.Message ID: @.***>

ranlu commented 1 year ago

Another potential way to handle that is allow True to use the default (your idea makes sense) and if it's a string, use that as the location.

If we want to extend the interface we probably should add another lock_dir option, locking representing a path does not sounds right.

william-silversmith commented 1 year ago

I see what you're saying about the semantics of it. I like to keep the number of arguments as small as possible, but a second argument would be acceptable. I'll take a look at the test tomorrow.

ranlu commented 1 year ago

I added a lock_dir option, the behavior now is:

  1. locking = False, no locking
  2. locking = None, enable locking if lock_dir or CLOUD_FILES_LOCK_DIR is set. lock_dir takes priority
  3. locking = True, alway use locking, if neither lock_dir or CLOUD_FILES_LOCK_DIR is set, use CLOUD_FILES_DIR/locks.

The tests work now after #94

ranlu commented 1 year ago

@william-silversmith, Thanks for the review, I factored out the locking code into a io_with_lock function, also append the crc32 of the abspath to the filename to make it hopefully unique.

william-silversmith commented 1 year ago

nice solution!