liormizr / s3path

s3path is a pathlib extension for AWS S3 Service
Apache License 2.0
206 stars 39 forks source link

Fix #97: Concurrency Issue #102

Closed sbrandtb closed 2 years ago

sbrandtb commented 2 years ago

The issue was there all the time, but became critical in 9950dd05 with version 0.3.3.

class _S3ConfigurationMap is not thread safe because both set/get_configuration() ask for self.arguments and self.resources being None. However, since the configuration map is a singleton, when using the library in a multithreaded environment, it can happen that half the part of _initial_setup has run and self.arguments is indeed not None, but self.resources is still None because in another thread, creation of the boto3 resource is still ongoing. Then, our thread would not call _initial_setup(), but still die on self.resources being None when trying to access it, leading to the error described in #97.

I could not make the bug appear without introducing the IO delay by creating the resource in _initial_setup(). Maybe it is a CPython implementation detail that makes _initial_setup() atomic or incredibly rare to not be atomic without the creation of the boto resource.

Anyways, the setup is now synchronised with a lock, which is required, again, given that _S3ConfigurationMap is a singleton.

Here is the code that reliably triggers the bug without the change:

from multiprocessing.pool import ThreadPool

from s3path import S3Path

def do(i):
    dst = S3Path.from_uri('s3://something') / f'hello_{i}.txt'
    dst.write_bytes(b'hello')

ThreadPool(processes=20).map(do, range(100))
liormizr commented 2 years ago

add tests