seung-lab / cloud-files

Threaded Python and CLI client library for AWS S3, Google Cloud Storage (GCS), in-memory, and the local filesystem.
BSD 3-Clause "New" or "Revised" License

Global connection pools should be guarded by mutexes #98

Closed ranlu closed 4 months ago

ranlu commented 10 months ago

The connection pools (GC_POOL, S3_POOL, MEM_POOL) are created the first time the interfaces are initialized. However, up to num_threads instances may try to create the same connection pool during initialization. For example, when downloading multiple files from GCS, all the GoogleCloudStorageInterface instances created by schedule_jobs reach this line (https://github.com/seung-lab/cloud-files/blob/master/cloudfiles/interfaces.py#L503) at the same time, and all of them find an empty dictionary. Because there is no mutex, each of them tries to create its own connection pool. In the end, num_threads GCloudBucketPool objects are created, one is picked for the global connection pool, and the other num_threads-1 are immediately destroyed.
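The race can be reproduced in isolation. This is a minimal sketch, not the cloud-files code: `FakePool`, `POOLS`, and `init_interface` are hypothetical stand-ins, and the `threading.Barrier` is there only to force the worst-case interleaving, in which every thread checks the dictionary before any thread writes to it.

```python
import threading

NUM_THREADS = 20  # matches the default num_threads mentioned in this issue


class FakePool:
    """Hypothetical stand-in for a connection pool; counts constructions."""
    created = 0
    _count_lock = threading.Lock()

    def __init__(self):
        with FakePool._count_lock:
            FakePool.created += 1


POOLS = {}  # hypothetical global registry, keyed by bucket name
barrier = threading.Barrier(NUM_THREADS)


def init_interface(bucket):
    # Unguarded check-then-create, as described above.
    missing = bucket not in POOLS   # every thread sees an empty dictionary
    barrier.wait()                  # wait until all threads have checked
    if missing:
        POOLS[bucket] = FakePool()  # each thread builds its own pool


threads = [
    threading.Thread(target=init_interface, args=("bucket",))
    for _ in range(NUM_THREADS)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# One pool per thread was built, but only one survives in the registry.
print(FakePool.created, len(POOLS))
```

In real code the interleaving depends on scheduling, so fewer than num_threads pools may be built on a given run; the barrier simply makes the worst case repeatable.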

Creating and destroying connection pools seems to be expensive, even for empty ones. With num_threads set to 20 by default, the delay is around 1-2 seconds, which is significant for short, bursty tasks. The easiest fix is to have a global mutex alongside the connection pools so we can lock it to guard against wasteful competition between threads.
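A minimal sketch of the proposed fix, assuming a single module-level lock shared by all pools. The names `FakePool`, `POOLS`, `POOLS_LOCK`, and `get_pool` are hypothetical and do not reflect how the library was actually patched.

```python
import threading
import time

NUM_THREADS = 20


class FakePool:
    """Hypothetical stand-in for an expensive-to-build connection pool."""
    created = 0

    def __init__(self):
        time.sleep(0.01)       # pretend construction is slow
        FakePool.created += 1  # only ever runs while POOLS_LOCK is held


POOLS = {}                     # hypothetical global registry
POOLS_LOCK = threading.Lock()  # global mutex alongside the pools


def get_pool(bucket):
    # Check and create under the same lock so only one thread builds the pool.
    with POOLS_LOCK:
        pool = POOLS.get(bucket)
        if pool is None:
            pool = POOLS[bucket] = FakePool()
        return pool


def run(num_threads=NUM_THREADS):
    results = []

    def worker():
        results.append(get_pool("bucket"))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


pools = run()
print(FakePool.created)  # 1: every thread received the same pool
```

Holding the lock for the whole lookup keeps the sketch simple. Since the critical section is a dictionary lookup on every call after the first, contention should be small compared with building and discarding num_threads-1 pools.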

william-silversmith commented 10 months ago

this is a good tip. thank you.

william-silversmith commented 4 months ago

Resolved in e4979ee825b7fe34ad3a2eebef17c8fe7cf314a3