seung-lab / cloud-files

Threaded Python and CLI client library for AWS S3, Google Cloud Storage (GCS), in-memory, and the local filesystem.
BSD 3-Clause "New" or "Revised" License

feat: resumable transfers #81

Closed william-silversmith closed 1 year ago

william-silversmith commented 1 year ago

Resolves #80

Adds ResumableTransfer class and cloudfiles xfer CLI subgroup.

The transfer works by first loading a sqlite database with filenames, a "done" flag, and a lease time for each file. Clients can then attach to the database and execute the transfer in batches. When multiple clients are used, a lease time must be set so that the database does not return the same set of files to each client. The lease also keeps the transfer robust, since files whose lease expires without being marked done can be handed out again.
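To make the leasing idea concrete, here is a minimal sketch using only Python's standard sqlite3 module. It is not the code in this PR: the table name `xfer`, its columns, and the helpers `create_db`, `lease_batch`, and `mark_done` are all hypothetical, and the actual ResumableTransfer schema may differ. It only illustrates how a lease timestamp can keep two clients from receiving the same batch.

```python
import sqlite3
import time


def now_msec():
    """Current wall-clock time in milliseconds."""
    return int(time.time() * 1000)


def create_db(path, filenames):
    """Create the (hypothetical) transfer table and load the file list."""
    # isolation_level=None puts the connection in autocommit mode so that
    # transactions are controlled explicitly with BEGIN / COMMIT below.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS xfer (
            filename TEXT PRIMARY KEY,
            done     INTEGER NOT NULL DEFAULT 0,
            lease    INTEGER NOT NULL DEFAULT 0  -- expiry time in msec
        )
        """
    )
    conn.execute("BEGIN")
    conn.executemany(
        "INSERT OR IGNORE INTO xfer (filename) VALUES (?)",
        [(f,) for f in filenames],
    )
    conn.execute("COMMIT")
    return conn


def lease_batch(conn, batch_size, lease_msec, now=None):
    """Atomically claim up to batch_size unfinished, unleased files."""
    now = now_msec() if now is None else now
    # BEGIN IMMEDIATE takes the write lock up front, so the SELECT and the
    # UPDATE cannot interleave with another client's claim.
    conn.execute("BEGIN IMMEDIATE")
    try:
        rows = conn.execute(
            "SELECT filename FROM xfer "
            "WHERE done = 0 AND lease <= ? "
            "ORDER BY filename LIMIT ?",
            (now, batch_size),
        ).fetchall()
        names = [row[0] for row in rows]
        conn.executemany(
            "UPDATE xfer SET lease = ? WHERE filename = ?",
            [(now + lease_msec, name) for name in names],
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return names


def mark_done(conn, names):
    """Flag files as transferred so they are never handed out again."""
    conn.execute("BEGIN IMMEDIATE")
    conn.executemany(
        "UPDATE xfer SET done = 1 WHERE filename = ?",
        [(name,) for name in names],
    )
    conn.execute("COMMIT")


# Demo with explicit timestamps so the behavior is deterministic.
demo = create_db(":memory:", ["a", "b", "c", "d"])
print(lease_batch(demo, 2, 30000, now=0))      # ['a', 'b']
print(lease_batch(demo, 2, 30000, now=1))      # ['c', 'd']
print(lease_batch(demo, 2, 30000, now=2))      # []
mark_done(demo, ["a", "b"])
print(lease_batch(demo, 4, 30000, now=40000))  # ['c', 'd']
```

In the demo, the second call skips "a" and "b" because their lease has not expired. The last call returns "c" and "d" again because their lease ran out without them being marked done, which is the recovery path a crashed client would rely on.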

To use with a single client:

from cloudfiles import ResumableTransfer

rt = ResumableTransfer("DATABASE_NAME.db") # doesn't have to exist 
rt.init(SOURCE_PATH, DEST_PATH, [ENCODING])
rt.execute(progress=True)
rt.close() # deletes database

Or, from the CLI:

cloudfiles xfer init SOURCE DEST --db DATABASE_NAME.db
cloudfiles xfer execute DATABASE_NAME.db

For use with multiple clients, after initialization each client should call only rt.execute(lease_msec=30000) or cloudfiles xfer execute DATABASE_NAME.db --lease-msec 30000.

Note that currently, this implementation only loads the results of cf.list() on the source path into the database. A more customizable method could be introduced. You can also call rt.insert to add paths yourself, but make sure that these paths are accessible from the source path.