irods / irods_capability_automated_ingest


Implement recursive folder support for S3 bucket syncs #284

Open alanking opened 3 weeks ago

alanking commented 3 weeks ago

Currently, every S3 bucket sync treats the entire bucket as a flat directory. While that matches how S3 stores objects (keys in a flat namespace), treating "/" in object names as a delimiter between "sub-folders" could massively improve performance. The Minio.list_objects call in the S3 bucket task specifies recursive=True: https://github.com/irods/irods_capability_automated_ingest/blob/ec34cb160e55b3d479c3a9796e5118721757f451/irods_capability_automated_ingest/tasks/s3_bucket_sync.py#L122

This should probably be False, but that would require a lot of other changes.
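
For illustration only, here is a rough sketch of what a non-recursive walk could look like. It is not the project's code: `walk_bucket` is a hypothetical helper, and it assumes only that the client's `list_objects(bucket_name, prefix=..., recursive=False)` yields entries exposing `object_name` and `is_dir`, as the Minio Python client's objects do. Each "/"-delimited prefix is listed on its own, so a single "folder" could in principle be handed to a separate task.

```python
def walk_bucket(client, bucket_name, prefix=""):
    """Yield (prefix, object_names) for one "/"-delimited folder at a time.

    Hypothetical helper, not part of the ingest tool. `client` is assumed to
    behave like minio.Minio: list_objects(..., recursive=False) returns
    entries with `object_name` and `is_dir` attributes.
    """
    pending = [prefix]  # prefixes ("sub-folders") still to be listed
    while pending:
        current = pending.pop()
        names = []
        for obj in client.list_objects(bucket_name, prefix=current, recursive=False):
            if obj.object_name == current:
                # A zero-byte "folder marker" key equal to the prefix itself;
                # skipping it avoids listing the same prefix again.
                continue
            if obj.is_dir:
                pending.append(obj.object_name)  # descend into it later
            else:
                names.append(obj.object_name)
        yield current, names
```

With a real client this might be driven as `for folder, names in walk_bucket(client, "some-bucket"): ...`, where `client` is an already configured `Minio` instance and the bucket name is a placeholder. Only one folder's listing is held in memory at a time, at the cost of one listing request per prefix.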

Additionally, this would greatly benefit a potential implementation of #282. As it stands, that implementation requires a query that holds all of the data objects under the target collection. The entire S3 bucket listing would then be held in memory (possibly; this depends on the implementation of Minio.list_objects), along with the entire contents of the target collection, which could be very large.