Closed by anjackson 1 year ago
This script performs the basic listing: https://github.com/ukwa/ukwa-manage/blob/master/lib/store/aws_s3_lister.py
The question is how to use this, and integrate with TrackDB/something else, to track items and build reports.
Note that the Python script could be replaced with https://rclone.org/commands/rclone_lsjson/ (or at least make the output consistent!)
Note also that there is another approach to this: https://alexwlchan.net/2023/s3-bucket-inventory/
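To make the "consistent output" point concrete, here is a minimal sketch of how `rclone lsjson` output could be mapped onto one record per file. `rclone lsjson` emits a JSON array whose entries include `Path`, `Size` and `IsDir`. The record field names (`id`, `path`, `size`) and the function name are assumptions for illustration, not the TrackDB schema.

```python
import json


def rclone_entries_to_records(lsjson_text, bucket):
    """Convert `rclone lsjson` output (a JSON array) into flat per-file records.

    The output field names are hypothetical; adjust them to whatever
    TrackDB actually expects.
    """
    records = []
    for entry in json.loads(lsjson_text):
        # Skip directory placeholders; only files are reported.
        if entry.get("IsDir"):
            continue
        records.append({
            "id": f"s3a://{bucket}/{entry['Path']}",
            "path": entry["Path"],
            "size": entry["Size"],
        })
    return records
```

The input could come from something like `rclone lsjson -R remote:bucket`, where `remote` is whatever name the S3 remote has in the local rclone config.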
Implemented a direct lister using boto3, which has been deployed for some time.
For reporting, we need a script that lists the DC buckets on AWS and, for every file, reports at least the full path and the file size. It should output lines of JSON, with an ID like
s3a://<bucket>/<dir>/<file>
so that it is compatible with TrackDB. Pretty-printed, it might be something like:
Compared with the ones from HDFS:
Any old script anywhere is a good start, but later it should be integrated into this repository, so it can be run regularly from Airflow.
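As a starting point, the requirement above could be sketched with boto3 roughly as follows. This is not the deployed `aws_s3_lister.py`, just a minimal illustration: the record field names (`id`, `path`, `size`, `modified_at`) and the function names are assumptions, and credentials are assumed to come from the usual boto3 configuration.

```python
import json
import sys


def s3_object_to_record(bucket, obj):
    """Map one entry of a ListObjectsV2 'Contents' list to a flat record.

    Field names are hypothetical, not the actual TrackDB schema.
    """
    record = {
        "id": f"s3a://{bucket}/{obj['Key']}",
        "path": obj["Key"],
        "size": obj["Size"],
    }
    modified = obj.get("LastModified")
    if modified is not None:
        record["modified_at"] = modified.isoformat()
    return record


def list_bucket(bucket, prefix="", client=None):
    """Yield one record per object in the bucket, following pagination."""
    if client is None:
        import boto3  # imported lazily so the mapping can be used without AWS access
        client = boto3.client("s3")
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Empty buckets or prefixes return pages without a 'Contents' key.
        for obj in page.get("Contents", []):
            yield s3_object_to_record(bucket, obj)


def main(argv):
    # Bucket names are given on the command line; output is one JSON object per line.
    for bucket in argv[1:]:
        for record in list_bucket(bucket):
            print(json.dumps(record))


if __name__ == "__main__":
    main(sys.argv)
```

Writing JSON Lines to stdout keeps the script easy to wrap in an Airflow task later, since the output can be redirected to a file or piped into whatever loads TrackDB.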