ukwa / ukwa-manage

Shepherding our web archives from crawl to access.
Apache License 2.0

Write tools to list DC buckets and contents from AWS #97

Closed by anjackson 1 year ago

anjackson commented 2 years ago

For reporting, we need a script that lists the DC buckets on AWS and, for every file in them, reports at least its full path and size. It should output one JSON object per line, each with an ID of the form s3a://<bucket>/<dir>/<file>, so that the output is compatible with TrackDB.

Pretty-printed, it might be something like:

{
  "id": "s3a://<bucket>/<dir>/<file>",
  "refresh_date_dt": "2022-05-05T00:31:34.752Z",
  "file_path_s": "<dir>/<file>",
  "file_size_l": "<size-in-bytes>",
  "modified_at_dt": "2016-05-24T09:56:00.000Z"
}

Compare this with the records we get from HDFS:

{
  "id": "hdfs://hdfs:54310/0_original/fc/crawler03/heritrix/output/images/1365759097.jpg",
  "refresh_date_dt": "2022-05-05T00:31:34.752Z",
  "file_path_s": "/0_original/fc/crawler03/heritrix/output/images/1365759097.jpg",
  "file_size_l": "154155",
  "file_ext_s": ".jpg",
  "file_name_s": "1365759097.jpg",
  "permissions_s": "-rw-r--r--",
  "hdfs_replicas_i": "3",
  "hdfs_user_s": "hdfs",
  "hdfs_group_s": "supergroup",
  "modified_at_dt": "2016-05-24T09:56:00.000Z",
  "timestamp_dt": "2016-05-24T09:56:00.000Z",
  "year_i": "2016",
  "recognised_b": "False",
  "kind_s": "unknown",
  "collection_s": "0_original",
  "stream_s": "None",
  "job_s": "None",
  "layout_s": "None",
  "hdfs_service_id_s": "h020",
  "hdfs_type_s": "file",
  "access_url_s": "http://hdfs.api.wa.bl.uk/webhdfs/v1/0_original/fc/crawler03/heritrix/output/images/1365759097.jpg?op=OPEN&user.name=access"
}

Any old script anywhere is a good start, but later it should be integrated into this repository, so it can be run regularly from Airflow.

anjackson commented 1 year ago

This script performs the basic listing: https://github.com/ukwa/ukwa-manage/blob/master/lib/store/aws_s3_lister.py

The question is how to use this, and integrate with TrackDB/something else, to track items and build reports.

anjackson commented 1 year ago

Note that the Python script could be replaced with https://rclone.org/commands/rclone_lsjson/ (or we could at least make its output consistent with that!)
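
To make the rclone option concrete, here is a rough sketch of how `rclone lsjson` output might be mapped onto the record shape proposed at the top of this issue. It relies on the `Path`, `Size`, `ModTime` and `IsDir` fields that `rclone lsjson` emits; the function names, the bucket argument and passing `ModTime` through unchanged are all assumptions for illustration, not anything taken from this repository.

```python
import json
from datetime import datetime, timezone


def now_as_solr_date():
    """Current UTC time formatted like 2022-05-05T00:31:34.752Z."""
    now = datetime.now(timezone.utc)
    return now.isoformat(timespec="milliseconds").replace("+00:00", "Z")


def rclone_item_to_record(item, bucket, refresh_date):
    """Map one `rclone lsjson` entry onto the fields proposed in this issue."""
    path = item["Path"]
    return {
        "id": f"s3a://{bucket}/{path}",
        "refresh_date_dt": refresh_date,
        "file_path_s": path,
        "file_size_l": str(item["Size"]),
        # Assumption: rclone's RFC 3339 ModTime is passed through unchanged.
        "modified_at_dt": item["ModTime"],
    }


def convert_lsjson(lsjson_text, bucket, refresh_date=None):
    """Yield one JSON line per file found in the output of `rclone lsjson`."""
    refresh_date = refresh_date or now_as_solr_date()
    for item in json.loads(lsjson_text):
        if item.get("IsDir"):
            continue  # directories are not files, so they get no record
        yield json.dumps(rclone_item_to_record(item, bucket, refresh_date))
```

The input could come from something like `rclone lsjson -R --files-only <remote>:<bucket>`, where the remote name depends on the local rclone configuration.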

anjackson commented 1 year ago

Note also that there is another approach to this: https://alexwlchan.net/2023/s3-bucket-inventory/

anjackson commented 1 year ago

Implemented a direct lister using boto3, which has been deployed for some time.

https://github.com/ukwa/ukwa-manage/blob/e389424cf3017efb3269fd4aae64a0ba3d2e120f/lib/filedb/aws_s3_lister.py#L52-L81
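
For anyone reading without the repository to hand, a minimal sketch of what a boto3-based lister along these lines could look like. This is not the code linked above: the function names are hypothetical, it only emits the five fields from the original example, and it handles a single named bucket rather than discovering the DC buckets. It uses the `list_objects_v2` paginator so that buckets with more than 1000 objects are fully listed.

```python
import json
from datetime import datetime, timezone


def to_solr_date(dt):
    """Format an aware datetime like 2016-05-24T09:56:00.000Z."""
    utc = dt.astimezone(timezone.utc)
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")


def s3_object_to_record(bucket, obj, refresh_date):
    """Build one record from an entry in a list_objects_v2 'Contents' list."""
    key = obj["Key"]
    return {
        "id": f"s3a://{bucket}/{key}",
        "refresh_date_dt": refresh_date,
        "file_path_s": key,
        "file_size_l": str(obj["Size"]),
        "modified_at_dt": to_solr_date(obj["LastModified"]),
    }


def list_bucket(bucket, client=None):
    """Yield one JSON line per object in the bucket."""
    if client is None:
        # Imported lazily so the formatting helpers can be used without boto3.
        import boto3
        client = boto3.client("s3")
    refresh_date = to_solr_date(datetime.now(timezone.utc))
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # An empty bucket returns a page with no 'Contents' key.
        for obj in page.get("Contents", []):
            yield json.dumps(s3_object_to_record(bucket, obj, refresh_date))
```

Printing each yielded line gives JSON Lines output, e.g. `for line in list_bucket("<bucket>"): print(line)`, with credentials picked up from the usual boto3 configuration.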