pangeo-cmip6 / sync

Workflows to keep CMIP6 data synchronized between GCS and S3 storage

S3 cleanup #3

Closed naomi-henderson closed 3 years ago

naomi-henderson commented 3 years ago

@charlesbluca, there are a few old catalog files which need to be deleted in S3 when you get a chance:

s3://cmip6-pds/cmip6.csv
s3://cmip6-pds/pangeo-cmip6-testing.csv
s3://cmip6-pds/pangeo-cmip6-testing.json

Also, a quick question: do the rclone scripts just copy all of the new Zarr stores to AWS without removing the old ones? If they don't remove them, we need to remember to do that ourselves. For example, gs://cmip6/ScenarioMIP has been renamed to gs://cmip6/CMIP6/ScenarioMIP, so once that has been copied to AWS, all stores in s3://cmip6-pds/ScenarioMIP/ could be removed. But no rush - maybe we can find someone new to take care of this by then ...

charlesbluca commented 3 years ago

The rclone scripts should be handling deletes - I think the issue might be that my job for handling catalogs uses copy instead of sync to move them into the bucket, so it doesn't check for catalogs that shouldn't be there (probably better this way, as if we tried sync without the proper exclusion rules it would start deleting the majority of the data in the buckets!).

For the Zarr stores, this shouldn't be a problem, as those are definitely handled with a sync command - one thing to note is that Rclone waits until all new files are copied before the deletions happen, so if one job is clogged up with terabytes of new files to copy to the destination and continuously times out, it may never get the chance to delete old files...
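To make the copy-versus-sync distinction concrete, here is a toy model in plain Python. It is only a sketch of the semantics described above, not rclone's implementation: buckets are modeled as dicts, and the object names `catalog.csv` and `CMIP6/ScenarioMIP/store.zarr/.zmetadata` as well as the `is_catalog` scope rule are hypothetical.

```python
def copy(src: dict, dst: dict) -> dict:
    """Model of a copy job: add or update objects, never delete anything."""
    out = dict(dst)
    out.update(src)
    return out


def sync(src: dict, dst: dict, in_scope=lambda key: True) -> dict:
    """Model of a sync job: make the in-scope destination keys match the source.

    Destination keys outside `in_scope` are left untouched.
    """
    out = {k: v for k, v in dst.items() if not in_scope(k)}
    out.update({k: v for k, v in src.items() if in_scope(k)})
    return out


def is_catalog(key: str) -> bool:
    """Hypothetical scope rule: top-level .csv/.json objects only."""
    return "/" not in key and key.endswith((".csv", ".json"))


# Hypothetical contents: the source holds only the current catalog.
source = {"catalog.csv": "v2"}
bucket = {
    "catalog.csv": "v1",
    "cmip6.csv": "old",
    "CMIP6/ScenarioMIP/store.zarr/.zmetadata": "data",
}

after_copy = copy(source, bucket)                     # stale cmip6.csv survives
after_unscoped_sync = sync(source, bucket)            # Zarr data is wiped out
after_scoped_sync = sync(source, bucket, is_catalog)  # only stale catalogs go
```

In this model the copy leaves `cmip6.csv` behind, the unscoped sync deletes the Zarr object along with it, and the scoped sync removes only the stale catalog.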

My thought is that once we properly change the prefixes for all the jobs, so that the CMIP6 prefix is distributed across several runners instead of just one, the remainder job should be able to handle any old misplaced Zarr stores, as there is usually very little "new" data outside of the CMIP6/ prefix.

In this case, I'll delete the old catalogs from S3 and research exclusion rules to see if there's a way to "sync" the catalogs (copy and delete if necessary) instead of just copying them.
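One possible shape for such a catalog "sync" is sketched below as a small Python helper that only builds the rclone command line. The remote names `gcs:` and `s3:` and the helper `catalog_sync_command` are assumptions, not taken from the actual scripts. The idea relies on rclone's documented filter behavior: a pattern with a leading `/` is anchored at the root of the remote, and destination files excluded by the filters are not deleted unless `--delete-excluded` is passed. It defaults to `--dry-run` so the effect can be checked before anything is removed.

```python
import shlex


def catalog_sync_command(src: str, dst: str, dry_run: bool = True) -> list[str]:
    """Build an rclone sync command limited to top-level .csv/.json catalogs.

    `src` and `dst` are rclone remote paths; the remote names used below
    are hypothetical and depend on the local rclone configuration.
    """
    cmd = [
        "rclone", "sync", src, dst,
        "--include", "/*.csv",   # leading "/" anchors the pattern at the root
        "--include", "/*.json",
    ]
    if dry_run:
        cmd.append("--dry-run")  # report what would change without doing it
    return cmd


# Assumed remote names; the bucket names come from this thread.
cmd = catalog_sync_command("gcs:cmip6", "s3:cmip6-pds")
print(shlex.join(cmd))
```

Running the printed command with `--dry-run` first would show whether anything other than the stale catalogs is scheduled for deletion.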

naomi-henderson commented 3 years ago

Thanks @charlesbluca! I have modified the scripts, so let's give it a try?

naomi-henderson commented 3 years ago

I think we could probably just delete the extra prefixes (sp?) as they finish being moved into the CMIP6 prefix. Note that the DCPP prefix is NOT being moved into CMIP6 since I have not yet decided what to do about these many, many small zarr stores ...

naomi-henderson commented 3 years ago

@charlesbluca , when you get a chance, these files should be removed:

s3://cmip6-pds/cmip6.csv
s3://cmip6-pds/pangeo-cmip6-testing.csv
s3://cmip6-pds/pangeo-cmip6-testing.json

Perhaps I should read up on the AWS CLI (at least the S3 commands) and get the credentials from you?
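For reference, a minimal sketch of what that could look like with the AWS CLI, written as a Python helper that only builds the commands. It assumes the CLI is installed and configured with credentials that allow deletes on the bucket; the helper name `removal_commands` is hypothetical. `aws s3 rm` accepts `--dryrun`, which prints the operation without performing it.

```python
import shlex

# The three stale catalog files listed above.
STALE_CATALOGS = [
    "s3://cmip6-pds/cmip6.csv",
    "s3://cmip6-pds/pangeo-cmip6-testing.csv",
    "s3://cmip6-pds/pangeo-cmip6-testing.json",
]


def removal_commands(uris: list[str], dry_run: bool = True) -> list[list[str]]:
    """Build one `aws s3 rm` command per object, as a dry run by default."""
    extra = ["--dryrun"] if dry_run else []
    return [["aws", "s3", "rm", uri, *extra] for uri in uris]


commands = removal_commands(STALE_CATALOGS)
for command in commands:
    print(shlex.join(command))
```

Dropping `--dryrun` (that is, calling the helper with `dry_run=False`) yields the commands that would actually delete the objects.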

naomi-henderson commented 3 years ago

thanks, @charlesbluca , looks like we can close this now!