@bess When does Step #9 in https://github.com/pulibrary/rdss-handbook/blob/main/globus.md#making-a-new-s3-bucket happen? That is, should it happen before the migration, or are we migrating the files one at a time?
@carolyncole Great question. I have gone back and forth on this. When I wrote that guide to syncing S3 buckets, I was picturing that we would make a new Globus service, spin up an empty S3 bucket for it, and then sync the data. However, the more I understand about our projects and the state of the legacy data, the more I lean toward migrating the files one by one. The legacy Globus S3 bucket contains lots of empty folders named with DOIs, which is going to make it very hard to tell when we've correctly migrated any given work. I now believe that a better workflow would be, for each object:
1. Describe the record in pdc_describe
2. Submit the data payload via pdc_describe:
   a. via the form for objects < 100 MB
   b. via direct upload through Globus or S3 for big data
3. A curator checks the DataCite record that we produce and refines it as necessary
4. Upon approval, the data payload is moved to the new S3 bucket and deleted from the legacy location (see the sketch below)
   a. This will also require indexing into PDC Discovery and updating the download link there
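To make step 4 concrete, here is a minimal sketch of the copy-then-delete that the migration step could perform with the aws-sdk-s3 gem. The bucket names, region, and DOI prefix are placeholders rather than our real configuration, and the eventual implementation in pdc_describe may look quite different:

```ruby
require "aws-sdk-s3"

# Hypothetical sketch: move one approved work's payload out of the legacy
# Globus bucket into the new bucket. Bucket names, region, and the DOI-based
# prefix are placeholders; pagination and error handling are omitted for brevity.
LEGACY_BUCKET = "example-legacy-globus-bucket"
NEW_BUCKET    = "example-pdc-describe-bucket"

def migrate_payload(doi_prefix, client: Aws::S3::Client.new(region: "us-east-1"))
  # List every object under the work's DOI "folder" in the legacy bucket.
  # An empty DOI folder simply yields no objects here.
  client.list_objects_v2(bucket: LEGACY_BUCKET, prefix: doi_prefix).contents.each do |object|
    # Server-side copy into the new bucket, keeping the same key
    client.copy_object(
      bucket: NEW_BUCKET,
      copy_source: "#{LEGACY_BUCKET}/#{object.key}",
      key: object.key
    )
    # Remove the original only after the copy succeeds
    client.delete_object(bucket: LEGACY_BUCKET, key: object.key)
  end
end

# migrate_payload("10.00000/example-doi/")
```

The copy-then-delete ordering is deliberate: if the copy raises, nothing has been removed from the legacy bucket, so the work can simply be retried.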
QUESTION: Should there also be a step where we manually delete the migrated object from DataSpace? I'd like to, if possible, because it will make it clearer that PDC Describe holds the canonical copy, and it would prevent potential snarls where we've already migrated a record, someone then goes in and makes changes to the DataSpace record, and those changes aren't persisted to PDC Describe. However, it's hard to advocate for deleting a record from DataSpace before we have very good assurance that PDC Describe is a fully operational production system with reliable backup and restore. So maybe that's a thing to prioritize?
While we were cleaning out staging, we used production to test our new rake tasks in #707.
We deleted all Works (datasets) and their data from S3.
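For context, that kind of cleanup could look roughly like the rake task sketched below. This is not the actual task from #707; the task name, bucket name, and the assumption of a Work ActiveRecord model are all placeholders:

```ruby
# lib/tasks/cleanup.rake -- hypothetical sketch, not the actual task from #707
require "aws-sdk-s3"

namespace :works do
  desc "DESTRUCTIVE: delete every Work record and every object in the payload bucket"
  task delete_all: :environment do
    client = Aws::S3::Client.new(region: "us-east-1")
    bucket = "example-pdc-describe-bucket" # placeholder; the real name comes from config

    # Drain the bucket, following continuation tokens until nothing is left
    token = nil
    loop do
      params = { bucket: bucket }
      params[:continuation_token] = token if token
      resp = client.list_objects_v2(params)
      resp.contents.each { |object| client.delete_object(bucket: bucket, key: object.key) }
      break unless resp.is_truncated
      token = resp.next_continuation_token
    end

    # Remove the corresponding database records (assumes a Work ActiveRecord model)
    Work.destroy_all
  end
end
```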
Switching the DOI configuration over to production can be accomplished by modifying this file: https://github.com/pulibrary/princeton_ansible/blob/main/group_vars/pdc_describe/production.yml#L41-L50
We are going to wait until a round of test migration has occurred before making this change.
Once the data migration starts, we want to be very clear that the data in production is our actual production data.
Blocked until we're happy with the migration specs in staging.