@bess When does Step #9 in https://github.com/pulibrary/rdss-handbook/blob/main/globus.md#making-a-new-s3-bucket happen? That is, should it happen before the migration, or are we migrating the files one at a time?
@carolyncole Great question. I have gone back and forth on this. When I wrote that guide to syncing S3 buckets, I was picturing that we would make a new Globus service, spin up an empty S3 bucket for it, and then sync the data. However, the more I understand about our projects and the state of the legacy data, the more I lean toward migrating the files one by one. The legacy Globus S3 bucket contains lots of empty folders named with DOIs, which is going to make it very hard to tell when we've correctly migrated any given work. I now believe that a better workflow would be, for each object:
1. Describe the record in pdc_describe
2. Submit the data payload via pdc_describe:
   a. via the form for objects < 100 MB
   b. via direct upload through Globus or S3 for big data
3. A curator checks the DataCite record that we produce and refines it as necessary
4. Upon approval, the data payload is moved to the new S3 bucket and deleted from the legacy location (see the sketch below)
   a. This will also require indexing into PDC Discovery and updating the download link there
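To make step 4 concrete, here is a minimal sketch of the copy-then-delete that the migration step could perform with the aws-sdk-s3 gem. The bucket names, region, and DOI prefix are placeholders rather than our real configuration, and the eventual implementation in pdc_describe may look quite different:

```ruby
require "aws-sdk-s3"

# Hypothetical sketch: move one approved work's payload out of the legacy
# Globus bucket into the new bucket. Bucket names, region, and the DOI-based
# prefix are placeholders; pagination and error handling are omitted for brevity.
LEGACY_BUCKET = "example-legacy-globus-bucket"
NEW_BUCKET    = "example-pdc-describe-bucket"

def migrate_payload(doi_prefix, client: Aws::S3::Client.new(region: "us-east-1"))
  # List every object under the work's DOI "folder" in the legacy bucket.
  # An empty DOI folder simply yields no objects here.
  client.list_objects_v2(bucket: LEGACY_BUCKET, prefix: doi_prefix).contents.each do |object|
    # Server-side copy into the new bucket, keeping the same key
    client.copy_object(
      bucket: NEW_BUCKET,
      copy_source: "#{LEGACY_BUCKET}/#{object.key}",
      key: object.key
    )
    # Remove the original only after the copy succeeds
    client.delete_object(bucket: LEGACY_BUCKET, key: object.key)
  end
end

# migrate_payload("10.00000/example-doi/")
```

The copy-then-delete ordering is deliberate: if the copy raises, nothing has been removed from the legacy bucket, so the work can simply be retried.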
QUESTION: Should there also be a step where we manually delete the migrated object from DataSpace? I'd like to, if possible, because it will make it clearer that PDC Describe holds the canonical copy, and it would prevent potential snarls where we've already migrated a record, someone then goes in and makes changes to the DataSpace record, and those changes aren't persisted to PDC Describe. However, it's hard to advocate for deleting a record from DataSpace before we have very good assurance that PDC Describe is a fully operational production system with reliable backup and restore. So maybe that's a thing to prioritize?
While we were cleaning out staging, we used production to test our new rake tasks in #707.
We deleted all Works (datasets) and their data from S3.
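For context, that kind of cleanup could look roughly like the rake task sketched below. This is not the actual task from #707; the task name, bucket name, and the assumption of a Work ActiveRecord model are all placeholders:

```ruby
# lib/tasks/cleanup.rake -- hypothetical sketch, not the actual task from #707
require "aws-sdk-s3"

namespace :works do
  desc "DESTRUCTIVE: delete every Work record and every object in the payload bucket"
  task delete_all: :environment do
    client = Aws::S3::Client.new(region: "us-east-1")
    bucket = "example-pdc-describe-bucket" # placeholder; the real name comes from config

    # Drain the bucket, following continuation tokens until nothing is left
    token = nil
    loop do
      params = { bucket: bucket }
      params[:continuation_token] = token if token
      resp = client.list_objects_v2(params)
      resp.contents.each { |object| client.delete_object(bucket: bucket, key: object.key) }
      break unless resp.is_truncated
      token = resp.next_continuation_token
    end

    # Remove the corresponding database records (assumes a Work ActiveRecord model)
    Work.destroy_all
  end
end
```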
Switching the DOI configuration over to production can be accomplished by modifying this file: https://github.com/pulibrary/princeton_ansible/blob/main/group_vars/pdc_describe/production.yml#L41-L50
We are going to wait until a round of test migration has occurred before making this change.
Once the data migration starts, we want to be very clear that the data in production is our actual production data.
Blocked until we're happy with the migration specs in staging.