mc2-center / mc2-center-dcc

Data coordination resources for CCKP (and MC2 in general)
Perform second round of backpopulation using updated CCKP database #58

Closed Bankso closed 2 months ago

Bankso commented 6 months ago

Since the CCKP staging union tables were qc'd and harmonized during our database update work, we should take the current database and re-backpopulate, to ensure parity between displayed and stored metadata at all database levels.

Process: Download CCKP database tables as CSV --> extract text from CSV using regex --> map all metadata onto current template --> apply processing, validation, upload pipeline --> done!
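A minimal sketch of the "extract text with regex, then map onto the template" steps, for illustration only. The column names in `COLUMN_MAP`, the assumed list-like cell format (`["a", "b"]`), and both helper functions are hypothetical and not taken from the actual CCKP tables or scripts.

```python
import re

# Hypothetical mapping from database column names to template column names.
COLUMN_MAP = {
    "pubMedId": "Pubmed Id",
    "title": "Publication Title",
    "theme": "Publication Theme",
}

# Assumption: list-like values are stored as '["a", "b"]' in the CSV export.
LIST_ITEM = re.compile(r'"([^"]*)"')


def extract_text(cell):
    """Flatten a list-like cell into a comma-separated string; trim other cells."""
    cell = cell.strip()
    if cell.startswith("[") and cell.endswith("]"):
        return ", ".join(LIST_ITEM.findall(cell))
    return cell


def map_to_template(rows):
    """Rename known columns to template names and drop columns not in the map."""
    return [
        {COLUMN_MAP[name]: extract_text(value)
         for name, value in row.items() if name in COLUMN_MAP}
        for row in rows
    ]
```

Rows in this shape are what `csv.DictReader` yields for a downloaded table, so the output of `map_to_template` could then be passed to the existing processing, validation, and upload pipeline.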

aclayton555 commented 4 months ago

Not pressing, but consider picking this up late 24-5 sprint. Complements current upload process. These are mostly quality of life improvements.

aclayton555 commented 3 months ago

Will be making some updates to the QC checks on the database to limit their scope to newer entries rather than the WHOLE database, which takes a lot of time to validate.

Bankso commented 2 months ago

Updated union_qc script is here: https://github.com/mc2-center/mc2-center-dcc/tree/union-qc-update-issue-58

Updates introduce steps to 1) pull the latest CCKP database CSV, 2) compare select fields between the database and new CSVs, 3) produce a CSV of new/"updated" entries, 4) validate new/"updated" entries
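Steps 2 and 3 can be sketched roughly as follows. This is not the code from the union_qc branch; the function name `find_new_or_updated` and its parameters are assumptions made for illustration, and rows are plain dicts such as those produced by `csv.DictReader`.

```python
def find_new_or_updated(db_rows, new_rows, key, fields):
    """Return rows from new_rows that are absent from db_rows or differ in `fields`.

    key    -- hypothetical name of the column that identifies an entry
    fields -- the select columns to compare between the two tables
    """
    existing = {row[key]: row for row in db_rows}
    changed = []
    for row in new_rows:
        old = existing.get(row[key])
        if old is None:
            # Entry is not in the database yet.
            changed.append(row)
        elif any(row.get(f, "") != old.get(f, "") for f in fields):
            # At least one compared field does not match exactly.
            changed.append(row)
    return changed
```

Only the rows returned here would need to go through validation, which is where the time saving comes from.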

During testing, I was able to reduce validation of the publications union table to a few minutes after implementing these changes.

Similar to the previous workflow, the validation report can be used to check for issues in the new/"updated" entries. Once the erroneous entries in the merged manifest have been corrected, the manifest will represent the new CCKP database CSV. I recommend that the merged + corrected manifest be compared to the CCKP database, just to be safe, before uploading to CCKP - Admin.

I'll also note that "updated" is used loosely here, since the script is really just checking if the fields match up perfectly. In some cases, subsequent inspection showed that many "updated" entries were identical to existing entries - my guess is that the differences arise from character encoding or similar issues.

With this updated script in hand, I will proceed with the second round of backpopulation.
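If the guess about character encoding is right, normalizing values before comparing them might cut down on false "updated" entries. The sketch below is one possible approach, not something the script currently does; `normalize` and `fields_match` are hypothetical helpers.

```python
import unicodedata


def normalize(value):
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", value)
    return " ".join(text.split())


def fields_match(a, b):
    """Compare two field values after normalization instead of byte-for-byte."""
    return normalize(a) == normalize(b)
```

With this, a precomposed "é" and an "e" followed by a combining accent compare as equal, as do a regular space and a non-breaking space.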

Bankso commented 2 months ago

Backpopulation has been completed for:

Bankso commented 2 months ago

Summary of work under this ticket: