pulibrary / pdc_describe

Description application for Research Data content

Duplicate Files in Migrated Items #1918

Open sec122 opened 2 months ago

sec122 commented 2 months ago

Expected behavior

Only one copy of each file should be present in migrated objects.

Actual behavior

Some files are duplicated in migrated datasets. The duplicates are easy to spot because their filenames carry a prefix of either "dataspace" or "globus".

These duplicated-file cases fall into three groups:

a) Only the README files are duplicated: 22 cases
b) All files are duplicated (and there are TAR files): 6 cases. Matt has more info in a separate ticket, #1920, about how we want to handle these cases.
c) All files are duplicated (and there are no TAR files): 1 case
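As a rough illustration of how an item's file list could be sorted into these groups, here is a minimal Python sketch. It is not the project's actual tooling. It assumes the prefixes are written as "dataspace_" and "globus_" (only "globus_" is spelled out in this thread), that README files have a base name starting with "readme", and that TAR files end in ".tar". The function names are hypothetical.

```python
from collections import defaultdict

# Assumed prefix spelling; the thread only shows "globus_" explicitly.
PREFIXES = ("dataspace_", "globus_")


def split_prefix(name):
    """Return (source, base_name); source is None when no known prefix is present."""
    for prefix in PREFIXES:
        if name.startswith(prefix):
            return prefix.rstrip("_"), name[len(prefix):]
    return None, name


def find_duplicates(filenames):
    """Base names that appear under more than one source prefix."""
    sources = defaultdict(set)
    for name in filenames:
        source, base = split_prefix(name)
        if source is not None:
            sources[base].add(source)
    return sorted(base for base, seen in sources.items() if len(seen) > 1)


def classify(filenames):
    """Map one item's file list onto the groups described above (simplified)."""
    dupes = find_duplicates(filenames)
    if not dupes:
        return "no duplicates"
    if all(base.lower().startswith("readme") for base in dupes):
        return "readme only"
    has_tar = any(name.lower().endswith(".tar") for name in filenames)
    return "beyond readmes, with TAR" if has_tar else "beyond readmes, no TAR"
```

For example, `classify(["dataspace_README.txt", "globus_README.txt", "data.csv"])` falls into the "readme only" group. The sketch does not check that every file in an item is duplicated, only whether any non-README file is.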

Steps to replicate

View the full list of items (color-coded in red) on the NEEDS ATTENTION tab of the "Copy of RDOS Records in DataSpace" Google Sheet: https://docs.google.com/spreadsheets/d/130B7RMhnqSeTIKPFBdDsrSVbC1C_PCdZp0qwTucR0QA/edit?usp=sharing

Filter on Issue type = "Duplicate Files beyond Readmes" or "Duplicate Readmes Only" for specific examples and links to the records.

Impact of this bug

We cannot approve these datasets until the issue is fixed, so the records remain in DataSpace in the meantime.

Honeybadger link and code snippet, if applicable

Implementation notes, if any

I believe @carolyncole may already have a script to take care of these issues, since she fixed a very similar issue for us earlier in the migration. I'm unsure whether this requires a new, similar script or just rerunning the existing one.

Acceptance criteria

carolyncole commented 1 month ago

Hey team! Please add your planning poker estimate with Zenhub @bess @hectorcorrea @JaymeeH @leefaisonr

carolyncole commented 1 month ago

Here is the output of the rake task, showing that there are some actual checksum mismatches between Globus and DataSpace.

@sec122 the curators will need to look into these further.

442.txt 479.txt

matthewjchandler commented 1 month ago

@sec122 In the case of 479 mentioned above, the only difference I see is in the README file. The copy starting with "globus_" shows some character encoding errors when viewed in a browser (for me, at least). If you don't see any substantial differences in content between the two, then I'd recommend we go with the one starting with "dataspace".

matthewjchandler commented 1 month ago

@sec122 As for 442, that's more mysterious. Here are the options I see:

  1. Leave all of the files in PDC for now, flag it for later review, and move on with the migration
  2. Make a judgment call to preserve the file set closest to what was originally uploaded (those starting with "dataspace" I believe)
  3. Go through all of the .nc files, figure out why the checksums don't match, and make a confident decision about what to keep and what to delete

One way or another, RDSS will need clear guidance from PRDS about what to keep and what to delete (if anything).
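For option 3, a first pass might simply list which prefixed pairs differ. The sketch below is a hypothetical Python helper, not the rake task used above. It assumes both copies of each file have been downloaded into one local directory, that they are named "dataspace_<name>" and "globus_<name>", and it uses SHA-256, which may not be the checksum algorithm the rake task reports.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large .nc files are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_copies(directory):
    """Report 'match' or 'mismatch' for every base name present under both prefixes."""
    directory = Path(directory)
    report = {}
    for ds_path in sorted(directory.glob("dataspace_*")):
        base = ds_path.name[len("dataspace_"):]
        globus_path = directory / f"globus_{base}"
        # Skip files that have no counterpart under the other prefix.
        if not ds_path.is_file() or not globus_path.is_file():
            continue
        same = file_digest(ds_path) == file_digest(globus_path)
        report[base] = "match" if same else "mismatch"
    return report
```

A report like this only narrows down which files need a closer look. It says nothing about why two copies differ or which one is correct.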

sec122 commented 2 weeks ago

@carolyncole Here is a set of items that just need the duplicate files and prefixes removed (see the notes in our spreadsheet under the "actions remaining" column for specifics about each):

carolyncole commented 2 weeks ago

@sec122 Those updates are completed now.