usnistgov / NIST-Tech-Pubs

XML metadata for NIST Technical Series Publications
https://pages.nist.gov/NIST-Tech-Pubs/

Deduplication of records #38

Closed: tkphd closed this 1 year ago

tkphd commented 1 year ago

This PR removes duplicate records for numerous NIST publications from allrecords.xml. The process used was:

  1. Generate a list of duplicated DOIs
  2. Open the allrecords.xml file
  3. For each duplicated DOI:
    1. Find the first instance of the DOI
    2. Remove the record (everything in the <query>...</query> block)
    3. Search for the DOI to ensure it is present in a "later" query block
    4. Save the file
    5. Commit the changes for the specific DOI ("atomic" commits)
  4. Repeat until all duplicate records have been removed
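
For anyone who wants to check the result or repeat the cleanup, the manual steps above can be approximated with a short script. This is only a sketch, not the procedure that was actually used for this PR. It assumes each record is a `<query>` element containing a `<doi>` element somewhere inside it, ignores XML namespaces by comparing local tag names, and compares DOIs as exact strings. The helper names (`local`, `record_doi`, `drop_earlier_duplicates`) and the sample DOIs are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def record_doi(query):
    """Return the text of the first <doi> inside a <query>, or None."""
    for el in query.iter():
        if local(el.tag) == "doi" and el.text:
            return el.text.strip()
    return None


def drop_earlier_duplicates(root):
    """Remove every <query> whose DOI appears again in a later <query>.

    Returns the DOIs that had duplicates, in order of first removal.
    """
    # Collect (parent, query) pairs first so removal does not disturb iteration.
    pairs = [(parent, child)
             for parent in root.iter()
             for child in list(parent)
             if local(child.tag) == "query"]
    remaining = Counter(record_doi(q) for _, q in pairs)
    removed = []
    for parent, query in pairs:
        doi = record_doi(query)
        if doi is None:
            continue  # no DOI found; leave the record untouched
        if remaining[doi] > 1:
            # A later copy still exists, so drop this earlier one.
            parent.remove(query)
            remaining[doi] -= 1
            if doi not in removed:
                removed.append(doi)
    return removed


# Made-up sample data; the real allrecords.xml layout may differ.
SAMPLE = """<body>
  <query><doi>10.0000/example-a</doi><title>old</title></query>
  <query><doi>10.0000/example-b</doi><title>only</title></query>
  <query><doi>10.0000/example-a</doi><title>new</title></query>
</body>"""

root = ET.fromstring(SAMPLE)
removed = drop_earlier_duplicates(root)
kept = [(record_doi(q), q.findtext("title")) for q in root.iter("query")]
print(removed)
print(kept)
```

Unlike the manual workflow, this sketch makes all removals in one pass, so it does not produce one commit per DOI. To run it on the real file, `ET.parse("allrecords.xml")` and `tree.write(...)` would replace the inline sample, after confirming that the element names match.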

This workflow should have preserved the latest version of each record with a duplicated DOI, assuming that later entries in the file are the newer ones. If there are errors, the atomic commits should make it easy to revert a specific change. The downside is the large number of commits in this PR. For that reason, if this PR is accepted, please use a squash merge to combine the atomic changes into a single commit.

kmiller621 commented 1 year ago

@tkphd All the duplicates were removed in the latest update of allrecords.xml. I think this PR can be closed, but please submit again if you find more.