uga-libraries / format-report

Aggregate and analyze csv files with file format information generated by the UGA Libraries' digital preservation system (ARCHive).
Creative Commons Attribution Share Alike 4.0 International

Automatically match formats to NARA #6

Closed (amhanson9 closed 1 year ago)

amhanson9 commented 1 year ago

Use https://github.com/uga-libraries/accessioning-scripts to automatically match all format identifications to the NARA Digital Preservation Framework. We might also reuse some of the summaries from that script.

amhanson9 commented 1 year ago

Automatic matching can result in multiple possible NARA matches for one format identification, which adds more inflation (files listed in the spreadsheet more than once). I can think of two possible approaches right now:

1) Make two sets of the format CSVs before making the summaries, one with and one without NARA risk. Only use the ones with NARA risk for summaries that require risk information.

2) Add NARA risk to the current format CSVs and manually review to select the correct match when there is more than one (file path and format id will be duplicates).

Given the importance of risk information and that we only run this report once every 2 years, it is probably worth the time to do option 2. That way, risk information can be included in all summaries without added inflation and we may learn more about how to refine NARA matches automatically by looking at the information in more detail.
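For the manual review in option 2, a minimal sketch of how the rows that need attention could be flagged. It uses only the standard library, and the column names (File_Path, Format_ID, NARA_Risk), the Needs_Review column, and the sample values are hypothetical, not taken from the actual format CSVs.

```python
from collections import Counter


def flag_multiple_matches(rows, key_columns=("File_Path", "Format_ID")):
    """Return copies of the rows with Needs_Review set to True when the key repeats.

    The key columns are hypothetical names for the file path and format id.
    """
    # Count how many rows share each (file path, format id) combination.
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)

    flagged = []
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        # More than one row with the same key means NARA matched more than once.
        flagged.append({**row, "Needs_Review": counts[key] > 1})
    return flagged


# Hypothetical rows after adding NARA risk: the first file matched two NARA rows.
rows = [
    {"File_Path": "aip1/file1.txt", "Format_ID": "fmt-a", "NARA_Risk": "Low Risk"},
    {"File_Path": "aip1/file1.txt", "Format_ID": "fmt-a", "NARA_Risk": "Moderate Risk"},
    {"File_Path": "aip1/file2.pdf", "Format_ID": "fmt-b", "NARA_Risk": "Low Risk"},
]
flagged = flag_multiple_matches(rows)
```

Only the rows where Needs_Review is True would need a person to pick the correct match; the rest can pass straight through to the summaries.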

amhanson9 commented 1 year ago

I'm using two functions from accessioning-scripts: csv_to_dataframe(), which includes error handling for encoding errors, and match_to_nara(). I copied them to merge_format_reports.py so that I can update them to fit this script's inputs and needs.
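To illustrate the kind of encoding error handling mentioned above, here is a rough sketch. It is not the code of csv_to_dataframe(): it uses the standard library csv module instead of a dataframe, and the function name read_csv_rows() and the fallback strategy (rereading with undecodable bytes dropped) are assumptions.

```python
import csv


def read_csv_rows(csv_path):
    """Read a CSV into a list of dicts and report whether encoding errors occurred.

    Hypothetical helper: returns (rows, had_encoding_errors).
    """
    try:
        # First attempt: strict UTF-8, which raises on any undecodable byte.
        with open(csv_path, newline="", encoding="utf-8") as open_file:
            return list(csv.DictReader(open_file)), False
    except UnicodeDecodeError:
        # Fallback: reread and silently drop the bytes that cannot be decoded.
        with open(csv_path, newline="", encoding="utf-8", errors="ignore") as open_file:
            return list(csv.DictReader(open_file)), True
```

Returning the flag lets the caller log which CSVs lost characters, since dropped bytes could change a format name enough to prevent a match.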

amhanson9 commented 1 year ago

ARCHive format information does not include the file extension, so it will not automatically match as many formats in NARA as the accessioning script does. Most of the non-PUID matches during accessioning are by file extension. Matches by file extension are also the most common reason that multiple NARA rows match one format identification, so this will reduce the time spent on removing multiple matches. But more time will be required to check for format names that are not exact matches but are clearly the same format, to get risk for as many formats as possible.
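One possible way to speed up the check for near-identical format names is to have the script suggest candidates for a person to confirm. This is only a sketch, not part of match_to_nara(): the helper names, the 0.8 cutoff, and the sample format names are assumptions, and the suggestions would still need manual review.

```python
import difflib


def normalize(name):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def suggest_nara_names(format_name, nara_names, cutoff=0.8):
    """Return up to three NARA format names that are close to format_name, best first."""
    # Map each normalized NARA name back to its original spelling.
    lookup = {normalize(name): name for name in nara_names}
    close = difflib.get_close_matches(
        normalize(format_name), list(lookup), n=3, cutoff=cutoff
    )
    return [lookup[name] for name in close]


# Hypothetical names: differs from the NARA name only by case and spacing.
nara_names = ["Portable Document Format", "Waveform Audio"]
suggestions = suggest_nara_names("portable  document format", nara_names)
```

A higher cutoff gives fewer suggestions to review but may miss names that differ by more than punctuation or a version label.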

amhanson9 commented 1 year ago

Added a second argument to the script to get the path to the NARA risk information CSV. It didn't seem right to keep a copy of their CSV in our repo too (which would give it a predictable path) and the CSV name changes with each new version, so I didn't want to have to keep updating .gitignore. To verify the arguments, I made a new check_arguments() function.
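A simplified sketch of what an argument check like this could look like. It is not the function from the repo: it assumes the first argument is a folder of format reports and the second is the NARA CSV, and the return value and error messages are made up for illustration.

```python
import os


def check_arguments(argument_list):
    """Check a sys.argv-style list and return (report_folder, nara_csv, errors).

    Assumes argument 1 is the report folder and argument 2 is the NARA risk CSV.
    A path is returned as None if it is missing or does not exist.
    """
    report_folder = None
    nara_csv = None
    errors = []

    # First argument: folder with the format reports (assumed).
    if len(argument_list) > 1:
        if os.path.isdir(argument_list[1]):
            report_folder = argument_list[1]
        else:
            errors.append(f"Report folder '{argument_list[1]}' does not exist")
    else:
        errors.append("Required argument report_folder is missing")

    # Second argument: path to the NARA risk CSV, whose name changes by version.
    if len(argument_list) > 2:
        if os.path.isfile(argument_list[2]):
            nara_csv = argument_list[2]
        else:
            errors.append(f"NARA CSV '{argument_list[2]}' does not exist")
    else:
        errors.append("Required argument nara_csv is missing")

    if len(argument_list) > 3:
        errors.append("Too many arguments, expected report_folder and nara_csv")

    return report_folder, nara_csv, errors
```

The script could call it as check_arguments(sys.argv), print every error at once, and exit if the errors list is not empty, so that a person fixing the command sees all problems in one run.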