uga-libraries / format-report

Aggregate and analyze csv files with file format information generated by the UGA Libraries' digital preservation system (ARCHive).
Creative Commons Attribution Share Alike 4.0 International

Digital Preservation Format Report

Overview

Analyze format information from the UGA Libraries' digital preservation system (ARCHive), both across the entire system and for individual departments, to monitor the formats for preservation risks.

The format data is generated by either FITS or MediaInfo and is consistently formatted in a CSV by ARCHive. Risk levels are added from the U.S. National Archives and Records Administration's (NARA) digital preservation risk data.
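As a rough illustration of how risk can be attached to format data, the sketch below joins two small in-memory CSVs on format name. It is not the repository's implementation: the column names ("Format Name", "Risk Level"), the sample rows, and the "No Match" label are all assumptions made for this example, and it matches on name only.

```python
import csv
import io

# Illustrative stand-ins for the ARCHive format CSV and the NARA CSV.
# Column names and rows are assumptions, not the real report layout.
FORMAT_CSV = """Group,Format Name,File Count
dept-a,Portable Document Format,120
dept-b,Example Unknown Format,3
"""

NARA_CSV = """Format Name,Risk Level
Portable Document Format,Moderate Risk
"""


def add_risk(format_rows, nara_rows):
    """Return copies of format_rows with a Risk Level column added.

    Rows are matched on case-insensitive format name. Formats that are
    not in the NARA data get the hypothetical label "No Match".
    """
    risk_by_name = {row["Format Name"].lower(): row["Risk Level"] for row in nara_rows}
    combined = []
    for row in format_rows:
        new_row = dict(row)
        new_row["Risk Level"] = risk_by_name.get(row["Format Name"].lower(), "No Match")
        combined.append(new_row)
    return combined


format_rows = list(csv.DictReader(io.StringIO(FORMAT_CSV)))
nara_rows = list(csv.DictReader(io.StringIO(NARA_CSV)))
rows_with_risk = add_risk(format_rows, nara_rows)
```

Keeping unmatched formats in the output, rather than dropping them, leaves them visible for manual review.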

The analysis is accomplished with a series of scripts:

Getting Started

Dependencies

Installation

Download the latest version of NARA's Digital Preservation Plan spreadsheet (CSV version) from the U.S. National Archives Digital Preservation GitHub Repo.

From ARCHive, download every group's file format report and the usage report (start of ARCHive to present), and save them all to a single folder. This folder is the report_folder argument used by most scripts.
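Before running the scripts, it can help to confirm that the downloads landed in one folder. The helper below is a minimal sketch for that check. The function name list_report_csvs is hypothetical, and it assumes the reports are saved with a .csv extension.

```python
import os


def list_report_csvs(report_folder):
    """Return the sorted names of the CSV files directly inside report_folder."""
    if not os.path.isdir(report_folder):
        raise FileNotFoundError(f"report_folder does not exist: {report_folder}")
    return sorted(
        name for name in os.listdir(report_folder)
        if name.lower().endswith(".csv")
    )
```

Calling list_report_csvs with the path to the download folder returns the file names to compare against the list of groups in ARCHive.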

Script Arguments

All script arguments are required.
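Because every argument is required, a script can check its arguments up front and stop with a clear message. The sketch below shows one way to do that. The function check_arguments is hypothetical and not taken from the repository, and it assumes every argument is a path that must already exist.

```python
import os


def check_arguments(argument_list, expected_names):
    """Return a list of error messages; an empty list means the arguments are usable.

    argument_list is shaped like sys.argv (script name first).
    expected_names lists the required arguments in order.
    Assumption: every argument is a path that must already exist.
    """
    errors = []
    supplied = argument_list[1:]
    if len(supplied) < len(expected_names):
        missing = expected_names[len(supplied):]
        errors.append("Missing required argument(s): " + ", ".join(missing))
    for name, value in zip(expected_names, supplied):
        if not os.path.exists(value):
            errors.append(f"{name} path does not exist: {value}")
    return errors
```

A script could call check_arguments(sys.argv, ["report_folder"]) at the start, print any messages returned, and exit if the list is not empty.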

archive_reports.py

department_reports.py

fix_version.py

merge_format_reports.py

update_standardization.py

Testing

There are unit tests for every function in each script, and for running each script in its entirety.

Tests use files stored in the repository as input data. These files need to be updated whenever the NARA Digital Preservation Plan spreadsheet changes, or whenever the output of merge_format_reports.py, which is used as input for other scripts, changes.

Workflow

For the full analysis, use the ARCHive reports workflow.

To generate a report for a specific department, use the department report workflow.

Author

Adriane Hanson - Head of Digital Stewardship at the University of Georgia Libraries

History

The first analysis was in 2020. It was repeated in 2021, but because little had changed, we adjusted the schedule to every two years. In both years, risk was evaluated by manually comparing the most common standardized format names to the NARA Digital Preservation Plan spreadsheet, and only the ARCHive reports were created.

The third analysis was in 2023. The comparison to the NARA spreadsheet was partially automated, allowing us to compare every format version to NARA, and the department reports were added. This gives us more nuanced, actionable risk data.