uga-libraries / accessioning-scripts

Scripts used for accessioning born-digital archives
Creative Commons Attribution Share Alike 4.0 International

Accessioning Scripts

Overview

Scripts used for accessioning born-digital archives at the UGA Special Collections Libraries. Workflow documentation can be found in the born-digital-accessioning repo.

See sample-output for examples of the reports generated by these scripts.

Getting Started

Dependencies

Installation

The typical directory structure for accessions is as follows:

For format analysis:

  1. Download the latest version of NARA's Digital Preservation Plan spreadsheet (CSV version) from the U.S. National Archives Digital Preservation GitHub Repo and save it to your local copy of the accessioning-scripts directory

  2. Create a file named configuration.py from the configuration_template.py in the accessioning-scripts repo and add the appropriate file paths

Using the Scripts

find-long-paths.py

This standalone script identifies every file in an accession whose path exceeds the Windows maximum of 260 characters and records those files in a CSV log. Long file paths must be identified and shortened before bagging the accession; otherwise, bagit.py will raise permissions errors.

Script output is a CSV file called file-path-changes.csv that can be used as a change log to document the new shortened paths.
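As a rough illustration of the check described above, the following sketch walks a folder and logs every path over the limit. It is not the code in find-long-paths.py: the function names and the CSV column names are assumptions made for this example.

```python
import csv
import os

MAX_PATH_LENGTH = 260  # Windows limit cited in this README


def find_long_paths(accession_dir, max_length=MAX_PATH_LENGTH):
    """Return every file path under accession_dir longer than max_length."""
    long_paths = []
    for root, _dirs, files in os.walk(accession_dir):
        for name in files:
            path = os.path.join(root, name)
            if len(path) > max_length:
                long_paths.append(path)
    return long_paths


def write_change_log(long_paths, log_path):
    """Write the long paths to a CSV; column names are hypothetical."""
    with open(log_path, "w", newline="", encoding="utf-8") as log:
        writer = csv.writer(log)
        writer.writerow(["original_path", "path_length", "new_path"])
        for path in long_paths:
            # new_path is left blank for the archivist to fill in.
            writer.writerow([path, len(path), ""])
```

The blank third column reflects the change-log use mentioned above: each shortened path can be recorded next to the original.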

format-analysis.py

This script extracts technical metadata from the files in the accession folder, compares it to multiple risk criteria, and produces a summary report for use in appraisal and in evaluating an accession's complexity.

It is designed to be run repeatedly as the archivist makes changes based on the report information, e.g. deleting files from the accession or editing the risk data CSV.

If output files such as FITS XMLs or the risk spreadsheet already exist from previous iterations of the script, format-analysis.py will reuse that data. This saves time and also retains any manual updates that may have been made to those files.

If changes are made to the files in the accession folder, the full risk summary CSV must be deleted before re-running the script so that the CSV can be re-generated with up-to-date format data.
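The reuse behavior described above can be pictured as a load-or-create pattern: an existing file is read back, manual edits included, and it is only rebuilt once it has been deleted. The sketch below assumes a generic CSV and a hypothetical helper name; it does not reproduce how format-analysis.py is actually organized.

```python
import csv
import os


def load_or_create_csv(csv_path, build_rows):
    """Reuse csv_path if it exists; otherwise build it with build_rows().

    Returns (rows, reused), where reused is True when the file already existed.
    build_rows is a hypothetical callable that returns a list of rows.
    """
    if os.path.exists(csv_path):
        # Existing output is kept, so any manual edits survive the re-run.
        with open(csv_path, newline="", encoding="utf-8") as f:
            return list(csv.reader(f)), True
    rows = build_rows()
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    return rows, False
```

Under this pattern, deleting the CSV is what forces regeneration, which is why a stale file must be removed after the accession's contents change.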

If there are no script-generated files present, the script will:

If there is a folder of FITS XML files, the script will:

If there is a risk spreadsheet, the script will:

technical-appraisal-logs.py

This script generates a CSV manifest of all the digital files received in an accession. It also identifies file paths that may break other scripts due to length or special characters and saves those paths to a separate log for review.
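A minimal sketch of the kind of path check described above follows. Which characters count as "special" is an assumption made here (anything outside ASCII letters, digits, spaces, and a few common path punctuation marks), and the 260-character limit is borrowed from the find-long-paths.py description; the script's real criteria may differ.

```python
import re

MAX_PATH_LENGTH = 260  # assumed to match the Windows limit noted earlier

# Assumption: characters outside this set are treated as "special".
ALLOWED = re.compile(r"[A-Za-z0-9 _.\-/\\:()]+")


def review_reasons(path, max_length=MAX_PATH_LENGTH):
    """Return a list of reasons a path should be reviewed (empty if none)."""
    reasons = []
    if len(path) > max_length:
        reasons.append("path too long")
    if not ALLOWED.fullmatch(path):
        reasons.append("special characters")
    return reasons
```

Paths that return a non-empty list would be the ones written to the review log.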

This script is also intended to be run after a round of technical appraisal. When run with the "compare" argument, it compares the initial manifest to the files currently in the accession and generates a CSV log of any files that have been deleted. If a deletion log already exists from an earlier iteration, running the script again adds any new deletions to the existing log.

Script output is an initial manifest CSV and a "files to review" CSV log. If using the "compare" argument, the only output is a deletion log CSV.
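The manifest and "compare" steps can be sketched as below. This is an illustration under stated assumptions, not the script's implementation: the function names, the CSV columns, and the choice to match files by full path are all hypothetical.

```python
import csv
import os


def list_files(accession_dir):
    """Return a sorted list of every file path under accession_dir."""
    return sorted(
        os.path.join(root, name)
        for root, _dirs, files in os.walk(accession_dir)
        for name in files
    )


def write_manifest(accession_dir, manifest_path):
    """Write the initial manifest; save it outside accession_dir."""
    with open(manifest_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["path", "size_bytes"])  # hypothetical columns
        for path in list_files(accession_dir):
            writer.writerow([path, os.path.getsize(path)])


def log_deletions(accession_dir, manifest_path, deletion_log_path):
    """Append manifest paths that no longer exist to the deletion log."""
    with open(manifest_path, newline="", encoding="utf-8") as f:
        manifest_paths = [row["path"] for row in csv.DictReader(f)]
    current = set(list_files(accession_dir))

    # Skip deletions recorded by an earlier run so they are not duplicated.
    log_exists = os.path.exists(deletion_log_path)
    already_logged = set()
    if log_exists:
        with open(deletion_log_path, newline="", encoding="utf-8") as f:
            already_logged = {row["path"] for row in csv.DictReader(f)}

    new_deletions = [
        p for p in manifest_paths
        if p not in current and p not in already_logged
    ]
    with open(deletion_log_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if not log_exists:
            writer.writerow(["path"])
        writer.writerows([p] for p in new_deletions)
    return new_deletions
```

Opening the log in append mode mirrors the behavior described above, where later runs add to an existing deletion log instead of replacing it.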