Accessioning Scripts

Overview

Scripts used for accessioning born-digital archives at the UGA Special Collections Libraries. Workflow documentation can be found in the born-digital-accessioning repo.

See sample-output for examples of the reports generated by these scripts.

Getting Started

Dependencies

numpy (https://numpy.org/)
pandas (https://pandas.pydata.org/docs/)
bagit (https://libraryofcongress.github.io/bagit-python/)

Installation

The typical directory structure for accessions is as follows:

collection_id/name
- accession_id
  - accession_id_bag (or folder with unbagged accession contents)
    - media_folder1 (includes DMID)
    - media_folder2
  - preservation_log.txt
  - initialmanifest_YYYYMMDD.csv

For format analysis:

Download the latest version of NARA's Digital Preservation Plan spreadsheet (CSV version) from the U.S. National Archives Digital Preservation GitHub Repo and save it to your local copy of the accessioning-scripts directory.
Create a file named configuration.py from the configuration_template.py in the accessioning-scripts repo and add the appropriate file paths.

Using the Scripts

find-long-paths.py

Script usage: python /path/to/script /path/to/accession_folder

This is a standalone script that finds every file in an accession whose file path exceeds the Windows maximum of 260 characters and records those files in a CSV log. These long file paths must be identified and shortened before bagging the accession; otherwise, bagit.py will raise permissions errors.

Script output is a CSV file called file-path-changes.csv that can be used as a change log to document the new shortened paths.
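
The check described above can be sketched as follows. This is a minimal illustration, not the code in find-long-paths.py: the function names and the CSV column names are hypothetical, and the 260-character limit is taken from the description above.

```python
import csv
import os

MAX_PATH_LENGTH = 260  # Windows maximum cited in the description above


def find_long_paths(accession_folder, max_length=MAX_PATH_LENGTH):
    """Return (path, length) pairs for files whose full path exceeds max_length."""
    long_paths = []
    for root, _dirs, files in os.walk(accession_folder):
        for name in files:
            path = os.path.join(root, name)
            if len(path) > max_length:
                long_paths.append((path, len(path)))
    return long_paths


def write_change_log(long_paths, log_path):
    """Write the long paths to a CSV, leaving a blank column for the shortened path."""
    with open(log_path, "w", newline="", encoding="utf-8") as log:
        writer = csv.writer(log)
        # Hypothetical column names; the real script's headers may differ.
        writer.writerow(["original_path", "length", "new_path"])
        for path, length in long_paths:
            writer.writerow([path, length, ""])
```

The blank third column is one way to let the CSV double as a change log: the archivist fills in each shortened path after renaming.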

format-analysis.py

Script usage: python /path/to/script /path/to/accession_folder
Use an absolute path for the accession_folder. A relative path may prevent FITS XML from being generated.

This script extracts technical metadata from files in the accession folder, compares it to multiple risk criteria, and produces a summary report to use for appraisal and evaluating an accession's complexity.

It is designed to be run repeatedly as the archivist makes changes based on the report information, e.g. deleting files from the accession or editing the risk data CSV.

If output files such as FITS XMLs or the risk spreadsheet already exist from previous iterations of the script, format-analysis.py will reuse that data. This saves time and also retains any manual updates that may have been made to those files.

If changes are made to the files in the accession folder, the full risk data CSV must be deleted before re-running the script so that the CSV can be re-generated with up-to-date format data.

If there are no script-generated files present, the script will:

Generate a FITS XML for every file in the accession folder
Generate a FITS summary CSV
Generate a full risk data CSV
Generate a format analysis spreadsheet

If there is a folder of FITS XML files, the script will:

Update the FITS XML and FITS summary CSV to match the files in the accession folder
Generate a full risk data CSV (if one is not already present)
Generate a format analysis spreadsheet

If there is a risk spreadsheet, the script will:

Use it to generate a format analysis spreadsheet (it is not automatically updated based on changes to the FITS summary CSV)
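
The three cases above can be summarized as a small decision function. This is only a sketch of the behavior as described in this README, not the logic in format-analysis.py: the function name, parameter names, and step labels are hypothetical. The README does not describe the case where a risk CSV exists without a FITS folder, so the sketch assumes FITS output is still generated in that case.

```python
def plan_steps(has_fits_folder, has_risk_csv):
    """List the steps a run would perform, following the cases described above."""
    steps = []
    if has_fits_folder:
        # Existing FITS output is brought in line with the accession folder.
        steps.append("update FITS XML and FITS summary CSV")
    else:
        steps.append("generate FITS XML")
        steps.append("generate FITS summary CSV")
    if not has_risk_csv:
        # An existing risk CSV is reused as-is, which preserves manual edits.
        steps.append("generate full risk data CSV")
    steps.append("generate format analysis spreadsheet")
    return steps
```

For example, plan_steps(True, True) skips risk CSV generation, which is why the risk CSV has to be deleted by hand after the accession's files change.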

technical-appraisal-logs.py

Script usage: python /path/to/script /path/to/accession_folder [compare]

This script generates a CSV manifest of all the digital files received in an accession. It also identifies file paths that may break other scripts due to length or special characters and saves those paths to a separate log for review.

The script can also be run after a round of technical appraisal. With the "compare" argument, it compares the initial manifest to the files currently in the accession and generates a CSV log of any files that have been deleted. If a deletion log already exists from an earlier iteration, the script adds any additional deletions to the existing log.

Script output is an initial manifest CSV and a "files to review" CSV log. If using the "compare" argument, the only output is a deletion log CSV.
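
The "compare" behavior can be sketched roughly as below. This is an illustration under stated assumptions, not the code in technical-appraisal-logs.py: the function names are hypothetical, the manifest is assumed to store paths relative to the accession folder in a column named "File", and the deletion log is assumed to be a single-column CSV with no header.

```python
import csv
import os


def current_files(accession_folder):
    """Return the set of file paths in the accession, relative to its root."""
    paths = set()
    for root, _dirs, files in os.walk(accession_folder):
        for name in files:
            paths.add(os.path.relpath(os.path.join(root, name), accession_folder))
    return paths


def find_deletions(manifest_csv, accession_folder, path_column="File"):
    """Return manifest paths that are no longer present in the accession folder."""
    present = current_files(accession_folder)
    with open(manifest_csv, newline="", encoding="utf-8") as manifest:
        listed = [row[path_column] for row in csv.DictReader(manifest)]
    return [path for path in listed if path not in present]


def append_deletions(deleted, log_csv):
    """Add newly deleted paths to the log, keeping entries from earlier runs."""
    already_logged = set()
    if os.path.exists(log_csv):
        with open(log_csv, newline="", encoding="utf-8") as log:
            already_logged = {row[0] for row in csv.reader(log) if row}
    with open(log_csv, "a", newline="", encoding="utf-8") as log:
        writer = csv.writer(log)
        for path in deleted:
            if path not in already_logged:
                writer.writerow([path])
```

Opening the log in append mode and skipping paths that are already recorded is what lets repeated runs extend the log without duplicating earlier entries.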
