usnationalarchives / partner-data-transform

Python scripts to transform partner data for upload to National Archives Catalog

partner-data-transform

This repo contains the files needed to transform data from partner digitization projects into a format that complies with the data schema for import into the Description and Authority Service (DAS), for inclusion in the National Archives Catalog.

Download this repo as it exists and use it as your working directory.

/metadata

Partner XML metadata for each microfilm publication must go in the metadata folder. Samples for a publication can be found in the metadata folder here.

/objects

The CSV file generated by the S3 Manifester must go in the objects folder. Samples for a publication can be found in the objects folder here.

Python scripts

Python scripts must be modified for each new instance. Notes for where to modify scripts can be found below.

All scripts in this repo are written in Python 2. If you are working in Python 3, use these scripts.

Python scripts must be executed in the following order:

  1. s3_file_list.py
    • This script generates a CSV file listing the filepaths of all digital images in the specified directory, along with other relevant data used in the data transformation. The script requires the boto3 Python module and the AWS Command Line Interface. Install them with the commands pip install boto3 and pip install awscli, then configure your AWS credentials with the command aws configure.
  2. s3_csv_split.py
    • This script takes the CSV file with all the digital image filepaths from the Amazon S3 cloud and breaks them out per microfilm roll.
  3. reformat_partner_xml.py
    • This script reformats the partner XML into the Description and Authority Service (DAS) XML format, then marries the XML with the digital object filepaths.
  4. combine_xml.py
    • This script combines the newly-generated XML files from reformat_partner_xml.py into files of 75 MB or less for import into DAS.
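
The grouping in step 2 and the size cap in step 4 can be illustrated with a minimal sketch. This is not the code from s3_csv_split.py or combine_xml.py. The function names, the assumption that the roll is the parent folder of each filepath, and the reading of 75 MB as 75 * 1024 * 1024 bytes are all hypothetical choices made for illustration. The sketch uses only the standard library and should run under both Python 2 and Python 3.

```python
import posixpath
from collections import defaultdict

# 75 MB cap from step 4; treating 1 MB as 1024 * 1024 bytes is an assumption.
MAX_BATCH_BYTES = 75 * 1024 * 1024


def group_paths_by_roll(paths):
    """Group S3-style filepaths by roll.

    Assumes (hypothetically) that the roll identifier is the name of the
    folder that directly contains each image.
    """
    rolls = defaultdict(list)
    for path in paths:
        roll = posixpath.basename(posixpath.dirname(path))
        rolls[roll].append(path)
    return dict(rolls)


def batch_by_size(files, limit=MAX_BATCH_BYTES):
    """Greedily pack (name, size_in_bytes) pairs into batches.

    Files are taken in the order given, and a new batch is started whenever
    adding the next file would push the running total past the limit.
    """
    batches, current, current_size = [], [], 0
    for name, size in files:
        if size > limit:
            raise ValueError("%s alone exceeds the size limit" % name)
        if current and current_size + size > limit:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Example with made-up filepaths and sizes.
example_rolls = group_paths_by_roll([
    "M0001/0001/a.jpg",
    "M0001/0001/b.jpg",
    "M0001/0002/c.jpg",
])
example_batches = batch_by_size(
    [("a.xml", 40), ("b.xml", 40), ("c.xml", 10)], limit=75
)
```

The greedy packing keeps files in their original order, which is simple but does not minimize the number of output files. Any wrapper elements added when XML files are merged would also count toward the limit, so a real run would likely need some headroom below 75 MB.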

Modifications to scripts

s3_file_list.py

s3_csv_split.py

reformat_partner_xml.py

combine_xml.py

Other files

The following files must be in the working directory as they exist here: