This repo contains the files needed to transform data from partner digitization projects into a format that complies with the data schema for import into the Description and Authority Service (DAS), for inclusion in the National Archives Catalog.
Download this repo as it exists and use it as your working directory.
Partner XML metadata for each microfilm publication must go in the metadata folder. Samples for a publication can be found in the metadata folder here.
The CSV file generated by the S3 Manifester must go in the objects folder. Samples for a publication can be found in the objects folder here.
Python scripts must be modified for each new instance. Notes for where to modify scripts can be found below.
All scripts in this repo are written in Python 2. If you are working in Python 3, use these scripts.
Python scripts must be executed in the following order:
Install the required packages with pip install boto3 and pip install awscli. Once installed, configure your AWS credentials with the command aws configure.
Change the S3 bucket name:
bucket = s3.Bucket(name='NARAprodstorage')
Change the target file name:
with open('m384_objects.csv', 'wt') as log :
Change the S3 directory:
for obj in bucket.objects.filter(Prefix='lz/microfilm-publications/M1064_LettrsRecdCommBranch1863-1870'):
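The three settings above (bucket name, target file name, S3 prefix) can be seen working together in the minimal sketch below. It is written in Python 3 and is not the repo's script: the function names write_manifest and list_bucket_objects are hypothetical, and the two-column, tab-delimited layout (key, size) is an assumption for illustration only. The actual objects CSV may contain different columns.

```python
import csv


def write_manifest(objects, target_path):
    """Write one tab-delimited row (key, size) per object; return the row count.

    `objects` is any iterable whose items expose `.key` and `.size`,
    which boto3 S3 object summaries do. The column layout is hypothetical.
    """
    count = 0
    with open(target_path, 'w', newline='') as log:
        writer = csv.writer(log, delimiter='\t')
        for obj in objects:
            writer.writerow((obj.key, obj.size))
            count += 1
    return count


def list_bucket_objects(bucket_name, prefix):
    """Return the object summaries under `prefix` (requires AWS credentials)."""
    import boto3  # imported lazily so write_manifest can be tried without boto3

    s3 = boto3.resource('s3')
    return s3.Bucket(bucket_name).objects.filter(Prefix=prefix)
```

With credentials configured, a call such as write_manifest(list_bucket_objects('NARAprodstorage', 'lz/microfilm-publications/M1064_LettrsRecdCommBranch1863-1870'), 'm384_objects.csv') would combine the three values shown above.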
Change the target file name:
with open('m384_copy.csv', 'r') as log :
Change the number of row[n] entries to match the number of columns in the original CSV:
writelog.writerow( (row[0], row[1], row[2], row[3], row[4], row[5], row[6], row[7] ) )
Change the series NAID:
series = 586957
Change the microfilm publication number:
pub = 'M384'
Ensure the xml tags for r.replace match the metadata:
try:
    with open('metadata/' + file + '_metadata.xml', 'r') as y :
        r = re.sub('<metadata name=\"(.*?)\" value=\"(.*?)\" />',r'<\1>\2</\1>', y.read())
        r = r.replace('Publication Number','Publication_Number')
        r = r.replace('Publication Title','Publication_Title')
        r = r.replace('Content Source','Content_Source')
        z = open(file + '_metadata_(reformatted).xml', 'w')
        z.write(r)
        z.close()
except IOError:
    print ' Error: ROLL NOT FOUND'
    x = x + 1
    continue
tree = ET.parse(file + '_metadata_(reformatted).xml')
root = tree.getroot()
Publication_Number = root.find('Publication_Number').text
Publication_Title = root.find('Publication_Title').text
print str(datetime.datetime.now().time()) + ': ' + Publication_Number, Publication_Title, 'Roll ' + str(roll)
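To see why the r.replace tags must match the metadata, it may help to run the same substitution on a small sample. The sketch below is Python 3, and the sample XML (the roll wrapper element and the title value) is invented for illustration; real partner metadata may be shaped differently. The regular expression turns each metadata name/value pair into an element named after the name attribute, and the replacements then remove the spaces that would otherwise make those element names invalid XML.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample; real partner metadata may differ.
SAMPLE = (
    '<roll>'
    '<metadata name="Publication Number" value="M384" />'
    '<metadata name="Publication Title" value="Example Title" />'
    '</roll>'
)


def reformat_metadata(text):
    """Turn <metadata name="X" value="Y" /> into <X>Y</X>, then fix tag names."""
    text = re.sub(r'<metadata name="(.*?)" value="(.*?)" />', r'<\1>\2</\1>', text)
    # Element names cannot contain spaces, so each multi-word name is replaced.
    for name in ('Publication Number', 'Publication Title', 'Content Source'):
        text = text.replace(name, name.replace(' ', '_'))
    return text


root = ET.fromstring(reformat_metadata(SAMPLE))
publication_number = root.find('Publication_Number').text
publication_title = root.find('Publication_Title').text
```

Any metadata name that contains a space and is not covered by a replacement would leave an element name that the parser rejects, which is why the list of replacements has to be checked against each new publication. Note that str.replace also rewrites the same phrase if it happens to appear inside a value.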
Ensure the data values match the metadata:
try:
    for page in root.findall('page'):
        with open('objects/' + file + '.csv', 'r') as log :
            readfile = csv.reader(log, delimiter= '\t')
            file_name = ''
            id = ''
            givenname = '[BLANK]'
            surname = '[BLANK]'
            age = '[BLANK]'
            year = '[BLANK]'
            military_unit = '[BLANK]'
            file_size = ''
            file_name = page.get('image-file-name')
            id = page.get('footnote-id')
            if page.find('givenname') is not None:
                givenname = page.find('givenname').text
            if page.find('surname') is not None:
                surname = page.find('surname').text
            if page.find('age') is not None:
                age = page.find('age').text
            if page.find('year') is not None:
                year = page.find('year').text
            if page.find('military-unit') is not None:
                military_unit = page.find('military-unit').text
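The pattern above (default every field to '[BLANK]', then overwrite it when the element exists) can be tried on a sample page element. The sketch below is Python 3; the helper name field_text and the sample page are hypothetical. Unlike the code above, the helper also falls back to the default when an element is present but empty, which is an added assumption rather than the script's behavior.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; real partner metadata may use other fields.
SAMPLE_PAGE = (
    '<page image-file-name="0001.jpg" footnote-id="12345">'
    '<givenname>John</givenname><surname>Smith</surname><age/>'
    '</page>'
)


def field_text(page, tag, default='[BLANK]'):
    """Return the text of child element `tag`, or `default` if missing or empty."""
    element = page.find(tag)
    if element is None or element.text is None:
        return default
    return element.text


page = ET.fromstring(SAMPLE_PAGE)
record = {
    'file_name': page.get('image-file-name'),
    'id': page.get('footnote-id'),
    'givenname': field_text(page, 'givenname'),
    'surname': field_text(page, 'surname'),
    'age': field_text(page, 'age'),
    'year': field_text(page, 'year'),
    'military_unit': field_text(page, 'military-unit'),
}
```

When adapting the script to a new publication, the tag names passed to page.find are the values to compare against the partner metadata.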
Ensure the CSV column indexes (the n in row[n]) are accurate:
for row in readfile:
    try:
        if new_file_name == row[7]:
            if file == row[4]:
                file_size = str(row[1])
                file_path = row[0]
                label_flag = row[7]
    except IndexError:
        pass
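One way to check the indexes before a full run is to test the lookup against a couple of rows. The sketch below is Python 3 and uses the same positions as the code above (row[0] for the path, row[1] for the size, row[4] for the roll, row[7] for the file name). The sample rows, the filler values in the unused columns, and the function name find_object are all hypothetical.

```python
import csv
import io

# Hypothetical tab-delimited rows with eight columns each.
SAMPLE_MANIFEST = (
    "lz/example/0001/0001.jpg\t204800\tc2\tc3\t0001\tc5\tc6\t0001.jpg\n"
    "lz/example/0001/0002.jpg\t198000\tc2\tc3\t0001\tc5\tc6\t0002.jpg\n"
)


def find_object(manifest_text, roll_id, file_name):
    """Return path, size and label for the row matching roll and file name."""
    reader = csv.reader(io.StringIO(manifest_text), delimiter='\t')
    for row in reader:
        if len(row) < 8:
            continue  # short row: same effect as the IndexError guard
        if row[7] == file_name and row[4] == roll_id:
            return {
                'file_path': row[0],
                'file_size': str(row[1]),
                'label_flag': row[7],
            }
    return None


match = find_object(SAMPLE_MANIFEST, '0001', '0002.jpg')
```

If the manifest for a new publication has its columns in a different order, a lookup like this returns None for every page, which is a quick signal that the indexes need updating.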
Modify the title string as appropriate:
title = ('[Maryland] ' + surname + ', ' + givenname + ' - Age ' + age + ', Year: ' + year + ' - ' + military_unit).encode('utf-8')
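As a quick check of a modified title string, the same concatenation can be wrapped in a small function and called with sample values. This is a Python 3 sketch; the function name build_title, the state parameter, and the sample values are hypothetical.

```python
def build_title(surname, givenname, age, year, military_unit, state='Maryland'):
    """Assemble a title in the same shape as the string above."""
    return (
        '[' + state + '] ' + surname + ', ' + givenname
        + ' - Age ' + age + ', Year: ' + year + ' - ' + military_unit
    )


title = build_title('Smith', 'John', '25', '1862', '1st Infantry')
```

In Python 3, str values are already Unicode, so the .encode('utf-8') call is only needed at the point where bytes are written out.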
Update the microfilm publication information:
<microformPublicationArray><microformPublication><note>The start of this file can be found on Roll """ + str(roll) + """.</note><publication><termName>M384 - Compiled Service Records of Volunteer Union Soldiers Who Served in Organizations From the State of Maryland.</termName></publication></microformPublication></microformPublicationArray>
file = 'm384-import-1.xml'
The following files must be in the working directory as they exist here: