pulibrary / figgy

Valkyrie-based digital repository backend.

ingest PPPL technical reports into Figgy #5023

Open cwulfman opened 2 years ago

cwulfman commented 2 years ago

From @escowles

FYI, i've asked Peter Green to create some MARC records for about 1300 PPPL technical reports that need to be ingested — i have the files staged on the Isilon ingest_scratch share in /mnt/hydra_sources/ingest_scratch/pppl_technical_reports/staged — he should give you a mapping of newly-created MARC record MMS IDs to the call numbers, which are the directories there.

the metadata is very thin, but hopefully enough to give basic access to the files (which are being removed from the PPPL website by the end of the month) — the metadata is in a spreadsheet, of course: https://docs.google.com/spreadsheets/d/1lApC21IrsX3BeXyyx0nT8LcW7SQEawyCf4tCsIbQ7co/edit#gid=1262367616

can you ingest the files and send Peter a mapping of MMS IDs to the figgy generated ARKs?

cwulfman commented 2 years ago

Approach:

  1. create new collection
  2. convert spreadsheet into a JSON-formatted SIP
  3. ingest from the SIP using a new rake task.

Method:

  1. created a new collection (e6080f72-4ba5-4e35-a87d-9a4eb826d0e3) from the GUI
  2. used the following Python script to transform the Google Sheet (exported as TSV) into a JSON document:

```python
import csv
import json
import os
import re

tsv_path = "/home/deploy/pppl/records.tsv"
json_path = "/home/deploy/pppl/records.json"
isilon_path = "/mnt/hydra_sources/ingest_scratch/pppl_technical_reports/staged"
collection_id = "e6080f72-4ba5-4e35-a87d-9a4eb826d0e3"

records = []

with open(tsv_path) as f:
    reader = csv.DictReader(f, delimiter="\t")
    for row in reader:
        record = {}
        record['title'] = row['title'].strip()
        # The call number doubles as the local identifier and as the
        # name of the report's staging directory on the Isilon share.
        record['local_identifier'] = row['call number'].strip()
        record['path'] = os.path.join(isilon_path, row['call number'].strip())
        # The authors column is a comma-separated list; strip the
        # leading "and " from the final name.
        authors = []
        for author in row['authors'].split(','):
            author = re.sub(r"^and ", "", author.strip())
            authors.append(author)
        record['creator'] = authors
        record['member_of_collection_ids'] = [collection_id]
        records.append(record)

output = {'records': records}

with open(json_path, 'w') as fp:
    json.dump(output, fp)
```
  3. ran bundle exec rake figgy:import:json FILE=/home/deploy/pppl/records.json
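For reference, here is what the script produces for a single spreadsheet row. The row values below are invented for illustration; the column names, paths, and collection ID match the script above.

```python
import csv
import io
import json
import os
import re

isilon_path = "/mnt/hydra_sources/ingest_scratch/pppl_technical_reports/staged"
collection_id = "e6080f72-4ba5-4e35-a87d-9a4eb826d0e3"

# Invented sample row using the same columns the script reads
# ('title', 'call number', 'authors').
tsv = ("title\tcall number\tauthors\n"
       "Plasma Confinement Study\tPPPL-1234\tA. Smith, B. Jones, and C. Lee\n")

row = next(csv.DictReader(io.StringIO(tsv), delimiter="\t"))

record = {
    'title': row['title'].strip(),
    'local_identifier': row['call number'].strip(),
    'path': os.path.join(isilon_path, row['call number'].strip()),
    # Split on commas and drop the leading "and " from the final name.
    'creator': [re.sub(r"^and ", "", a.strip()) for a in row['authors'].split(',')],
    'member_of_collection_ids': [collection_id],
}

print(json.dumps(record, indent=2))
```

Each record ends up in the `records` array of the JSON SIP, with `path` pointing at the report's staging directory.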
escowles commented 2 years ago

I see these in Figgy as SimpleResources — but I thought the plan was to create MARC records for the metadata, since the expectation was that these would be available in Orangelight. Was there a change of plan? Was Anya Bartelmann OK with the change?

cwulfman commented 2 years ago

I have ingested the items in https://docs.google.com/spreadsheets/d/1lApC21IrsX3BeXyyx0nT8LcW7SQEawyCf4tCsIbQ7co/edit#gid=1262367616 as SimpleResources into https://figgy.princeton.edu/catalog/e6080f72-4ba5-4e35-a87d-9a4eb826d0e3 using the attached JSON file, which comprises data extracted from the spreadsheet tab “new records”.

What I need from @pmgreen is a mapping of local_identifier -> source_identifier.
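Once that mapping arrives, folding it into the SIP could look like the sketch below. This is an assumption-heavy illustration: the two-column TSV layout, the sample MMS ID, and the use of a source_metadata_identifier field on the re-ingested records are all guesses at the eventual shape, not the confirmed plan.

```python
import csv
import io
import json

# Hypothetical mapping file from @pmgreen: call number -> MMS ID.
# Both the column headers and the sample MMS ID are invented.
mapping_tsv = ("local_identifier\tsource_identifier\n"
               "PPPL-1234\t99125000000006421\n")
mapping = {r['local_identifier']: r['source_identifier']
           for r in csv.DictReader(io.StringIO(mapping_tsv), delimiter="\t")}

# Stand-in for the records.json produced by the ingest script.
records = {'records': [
    {'local_identifier': 'PPPL-1234', 'title': 'Plasma Confinement Study'},
]}

# Attach the MMS ID to each record that has a match; the field name
# 'source_metadata_identifier' is an assumption here.
for record in records['records']:
    mms_id = mapping.get(record['local_identifier'])
    if mms_id:
        record['source_metadata_identifier'] = mms_id

print(json.dumps(records, indent=2))
```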

When I have that, I’ll re-ingest the reports as ScannedResources with the updated metadata, and then send you a report with a mapping of source_identifier -> ark, so you can update the MARC records.

Attachment: records.json.zip