pulibrary / figgy

Valkyrie-based digital repository backend.

Ingest WA-WC055 #1701

Closed tpendragon closed 4 years ago

tpendragon commented 6 years ago

Child of #1581. The album has a MARC record; that record has two volumes, and each page in each volume that has photos has a MODS record for every photo.

escowles commented 4 years ago

Based on discussion with Gabriel Swift, it would be better to import metadata from PULFA for the WA photographs than to continue using the forked MODS records. These records should be ingested with the PULFA component ID so that they use imported metadata.

escowles commented 4 years ago

METS files are in the Plum/Figgy/DPUL team drive: https://drive.google.com/drive/u/1/folders/1BQNa-sKpNTj0Ys3jcNT188ovz_4W834O

cwulfman commented 4 years ago

@escowles wrote:

received a request to update one of the western americana items from Sara Logue, who said a user alerted her to a problem with this item: I noticed the catalog description doesn't match the caption on the photo. http://arks.princeton.edu/ark:/88435/n583xv70q

given the discussions we have had about those items, just ingesting WC055 first and taking that item down maybe makes more sense

unfortunately, i'm looking at the data, and the only match point i can find between the EAD (which has the component IDs we need to link to) and the MODS (which has the ARKs we need to update) is this:

from the EAD, c/did/container: Volume Folio 2, Leaf 3, Photograph q

from the MODS, location/holdingSimple/shelfLocator: (WA) WC055, Folio 2, Leaf 3, Photograph q

the MODS file is also pudl0017/wc055/1139550/002/00000005/q.mets, so maybe it's possible to turn the EAD structure into that....
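
The match point described above can be sketched in Python as follows. This is only an illustration of the idea, not the XQuery that was eventually used: the function names, the dictionary shapes, and the component ID `example_component` are hypothetical, and the regular expression assumes every locator follows the "Folio N, Leaf N, Photograph x" pattern seen in the two sample strings.

```python
import re

# Captures folio, leaf and photograph from either the EAD container text
# or the MODS shelfLocator text (pattern assumed from the two samples above).
LOCATOR = re.compile(r"Folio\s+(\d+),\s*Leaf\s+(\d+),\s*Photograph\s+(\w+)")


def locator_key(text):
    """Return (folio, leaf, photograph), or None if the text has no locator."""
    m = LOCATOR.search(text)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), m.group(3)


def match_components(ead_containers, mods_locators):
    """Pair EAD component IDs with METS paths that share a locator key.

    ead_containers: dict of component ID -> c/did/container text
    mods_locators:  dict of METS path -> shelfLocator text
    """
    by_key = {}
    for path, text in mods_locators.items():
        key = locator_key(text)
        if key is not None:
            by_key[key] = path
    matches = {}
    for component_id, text in ead_containers.items():
        key = locator_key(text)
        if key in by_key:
            matches[component_id] = by_key[key]
    return matches


# "example_component" is a placeholder, not a real PULFA component ID.
ead = {"example_component": "Volume Folio 2, Leaf 3, Photograph q"}
mods = {
    "pudl0017/wc055/1139550/002/00000005/q.mets":
        "(WA) WC055, Folio 2, Leaf 3, Photograph q",
}
print(match_components(ead, mods))
```

Matching on the parsed key rather than on the METS path avoids having to guess how folio and leaf numbers map onto the directory names.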

the one thing that's definitely important is to ingest the items with the ARKs (or add them before making the items public) — that is the key to having figgy update the existing ARKs instead of minting duplicate ones

here's the rake task for doing bulk ingest: https://github.com/pulibrary/figgy/blob/master/lib/tasks/bulk.rake#L20 — you likely want to write a custom script or update this rake task to allow also specifying the ARKs

cwulfman commented 4 years ago

As @escowles points out above, the key is the EAD container and the MODS shelfLocator. With a few regular expressions in XQuery, I was able to match every container with a METS and, therefore, a path to an image. Next step is to use that data in a bulk ingest:

  1. Ingest TIFFs
  2. Obtain ark
  3. Update EAD with <dao> elements
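
Step 3 might look something like the sketch below. It rests on several assumptions that are not confirmed in this thread: that the finding aid uses the EAD 2002 schema namespace, that components are plain `<c>` elements carrying the component ID in `@id`, that the `<dao>` belongs inside `<did>`, and that the link should be the resolvable `http://arks.princeton.edu/` form of the ARK. The function name `add_dao` is hypothetical.

```python
import xml.etree.ElementTree as ET

EAD_NS = "urn:isbn:1-931666-22-9"         # EAD 2002 schema namespace (assumed)
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("", EAD_NS)
ET.register_namespace("xlink", XLINK_NS)


def add_dao(root, component_id, ark):
    """Append a <dao> pointing at the ARK to the <did> of the matching <c>.

    Returns True if the component was found and updated, False otherwise.
    """
    for c in root.iter(f"{{{EAD_NS}}}c"):
        if c.get("id") != component_id:
            continue
        did = c.find(f"{{{EAD_NS}}}did")
        if did is None:
            return False
        dao = ET.SubElement(did, f"{{{EAD_NS}}}dao")
        dao.set(f"{{{XLINK_NS}}}href", f"http://arks.princeton.edu/{ark}")
        return True
    return False


# Minimal stand-in for a finding aid; the real EAD has far more structure.
sample = (
    '<ead xmlns="urn:isbn:1-931666-22-9">'
    '<c id="WC055_c0581"><did/></c>'
    "</ead>"
)
root = ET.fromstring(sample)
add_dao(root, "WC055_c0581", "ark:/88435/4j03d0376")
print(ET.tostring(root, encoding="unicode"))
```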

cwulfman commented 4 years ago

Trying to nail down the proper invocation of bulk:ingest. The line below isn't quite right:

rake bulk:ingest DIR=/mnt/diglibdata/pudl/pudl0017/wc055/1139550/002/00000063/m BIB=WC055_c0581 OBJID=ark:/88435/4j03d0376

escowles commented 4 years ago

@cwulfman That invocation looks good to me, other than probably wanting to add COLL=e201720c-e1bb-48eb-854d-5bc0aa6f57c6 to add the items to the WA collection. Are you getting any errors or seeing unexpected results?

cwulfman commented 4 years ago

Adding the COLL and the FILTER ENV variables seems to have solved the problem.

cwulfman commented 4 years ago

All the WC055 assets seem to have been ingested and are pending: https://figgy.princeton.edu/catalog?utf8=✓&q=WC055

cwulfman commented 4 years ago

Invocation in the following form still doesn't work:

bundle exec rake bulk:ingest DIR=/mnt/diglibdata/pudl/pudl0017/wc055/1139550/001/00000001/a BIB=WC055_c0001 OBJID=ark:/88435/g158bj01c COLL=e201720c-e1bb-48eb-854d-5bc0aa6f57c6 FILTER=[*.tif]

MODEL defaults to ScannedResource (which is correct); this ingest isn't replacing any existing ScannedResource, so there should be no value for REPLACES. LOCAL_ID, in the rspec test (ingest_folder_job_spec.rb), is set to a fake Cicognara number (cico:xyz); it isn't clear where that applies here, unless it duplicates the BIB variable (it is the ID of the container).
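
For the bulk run, invocations of the form shown above could be generated one per item with a small helper like the following. This is a sketch only: `bulk_ingest_command` is a hypothetical name, the variable names are simply the ones used in this thread, and the helper prints commands rather than running them. It shell-quotes each value, so the FILTER value is emitted as `'[*.tif]'`; quoting keeps a shell from treating the brackets and asterisk as a glob pattern, though whether that has anything to do with the failure is not established here.

```python
import shlex


def bulk_ingest_command(directory, bib, objid, coll, file_filter):
    """Build a `rake bulk:ingest` command line with shell-quoted values."""
    env = {
        "DIR": directory,
        "BIB": bib,
        "OBJID": objid,
        "COLL": coll,
        "FILTER": file_filter,
    }
    args = " ".join(f"{name}={shlex.quote(value)}" for name, value in env.items())
    return f"bundle exec rake bulk:ingest {args}"


# Values taken from the invocation above.
print(
    bulk_ingest_command(
        "/mnt/diglibdata/pudl/pudl0017/wc055/1139550/001/00000001/a",
        "WC055_c0001",
        "ark:/88435/g158bj01c",
        "e201720c-e1bb-48eb-854d-5bc0aa6f57c6",
        "[*.tif]",
    )
)
```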