pulibrary / figgy

Valkyrie-based digital repository backend.

Ingest "Fitzgerald mss." (pudl0044) #1592

Closed tpendragon closed 3 years ago

tpendragon commented 5 years ago

Notes: MARC, but needs custom watermarks on JP2s - Intermediate TIFF solution?

The intermediate TIFF solution referenced above lets a Scanned Resource have an intermediate TIFF uploaded, from which the JP2s are then generated. In this case the intermediate TIFFs will be watermarked.

tpendragon commented 5 years ago

Files: https://drive.google.com/drive/u/0/folders/0B4Wo5hgOEFY3ZlZGUU5OS3FIbDA

jrgriffiniii commented 5 years ago

Three directories contain TIFF image files with identical file names:

$ ls -lt pudl0044/scrapbook3/foldouts/
total 452608
[...]
-rwxr-xr-x 1 deploy root 17775972 Oct 21  2002 DSC_0003.TIF
-rwxr-xr-x 1 deploy root 17780350 Oct 21  2002 DSC_0002.TIF
-rwxr-xr-x 1 deploy root 17770576 Oct 21  2002 DSC_0001.TIF
$ ls -lt pudl0044/scrapbook7/foldouts/
total 1897472
[...]
-rwxr-xr-x 1 deploy root 17779302 Oct 28  2002 DSC_0080.TIF
-rwxr-xr-x 1 deploy root 17775526 Oct 28  2002 DSC_0079.TIF
-rwxr-xr-x 1 deploy root 17774608 Oct 28  2002 DSC_0078.TIF
$ ls -lt pudl0044/zelda7/foldouts/
total 1880064
[...]
-rwxr-xr-x 1 deploy root 17774532 Oct 21  2002 DSC_0003.TIF
-rwxr-xr-x 1 deploy root 17770976 Oct 21  2002 DSC_0002.TIF
-rwxr-xr-x 1 deploy root 17774386 Oct 21  2002 DSC_0001.TIF

I'm uncertain how this should be reconciled, as the METS in scrapbook.mets uses the same ID (pudl/pudl0044/831958/) for both directories:

<mets:file CHECKSUM="5cd94bd02c37b414b74111248ccf96e0" CHECKSUMTYPE="MD5" ID="ycc7i" MIMETYPE="image/tiff">
<mets:FLocat LOCTYPE="URL" xlink:href="file:///mnt/diglibdata/pudl/pudl0044/831958/scrapbooks/scrapbook_01/00000136.tif"/>
</mets:file>
<mets:file CHECKSUM="f76c81cba59236ae9cbe9bb5e57bfa95" CHECKSUMTYPE="MD5" ID="gbw01" MIMETYPE="image/tiff">
<mets:FLocat LOCTYPE="URL" xlink:href="file:///mnt/diglibdata/pudl/pudl0044/831958/scrapbooks/scrapbook_02/00000001.tif"/>
</mets:file>

After further review on my part, these should likely just be ingested within the directories bearing bib. IDs in the PUDL directory.
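
The file-name collision described above can be checked before ingest with a short script. This is a minimal sketch, not part of Figgy: the method name `colliding_tiff_names` is hypothetical, and it assumes the files are reachable on a local or mounted path and that names should be compared case-insensitively.

```ruby
require "find"

TIFF_EXTENSIONS = %w[.tif .tiff].freeze

# Walk +root+ and group every TIFF by its lower-cased base name, then return
# only the names that appear in more than one directory.
# (Hypothetical helper for illustration; not a Figgy API.)
def colliding_tiff_names(root)
  dirs_by_name = Hash.new { |hash, key| hash[key] = [] }
  Find.find(root) do |path|
    next unless File.file?(path)
    next unless TIFF_EXTENSIONS.include?(File.extname(path).downcase)
    dirs_by_name[File.basename(path).downcase] << File.dirname(path)
  end
  dirs_by_name.select { |_name, dirs| dirs.uniq.size > 1 }
end
```

Calling `colliding_tiff_names("pudl0044")` would return a hash keyed by file name, with the list of directories holding a file of that name as each value.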

jrgriffiniii commented 5 years ago

Reviewing the MODS metadata, I found that the following attributes are not featured within the referenced MARC record (831958):

| Field | Element | Language | Script | XPath | Value | Authorities/Encoding Standards | MARC Liberation JSON-LD Property | MARC Liberation JSON-LD Value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Title | titleInfo | English | Latin | mods:mods/mods:titleInfo/mods:title | Fitzgerald's Trimalchio | | Not present | Not present |
| Alternative Title | titleInfo | English | Latin | mods:mods/mods:titleInfo/mods:title[type="alternative"] | Trimalchio | | Not present | Not present |
| Title | titleInfo | English | Latin | mods:mods/mods:titleInfo/mods:title | Great Gatsby | NAF | Not present | Not present |
| Author | namePart | English | Latin | mods:mods/mods:name/mods:role/mods:roleTerm[text()="aut"]/../../mods:namePart | F. Scott (Francis Scott) Fitzgerald 1896-1940 | | author | Fitzgerald, F. Scott (Francis Scott), 1896-1940 |
| Type of Resource | typeOfResource | English | Latin | mods:mods/mods:typeOfResource | text | | Not present | Not present |
| Date Created | dateCreated | English | Latin | mods:mods/mods:originInfo/mods:dateCreated | 1924-1925 | w3cdtf | date | 1897-1944. |
| Language | language | English | Latin | mods:mods/mods:language/mods:languageTerm | | | language | eng |
| Extent | extent | English | Latin | mods:mods/mods:physicalDescription/mods:extent | | | extent | 44 linear ft. (89 archival boxes, 11 oversize flat cases) |
| Note | note | English | Latin | mods:mods/mods:note | | | Not present | Not present |
| Subject | subject | English | Latin | mods:mods/mods:subject/mods:genre | Manuscripts | LCSH | type | Correspondence, Manuscripts |
| Subject | subject | English | Latin | mods:mods/mods:subject/mods:name/mods:namePart | F. Scott (Francis Scott) Fitzgerald 1896-1940 | LCSH | Not present | Not present |
| Collection | collection | English | Latin | mods:mods/mods:relatedItem[@type="host"]/mods:titleInfo/mods:title | F. Scott Fitzgerald papers, 1897-1944 | | Not present | Not present |
| Use Rights | accessCondition | English | Latin | mods:mods/mods:accessCondition[@type="useAndReproduction"] | Selected items in the F. Scott Fitzgerald Papers can be photoduplicated at the expense of the researcher requesting photoduplication. Advanced estimates and payment are required. For general information on photoduplication and permissions, go to http://www.princeton.edu/~rbsc Requests to to reproduce, publish, or broadcast material from the F. Scott Fitzgerald Papers should be addressed Public Services staff, rbsc@princeton.edu The correct form of citation includes the name of the collection, box and folder numbers, and an indication that the originals are in the "Manuscripts Division, Department of Rare Books and Special Collections, Princeton University Library." The manuscript of The Great Gatsby and other writings of F. Scott Fitzgerald are not to be quoted, published, reproduced, or broadcast without the written permission of the Princeton University Library as owner of the physical object, and of the Fitzgerald Literary Trust (copyright holder), c/o Harold Ober Associates, 425 Madison Avenue, New York, New York 10017 (Telephone: 212-759-8600; FAX: 212-759-9428). The Library is not responsible for copyright infringement or other legal problems resulting from unauthorized publication of the words of F. Scott Fitzgerald. | | Not present | |
| Access Restrictions | accessCondition | English | Latin | mods:mods/mods:accessCondition[@type="restrictionOnAccess"] | For legal and conservation reasons, access to F. Scott Fitzgerald’s original manuscripts (including corrected galleys and scrapbooks) is strictly restricted. Scottie Fitzgerald Lanahan, daughter of F. Scott Fitzgerald and Zelda Fitzgerald, donated the Fitzgerald Papers to the Princeton University Library in 1950, stipulating that surrogates of the original manuscripts were to be made available to researchers instead of the originals. This was done to preserve the originals, which are not on good paper. Originally, the surrogates were in the form of microfilm. A facsimile edition of The Great Gatsby autograph manuscript was published in 1973: The Great Gatsby: A Facsimile of the Manuscript, edited with an introduction by Matthew J. Bruccoli (Washington, D.C.: Microcard Editions Books, 1973). Facsimiles editions of other manuscripts of books and short stories followed a multi-volume series: F. Scott Fitzgerald Manuscripts, edited by Matthew J. Bruccoli and Alan Margolies (New York: Garland Publishing Company, 1990). Complete sets of the facsimile edition are available at more than 50 research libraries (including Firestone Library). The present digital surrogates of The Great Gatsby manuscript and corrected galleys are part of this effort and are being put online, using digital watermarks, with the permission of the Fitzgerald Literary Trust (the Fitzgerald copyright holder), c/o Harold Ober Associates, the New York literary agency. | | Not present | |
| Abstract | abstract | English | Latin | mods:mods/mods:abstract | F. Scott Fitzgerald, This Side of Paradise, autograph manuscripts and corrected typescripts (1917-1919). Fitzgerald began writing This Side of Paradise at Princeton, continued in November 1917 at Fort Leavenworth, Kansas, with the working title "The Romantic Egoist," and completed a first draft of the novel at Cottage Club in March 1918. After this draft had been twice rejected by the New York publisher Charles Scribner's Sons, Fitzgerald returned to his parents' home at 599 Summit Avenue in his native St. Paul, Minnesota, and added five new chapters to the four he had written the previous year. He changed the second title of the novel from "The Education of a Personage" to "This Side of Paradise" and sent the novel to Maxwell Perkins at Scribner's. The publisher accepted the novel on September 16, 1919, and published it on March 26, 1920. The author's corrected galleys and page proofs do not survive. | | Not present | Not present |
| Table of Contents | tableOfContents | English | Latin | mods:mods/mods:tableOfContents | I. This Side of Paradise (1920). | | Not present | Not present |

What is unclear to me is whether this should be resolved by providing the additional information in a separate MARC record and linking to that during ingestion, or by parsing it from the METS/MODS (please see #1705).
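
For the second option, a minimal sketch of reading MODS values with XPath follows. It uses REXML rather than whatever parser Figgy uses, the helper name `mods_values` is hypothetical, and it assumes the records declare the standard MODS v3 namespace (http://www.loc.gov/mods/v3). It is only meant to show how the gaps listed above could be located mechanically, not how #1705 should be implemented.

```ruby
require "rexml/document"

# Prefix-to-URI mapping handed to REXML so the "mods:" prefix in the XPath
# resolves regardless of how the record itself declares the namespace.
MODS_NS = { "mods" => "http://www.loc.gov/mods/v3" }.freeze

# Return the stripped text of every element matching +xpath+ in the MODS
# document +xml+. (Hypothetical helper for illustration; not a Figgy API.)
def mods_values(xml, xpath)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, xpath, MODS_NS).map { |node| node.text.to_s.strip }
end
```

For example, `mods_values(File.read(path), "/mods:mods/mods:titleInfo/mods:title")` would return the title strings, which could then be compared against the values exposed for the MARC record.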

jrgriffiniii commented 5 years ago

Currently, the watermarked TIFF image files are located in the following directory on the Samba network share:

$ ls -lt //[HOST].princeton.edu/pudl/pudl0044/823463/
total 3152896
-rwxr-xr-x 1 deploy www-data 103119252 May 22  2013 00000009.tif

...whereas the original files can be found within a separate directory originals_no_watermark:

$ ls -lt //[HOST].princeton.edu/pudl/pudl0044/originals_no_watermark/823463/
total 3152896
-rwxr-xr-x 1 deploy www-data 111132340 May 14  2013 00000030.tif
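
Before ingesting, it may be worth confirming that every watermarked TIFF has an unwatermarked counterpart. The sketch below assumes the two directories use the same file names for the same page, which the listings above do not confirm; `missing_originals` is a hypothetical helper, not a Figgy task.

```ruby
# List the TIFF file names found directly inside +dir+.
def tiff_names(dir)
  Dir.children(dir).select { |name| File.extname(name).downcase == ".tif" }
end

# Return the names present in +watermarked_dir+ that have no same-named file
# in +originals_dir+. (Hypothetical helper for illustration.)
def missing_originals(watermarked_dir, originals_dir)
  (tiff_names(watermarked_dir) - tiff_names(originals_dir)).sort
end
```

An empty result would mean the two directories line up by file name.
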
jrgriffiniii commented 5 years ago

Running

bundle exec rake bulk:ingest DIR=staged_files/pudl/pudl0044/823463 BIB=823463 REPLACES=pudl0044 COLL=b9436097-6999-475a-a77c-c664b4d67607

followed by

bundle exec rake bulk:ingest_intermediate_files DIR=staged_files/pudl/pudl0044/originals_no_watermark

successfully ingests the material and attaches the unwatermarked TIFF images as intermediate files.

jrgriffiniii commented 5 years ago

Further testing today reveals that the TIFF files are not ingested as intermediate or original files, and that the newly generated intermediate JP2 files are not accessible:


Valkyrie::StorageAdapter::FileNotFound in DownloadsController#show
[...]
    trace("valkyrie.storage.find_by") do |span|
      span.set_tag("param.id", id.to_s)
      storage_adapter.find_by(id: id)
    end
  end
jrgriffiniii commented 5 years ago

These errors now appear only intermittently, and they often disappear when binding.pry is invoked in the Rake task after the jobs have completed. This suggests a race condition, possibly one introduced in #1730.
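
One way to test the timing hypothesis without pausing in a debugger is to retry the lookup a few times with a short delay. This is only a diagnostic sketch and is not presented as the fix for this issue: `find_with_retry` is a hypothetical helper, and the `FileNotFound` class below is a local stand-in for `Valkyrie::StorageAdapter::FileNotFound` so that the example runs without Figgy loaded.

```ruby
# Local stand-in for Valkyrie::StorageAdapter::FileNotFound (assumption: in
# Figgy the real error class would be rescued instead).
class FileNotFound < StandardError; end

# Run the block, retrying when it raises FileNotFound. The wait grows with
# each attempt; the error is re-raised once +attempts+ tries have failed.
def find_with_retry(attempts: 5, base_delay: 0.5)
  tries = 0
  begin
    tries += 1
    yield
  rescue FileNotFound
    raise if tries >= attempts
    sleep(base_delay * tries)
    retry
  end
end
```

If a lookup such as `find_with_retry { storage_adapter.find_by(id: id) }` succeeds on a later attempt, the file was most likely still being written when the first request arrived.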

jrgriffiniii commented 5 years ago

Following the merging of #1913, test materials are ingested as expected when the following Rake tasks are invoked:

bundle exec rake bulk:ingest DIR=staged_files/pudl/pudl0044/originals_no_watermark/823463 BIB=823463 REPLACES=pudl0044 COLL=d2851940-20b5-4ea8-9cb9-216be2738a3c
bundle exec rake bulk:ingest_intermediate_files DIR=staged_files/pudl/pudl0044/
jrgriffiniii commented 5 years ago

These appear to ingest properly in the staging environment; however, the metadata for 831958 appears to be invalid:

[Image: figgy_issues_1592_2]

This is also the case for 831959:

[Image: figgy_issues_1592_0]

Additionally, the derivative generation consistently fails for the member ScannedResource "scrapbooks":

[Image: figgy_issues_1592_1]
cwulfman commented 3 years ago

All have been migrated.