Open mlhale7 opened 1 month ago
According to Repox (which we use for DPLA harvesting), our URL does not exist (I've tried both http and https just in case).
An update:
The harvester (https://github.com/vphill/pyoaiharvester) was not working because https://digitalcollections.lib.utk.edu/ still has HTTP Basic Auth enabled (you need to enter the username and password before you can see the site). Modifying the harvester's code so that the credentials can be passed in via the URL, like http://username:password@example.com/, gets past that issue.
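For anyone who wants to try the same workaround, here is a minimal sketch of the idea in plain Python. It is not the actual change made to pyoaiharvester; the function name `build_authed_request` is made up for illustration, and it only shows one way to turn a `user:password@host` URL into a request carrying a Basic auth header.

```python
import base64
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def build_authed_request(url: str) -> urllib.request.Request:
    """Hypothetical helper: move user:password from the URL into a Basic auth header."""
    parts = urlsplit(url)
    # Rebuild the network location without the userinfo portion.
    host = parts.hostname or ""
    if parts.port:
        host = f"{host}:{parts.port}"
    clean_url = urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
    request = urllib.request.Request(clean_url)
    if parts.username is not None:
        # Basic auth is base64("username:password") in the Authorization header.
        credentials = f"{parts.username}:{parts.password or ''}".encode("utf-8")
        token = base64.b64encode(credentials).decode("ascii")
        request.add_header("Authorization", f"Basic {token}")
    return request
```

The returned object can be handed to `urllib.request.urlopen` in place of a bare URL string. This does nothing about certificate problems, which are a separate matter.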
However, we then quickly ran into a different issue: there seems to be a problem with the SSL certificates.
Details in this report.
Rob said he will address the cert issue. After that is fixed we can try the harvester tool again.
To summarize the findings here:
There are two things at play here that keep pyoaiharvester from working with the production site.
The digitalcollections tenant has basic auth turned on (the pop-up where you are prompted for the username/password). The pyoaiharvester tool does not allow passing in basic auth through the URL. The assumption is that this will be turned off for launch, so this should not be an issue.
utklibraryoai. We are investigating the specific reasons for this block and assessing any necessary adjustments to our security configuration.
Thanks to @kirkkwang, I was able to successfully pull records in oai_dc and mods format. I'll comment back once I inspect these files more.
This may need to be a separate ticket, but I'm finding some odd records in the OAI. For instance:
<record>
<header>
<identifier>oai:hyku:9a77b15d-554d-4dfc-a49f-09fbcee8118c</identifier>
<datestamp>2024-07-22T23:24:01Z</datestamp>
<setSpec>collection:admin_set/default</setSpec>
</header>
<metadata>
<oai_dc:dc
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<dc:description>OCR for alumnus:1507298774</dc:description>
<dc:title>OCR</dc:title>
</oai_dc:dc>
</metadata>
</record>
AND
<record>
<header>
<identifier>oai:hyku:2f4595f1-7611-490a-b4b6-94336055d037</identifier>
<datestamp>2024-07-22T23:23:58Z</datestamp>
<setSpec>collection:admin_set/default</setSpec>
</header>
<metadata>
<oai_dc:dc
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<dc:description>TRANSCRIPT for jsevier:9</dc:description>
<dc:title>TRANSCRIPT</dc:title>
</oai_dc:dc>
</metadata>
</record>
<record>
<header>
<identifier>oai:hyku:f47c15e5-5055-44b7-9b55-526b3e3bfc68</identifier>
<datestamp>2024-07-22T23:23:58Z</datestamp>
<setSpec>collection:admin_set/default</setSpec>
</header>
<metadata>
<oai_dc:dc
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<dc:description>TEI for jsevier:9</dc:description>
<dc:title>TEI</dc:title>
</oai_dc:dc>
</metadata>
</record>
<record>
<header>
<identifier>oai:hyku:1211e8b1-040f-439d-b5d0-f1ad54f42e8d</identifier>
<datestamp>2024-07-22T23:23:59Z</datestamp>
<setSpec>collection:admin_set/default</setSpec>
</header>
<metadata>
<oai_dc:dc
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<dc:description>OBJ for jsevier:25</dc:description>
<dc:title>OBJ</dc:title>
</oai_dc:dc>
</metadata>
</record>
<record>
<header>
<identifier>oai:hyku:72f7889f-da05-4363-a42a-c74354351672</identifier>
<datestamp>2024-07-22T23:24:00Z</datestamp>
<setSpec>collection:admin_set/default</setSpec>
</header>
<metadata>
<oai_dc:dc
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<dc:description>OBJ for jsevier:2</dc:description>
<dc:title>OBJ</dc:title>
</oai_dc:dc>
</metadata>
</record>
I'm including @laritakr here for informational purposes. None of these types of resources (OBJ, Transcript, OCR, etc.) should be present in OAI. We just want the main record. Getting rid of all of these extra records would also make pulling OAI a lot faster. Ultimately, if these extra attachment records are not removed, I would need to find a way to exclude all of them for DPLA ingests etc., since this information is not needed.
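As a stopgap on the harvesting side, something like the following could drop these records after the pull. This is only a sketch built on an assumption drawn from the examples above: that an attachment record has a dc:title that is a bare label (OBJ, OCR, etc.) and a dc:description of the form "LABEL for ...". The label list and the function names are illustrative, not anything defined by Hyku.

```python
import xml.etree.ElementTree as ET

NS = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Assumed label list, taken only from the examples seen in this thread.
ATTACHMENT_TITLES = {"OBJ", "OCR", "HOCR", "TRANSCRIPT", "TEI", "PRESERVE", "MODS"}


def is_attachment(record: ET.Element) -> bool:
    """Heuristic: title is a bare label and description reads '<LABEL> for ...'."""
    title = record.findtext(".//dc:title", default="", namespaces=NS).strip()
    description = record.findtext(".//dc:description", default="", namespaces=NS).strip()
    return title in ATTACHMENT_TITLES and description.startswith(f"{title} for ")


def keep_main_records(records):
    """Return only the records that do not look like attachments."""
    return [record for record in records if not is_attachment(record)]
```

This would not make the harvest itself any faster, since every record still has to be pulled before it can be filtered.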
Here's another odd record:
<record>
<header>
<identifier>oai:hyku:24ac22aa-106b-4a88-a346-9e264d13d972</identifier>
<datestamp>2023-09-01T03:59:26Z</datestamp>
<setSpec>collection:admin_set/default</setSpec>
</header>
<metadata>
<oai_dc:dc
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<dc:publisher>utk</dc:publisher>
<dc:rights>http://rightsstatements.org/vocab/InC/1.0/</dc:rights>
<dc:title>504 error (shana)</dc:title>
</oai_dc:dc>
</metadata>
</record>
HOCR also is not something we want a record for:
<record><header><identifier>oai:hyku:ccb61e6c-0c22-471c-9a22-e4dfe2953c62</identifier><datestamp>2024-07-24T05:13:52Z</datestamp><setSpec>collection:admin_set/default</setSpec></header><metadata><mods version="3.5" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-5.xsd" xmlns="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<titleInfo>
<title>HOCR</title>
</titleInfo>
<identifier type="uuid">ccb61e6c-0c22-471c-9a22-e4dfe2953c62</identifier>
<originInfo/>
<physicalDescription/>
<subject/>
<subject>
<cartographics/>
</subject>
<location>
<url usage="primary" access="object in context">https://digitalcollections.lib.utk.edu/concern/attachments/ccb61e6c-0c22-471c-9a22-e4dfe2953c62</url>
<url access="preview" xlink:href="https://digitalcollections.lib.utk.edu/assets/work-ff055336041c3f7d310ad69109eda4a887b16ec501f35afc0a547c4adb97ee72.png"/>
</location>
<recordInfo>
<recordIdentifier>ccb61e6c-0c22-471c-9a22-e4dfe2953c62</recordIdentifier>
<recordOrigin>https://digitalcollections.lib.utk.edu/catalog/oai</recordOrigin>
<recordCreationDate>2024-05-17T19:45:38Z</recordCreationDate>
<recordChangeDate>2024-05-17T21:53:27Z</recordChangeDate>
</recordInfo>
</mods></metadata></record>
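In the MODS output, the primary URL of this record points at /concern/attachments/, which might be a more reliable signal than the title. A rough sketch of checking for that when post-processing a harvest is below; it assumes that every attachment record carries such a URL, which has only been observed on this one record, and the function name is made up.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Path segment seen in the HOCR record's primary URL.
ATTACHMENT_PATH = "/concern/attachments/"


def is_attachment_mods(record: ET.Element) -> bool:
    """Return True if any MODS location URL in the record points at an attachment."""
    for url in record.iterfind(".//mods:location/mods:url", MODS_NS):
        # The preview URL has no text (it uses xlink:href), so guard against None.
        if ATTACHMENT_PATH in (url.text or ""):
            return True
    return False
```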
Restricting the types of works that show in your OAI feed should be a new ticket, as it is separate from the requirements of this ticket.
This is due to the way the child works are created to allow additional metadata for file sets in your repo. We will need to identify which specific information we need to exclude and override standard OAI behavior.
@kirkkwang - I was able to pull both MODS and DC. Given that all of the sets have to be pulled each time, right now the time needed to get OAI is a bit restrictive, but this will be addressed when the ability to pull separate collections is added (#680). I approve the work completed in this ticket.
Great! Thank you, @mlhale7.
Story
ref. #665 I am unable to pull metadata using OAI-PMH with my regular tools (https://github.com/vphill/pyoaiharvester) and I do not see individual records when navigating the feed (https://digitalcollections.lib.utk.edu/catalog/oai) in the browser. There appears to be an identifier issue that is causing no records to be retrievable.
Acceptance Criteria
Screenshots / Video
When I use pyoaiharvester, I get a ZeroDivisionError. Here's a screenshot showing the command and error:
When I click on "oai_dc" in the browser to retrieve an individual record, I get the error "idDoesNotExist." Here's a screenshot:
Finally, looking in the browser at https://digitalcollections.lib.utk.edu/catalog/oai, I am seeing records for attachments that I would not expect (PRESERVE, MODS, etc.). We just want a single record to appear for each digital asset.
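For reference, idDoesNotExist is one of the standard OAI-PMH error codes, returned in an `<error code="...">` element of the response. A small sketch of pulling those codes out of a response body without the browser is below. The function name `oai_errors` is made up, and fetching the response is left out because the site is currently behind basic auth.

```python
import xml.etree.ElementTree as ET

# Namespace of the OAI-PMH 2.0 response envelope.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def oai_errors(response_xml: str):
    """Return (code, message) pairs for each <error> in an OAI-PMH response."""
    root = ET.fromstring(response_xml)
    return [
        (error.get("code", ""), (error.text or "").strip())
        for error in root.iterfind("oai:error", OAI_NS)
    ]
```

An empty list means the response contained no protocol-level error, which says nothing about whether the records themselves look right.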
Testing Instructions and Sample Files
-
Notes
A conjecture - potentially this issue was introduced when we changed the URL to "digitalcollections.lib.utk.edu"?