scientist-softserv / utk-hyku

Ability to pull OAI-PMH metadata (and view records) #664

Open mlhale7 opened 1 month ago

mlhale7 commented 1 month ago

Story

Ref. #665. I am unable to pull metadata using OAI-PMH with my regular tool (https://github.com/vphill/pyoaiharvester), and I do not see individual records when navigating the feed (https://digitalcollections.lib.utk.edu/catalog/oai) in the browser. There appears to be an identifier issue that prevents any records from being retrieved.

Acceptance Criteria

Screenshots / Video

When I use pyoaiharvester, I get a ZeroDivisionError. Here's a screenshot showing the command and error:

commandforHykuOAIwithErrors

When I click on "oai_dc" in the browser to retrieve an individual record, I get the error "idDoesNotExist." Here's a screenshot:

Screenshot 2024-07-23 at 11 45 25 AM

Finally, looking in the browser at https://digitalcollections.lib.utk.edu/catalog/oai, I am seeing records for attachments that I would not expect (PRESERVE, MODS, etc.). We just want a single record to appear for each digital asset.

Screenshot 2024-07-23 at 12 51 54 PM

Testing Instructions and Sample Files

-

Notes

A conjecture: this issue may have been introduced when we changed the URL to "digitalcollections.lib.utk.edu".

mlhale7 commented 1 month ago

According to Repox (which we use for DPLA harvesting), our URL does not exist (I've tried both http and https just in case).

Screenshot 2024-07-23 at 1 05 20 PM

kirkkwang commented 1 month ago

An update:

The harvester (https://github.com/vphill/pyoaiharvester) was not working as-is because https://digitalcollections.lib.utk.edu/ still has HTTP auth enabled (you need to enter a username and password before you can see the site). Modifying the harvester's code so that the credentials can be passed in via the URL, like http://username:password@example.com/, gets past that issue.

However, we are quickly met with a different issue. There seems to be an issue with the SSL certificates.

Image

Details in this report.

Rob said he will address the cert issue. After that is fixed we can try the harvester tool again.

kirkkwang commented 1 month ago

To summarize the findings here:

There are two things at play here that keep pyoaiharvester from working with the production site.

  1. The digitalcollections tenant currently has basic auth turned on (the pop-up that prompts you for a username and password). The pyoaiharvester tool does not allow passing basic auth credentials through the URL. The assumption is that basic auth will be turned off for launch, so this should not be an issue.
  2. We use CrowdSec as part of our security measures. CrowdSec seems to be blocking the User-Agent pyoaiharvester/3.0. This may be because it has been reported as potentially malicious or exhibiting suspicious behavior. A workaround is to change the User-Agent in the script to something like utklibraryoai. We are investigating the specific reasons for this block and assessing any necessary adjustments to our security configuration.
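
For anyone scripting their own requests, the two workarounds above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the change that was made to pyoaiharvester; the function name `build_oai_request` and its parameters are hypothetical, and `utklibraryoai` is simply the example User-Agent suggested above.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def build_oai_request(base_url, verb, user_agent="utklibraryoai",
                      username=None, password=None, **params):
    """Build (but do not send) an OAI-PMH request.

    Hypothetical helper: sets a custom User-Agent and, only if credentials
    are supplied, an HTTP Basic Authorization header.
    """
    # OAI-PMH arguments go in the query string, e.g. verb=ListRecords.
    query = urlencode({"verb": verb, **params})
    request = Request(f"{base_url}?{query}", headers={"User-Agent": user_agent})
    if username is not None and password is not None:
        # Basic auth is base64("username:password") in the Authorization header.
        token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
        request.add_header("Authorization", "Basic " + token.decode("ascii"))
    return request
```

The returned object can be passed to `urllib.request.urlopen` to perform the harvest; once basic auth is turned off, the `username` and `password` arguments can simply be omitted.
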
mlhale7 commented 3 weeks ago

Thanks to @kirkkwang, I was able to successfully pull records in oai_dc and mods format. I'll comment back once I inspect these files more.

mlhale7 commented 3 weeks ago

This may need to be a separate ticket, but I'm finding some odd records in the OAI. For instance:

    <record>
        <header>
            <identifier>oai:hyku:9a77b15d-554d-4dfc-a49f-09fbcee8118c</identifier>
            <datestamp>2024-07-22T23:24:01Z</datestamp>
            <setSpec>collection:admin_set/default</setSpec>
        </header>
        <metadata>
            <oai_dc:dc
                xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
                <dc:description>OCR for alumnus:1507298774</dc:description>
                <dc:title>OCR</dc:title>
            </oai_dc:dc>
        </metadata>
    </record>

AND

    <record>
        <header>
            <identifier>oai:hyku:2f4595f1-7611-490a-b4b6-94336055d037</identifier>
            <datestamp>2024-07-22T23:23:58Z</datestamp>
            <setSpec>collection:admin_set/default</setSpec>
        </header>
        <metadata>
            <oai_dc:dc
                xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
                <dc:description>TRANSCRIPT for jsevier:9</dc:description>
                <dc:title>TRANSCRIPT</dc:title>
            </oai_dc:dc>
        </metadata>
    </record>
    <record>
        <header>
            <identifier>oai:hyku:f47c15e5-5055-44b7-9b55-526b3e3bfc68</identifier>
            <datestamp>2024-07-22T23:23:58Z</datestamp>
            <setSpec>collection:admin_set/default</setSpec>
        </header>
        <metadata>
            <oai_dc:dc
                xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
                <dc:description>TEI for jsevier:9</dc:description>
                <dc:title>TEI</dc:title>
            </oai_dc:dc>
        </metadata>
    </record>
    <record>
        <header>
            <identifier>oai:hyku:1211e8b1-040f-439d-b5d0-f1ad54f42e8d</identifier>
            <datestamp>2024-07-22T23:23:59Z</datestamp>
            <setSpec>collection:admin_set/default</setSpec>
        </header>
        <metadata>
            <oai_dc:dc
                xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
                <dc:description>OBJ for jsevier:25</dc:description>
                <dc:title>OBJ</dc:title>
            </oai_dc:dc>
        </metadata>
    </record>
    <record>
        <header>
            <identifier>oai:hyku:72f7889f-da05-4363-a42a-c74354351672</identifier>
            <datestamp>2024-07-22T23:24:00Z</datestamp>
            <setSpec>collection:admin_set/default</setSpec>
        </header>
        <metadata>
            <oai_dc:dc
                xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
                <dc:description>OBJ for jsevier:2</dc:description>
                <dc:title>OBJ</dc:title>
            </oai_dc:dc>
        </metadata>
    </record>

I'm including @laritakr here for informational purposes. None of these types of resources (OBJ, Transcript, OCR, etc.) should be present in OAI. We just want the main record. Getting rid of all of these extra records would also make pulling OAI a lot faster. Ultimately, if these records are not removed, I would need to find a way to exclude all of the extra attachments for DPLA ingests and similar harvests, since this information is not needed.
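
Until the feed itself is restricted, a harvest-side filter along these lines might serve as a stopgap. It is only a sketch: it assumes attachment records can be recognized by a `dc:title` that matches one of the labels seen in this thread, and the set `ATTACHMENT_TITLES` and the function `main_record_ids` are hypothetical names. A real work whose title happens to be "OCR" or "OBJ" would also be dropped.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Assumption: these labels (taken from the records in this thread) mark attachments.
ATTACHMENT_TITLES = {"OCR", "HOCR", "TRANSCRIPT", "TEI", "OBJ", "PRESERVE", "MODS"}


def main_record_ids(list_records_xml):
    """Return identifiers of oai_dc records whose titles are not attachment labels."""
    root = ET.fromstring(list_records_xml)
    kept = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        titles = {
            (title.text or "").strip().upper()
            for title in record.iter(f"{{{DC_NS}}}title")
        }
        if titles & ATTACHMENT_TITLES:
            continue  # looks like an attachment record, skip it
        identifier = record.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        kept.append(identifier)
    return kept
```

The function expects a full ListRecords response, where `record` elements are in the OAI-PMH namespace; the fragments pasted above would need that wrapper to be parsed this way.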

mlhale7 commented 3 weeks ago

Here's another odd record:

    <record>
        <header>
            <identifier>oai:hyku:24ac22aa-106b-4a88-a346-9e264d13d972</identifier>
            <datestamp>2023-09-01T03:59:26Z</datestamp>
            <setSpec>collection:admin_set/default</setSpec>
        </header>
        <metadata>
            <oai_dc:dc
                xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
                <dc:publisher>utk</dc:publisher>
                <dc:rights>http://rightsstatements.org/vocab/InC/1.0/</dc:rights>
                <dc:title>504 error (shana)</dc:title>
            </oai_dc:dc>
        </metadata>
    </record>
mlhale7 commented 3 weeks ago

HOCR also is not something we want a record for:

    <record>
        <header>
            <identifier>oai:hyku:ccb61e6c-0c22-471c-9a22-e4dfe2953c62</identifier>
            <datestamp>2024-07-24T05:13:52Z</datestamp>
            <setSpec>collection:admin_set/default</setSpec>
        </header>
        <metadata>
            <mods version="3.5"
                xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-5.xsd"
                xmlns="http://www.loc.gov/mods/v3"
                xmlns:xlink="http://www.w3.org/1999/xlink"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
                <titleInfo>
                    <title>HOCR</title>
                </titleInfo>
                <identifier type="uuid">ccb61e6c-0c22-471c-9a22-e4dfe2953c62</identifier>
                <originInfo/>
                <physicalDescription/>
                <subject/>
                <subject>
                    <cartographics/>
                </subject>
                <location>
                    <url usage="primary" access="object in context">https://digitalcollections.lib.utk.edu/concern/attachments/ccb61e6c-0c22-471c-9a22-e4dfe2953c62</url>
                    <url access="preview" xlink:href="https://digitalcollections.lib.utk.edu/assets/work-ff055336041c3f7d310ad69109eda4a887b16ec501f35afc0a547c4adb97ee72.png"/>
                </location>
                <recordInfo>
                    <recordIdentifier>ccb61e6c-0c22-471c-9a22-e4dfe2953c62</recordIdentifier>
                    <recordOrigin>https://digitalcollections.lib.utk.edu/catalog/oai</recordOrigin>
                    <recordCreationDate>2024-05-17T19:45:38Z</recordCreationDate>
                    <recordChangeDate>2024-05-17T21:53:27Z</recordChangeDate>
                </recordInfo>
            </mods>
        </metadata>
    </record>
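
In the MODS output, the primary URL of this record points at `/concern/attachments/`, which may be a more reliable signal than the title. The check below is a rough sketch built on that single observation; whether every attachment record carries such a URL is an assumption, and `is_attachment_mods` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

MODS_NS = "http://www.loc.gov/mods/v3"


def is_attachment_mods(mods_xml):
    """Return True if any MODS <url> text points at a /concern/attachments/ path."""
    root = ET.fromstring(mods_xml)
    for url in root.iter(f"{{{MODS_NS}}}url"):
        # Preview URLs carry their target in xlink:href and have no text; skip those.
        path = urlparse((url.text or "").strip()).path
        if path.startswith("/concern/attachments/"):
            return True
    return False
```

The function takes the serialized `mods` element of a single record, so a harvester would call it once per record and drop those for which it returns True.
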
laritakr commented 3 weeks ago

Restricting the types of works that show in your OAI feed should be a new ticket, as it is separate from the requirements of this ticket.

This is due to the way the child works are created to allow additional metadata for file sets in your repo. We will need to identify which specific information we need to exclude and override standard OAI behavior.

mlhale7 commented 1 week ago

@kirkkwang - I was able to pull both MODS and DC. Given that all of the sets have to be pulled each time, right now the time needed to get OAI is a bit restrictive, but this will be addressed when the ability to pull separate collections is added (#680). I approve the work completed in this ticket.

kirkkwang commented 1 week ago

Great! Thank you, @mlhale7.