sul-dlss / dlme-airflow

This is a new repository to capture the work related to the DLME ETL Pipeline and establish airflow
Apache License 2.0

oai_xml driver has unexpected harvesting behavior with ucla collections #488

Closed jacobthill closed 4 months ago

jacobthill commented 4 months ago

Looking at the UCLA armenian_manuscripts collection: https://digital.library.ucla.edu/catalog/oai?verb=ListRecords&metadataPrefix=oai_dpla&set=member_of_collection_ids_ssim:d2xg9000zz-89112

The records have values for dc:description and dc:identifier, but neither shows up in the harvested data. The harvested data does have an identifier field, but it contains the values from edm:object and edm:isShownAt. The other dc-namespaced values are harvested without specifying a path, and as far as I can tell the other edm-namespaced values aren't harvested at all. I've tried configuring paths for these fields but can't get it to work.

A typical OAI record looks something like this:

    <record><header>
   <identifier>oai:HAL:medihal-00852404v1</identifier>
   <datestamp>2023-06-02</datestamp><setSpec>type:IMG</setSpec>
<setSpec>subject:shs</setSpec>
<setSpec>collection:SHS</setSpec>
<setSpec>collection:CNRS</setSpec>
<setSpec>collection:IFPO</setSpec>
<setSpec>collection:IFPOIMAGES</setSpec>
<setSpec>collection:FRANTIQ</setSpec>
<setSpec>collection:CAMPUS-AAR</setSpec>
<setSpec>collection:AAI</setSpec>
<setSpec>collection:MOYEN_ORIENT_ET_MONDES_MUSULMANS</setSpec>
</header>
<metadata xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/"><oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:tei="http://www.tei-c.org/ns/1.0" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/  http://www.openarchives.org/OAI/2.0/oai_dc.xsd http://purl.org/dc/elements/1.1/  http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd">
  <dc:publisher>HAL CCSD</dc:publisher>
  <dc:title xml:lang="fr">Tombeau de Aailami et Zébida , fragment de porte à caisson (Palmyre, Syrie)</dc:title>
  <dc:creator>(ifpo), Institut Français Du Proche-Orient</dc:creator>
  <dc:contributor>Institut Français du Proche-Orient (IFPO) ; Ministère de l'Europe et des Affaires étrangères (MEAE)-Centre National de la Recherche Scientifique (CNRS)</dc:contributor>
  <dc:identifier>medihal-00852404</dc:identifier>
  <dc:identifier>https://media.hal.science/medihal-00852404</dc:identifier>
  <dc:identifier>https://media.hal.science/medihal-00852404/image</dc:identifier>
  <dc:identifier>https://media.hal.science/medihal-00852404/file/IMG02948_cd329.jpg</dc:identifier>
  <dc:source>https://media.hal.science/medihal-00852404</dc:source>
  <dc:source>Photography. Ifpo-02948, Syria. 1923</dc:source>
  <dc:language>en</dc:language>
  <dc:subject xml:lang="fr">Ifpo</dc:subject>
  <dc:subject xml:lang="fr">Syrie</dc:subject>
  <dc:subject xml:lang="fr">Palmyre</dc:subject>
  <dc:subject xml:lang="fr">Tadmor</dc:subject>
  <dc:subject xml:lang="fr">تدمر</dc:subject>
  <dc:subject xml:lang="fr">sculpture ornementale</dc:subject>
  <dc:subject xml:lang="fr">porte</dc:subject>
  <dc:subject xml:lang="fr">calcaire de Palmyre</dc:subject>
  <dc:subject xml:lang="fr">époque romaine</dc:subject>
  <dc:subject>[SHS.ARCHEO]Humanities and Social Sciences/Archaeology and Prehistory</dc:subject>
  <dc:type>info:eu-repo/semantics/other</dc:type>
  <dc:type>Photos</dc:type>
  <dc:description xml:lang="fr">
              Époque romaine ; Mission archéologique française à Palmyre. 1923. Gélatine (9x14)
            </dc:description>
  <dc:rights>http://creativecommons.org/licenses/by-nc-nd/</dc:rights>
  <dc:date>1923</dc:date>
  <dc:rights>info:eu-repo/semantics/OpenAccess</dc:rights>
</oai_dc:dc>
</metadata></record>

The ucla records have an additional element under metadata and above the data fields:

    <record>
      <header>
        <identifier>oai:library.ucla.edu:ark:/21198/zz0026w3dw</identifier>
        <datestamp>2023-04-27T20:22:01Z</datestamp>
        <setSpec>member_of_collection_ids_ssim:d2xg9000zz-89112</setSpec>
      </header>
      <metadata>
        <oai_dpla:dpla
          xmlns:oai_dpla="https://digital.library.ucla.edu/oai_dpla/"
          xmlns:dc="http://purl.org/dc/elements/1.1/"
          xmlns:dcterms="http://purl.org/dc/terms/"
          xmlns:edm="http://www.europeana.eu/schemas/edm/"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="https://digital.library.ucla.edu/oai_dpla/ https://digital.library.ucla.edu/oai_dpla/oai_dpla.xsd">
          <dc:source>Armenian Manuscripts</dc:source>
          <dc:title>Manuscript No. 82: Fragments of a Menologium</dc:title>
          <dc:type>http://id.loc.gov/vocabulary/resourceTypes/txt</dc:type>
          <dc:rights>UCLA Library Special Collections, A1713 Charles E. Young Research Library, Box 951575, Los Angeles, CA 90095-1575. Email: spec-coll@library.ucla.edu. Phone: (310) 825-4936</dc:rights>
          <dc:language>arm</dc:language>
          <dc:identifier>Armenian MS 82</dc:identifier>
          <dc:identifier>ark:/21198/zz0026w3dw</dc:identifier>
          <edm:object>https://iiif.library.ucla.edu/iiif/2/ark%3A%2F21198%2Fzz00291z5c/full/full/0/default.jpg</edm:object>
          <edm:isShownAt>https://digital.library.ucla.edu/catalog/ark:/21198/zz0026w3dw</edm:isShownAt>
          <edm:dataProvider>University of California, Los Angeles. Library. Department of Special Collections</edm:dataProvider>
          <edm:hasType>Manuscripts</edm:hasType>
        </oai_dpla:dpla>
      </metadata>
    </record>

This might be causing the unexpected behavior.

edsu commented 4 months ago

It looks like the URL you pasted above is requesting the oai_dpla metadata format?

https://digital.library.ucla.edu/catalog/oai?verb=ListRecords&metadataPrefix=oai_dpla&set=member_of_collection_ids_ssim:d2xg9000zz-89112

I guess it is picking up the wrong metadata_prefix somehow. It should be oai_dc, right?
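
The two metadata flavors in question differ only in the `metadataPrefix` argument of the ListRecords request, so it can help to check which one a given URL actually asks for. A minimal sketch using only the standard library; `list_records_url` and `requested_prefix` are hypothetical helper names for illustration and are not part of dlme-airflow:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint and set name copied from the URL in this issue.
BASE_URL = "https://digital.library.ucla.edu/catalog/oai"
SET_SPEC = "member_of_collection_ids_ssim:d2xg9000zz-89112"


def list_records_url(metadata_prefix: str) -> str:
    """Hypothetical helper: build a ListRecords URL for one metadata flavor."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": SET_SPEC,
    }
    # safe=":" keeps the colon in the set name unescaped, as in the pasted URL.
    return f"{BASE_URL}?{urlencode(params, safe=':')}"


def requested_prefix(url: str) -> str:
    """Hypothetical helper: report which metadataPrefix a URL requests."""
    return parse_qs(urlsplit(url).query)["metadataPrefix"][0]


dpla_url = list_records_url("oai_dpla")
dc_url = list_records_url("oai_dc")
```

Calling `requested_prefix` on a URL copied from the browser or from harvest logs is a quick way to confirm which flavor was actually requested.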

edsu commented 4 months ago

For whatever reason it doesn't look like their oai_dc metadata flavor contains dc:description elements?

If you want to harvest the oai_dpla flavored records I think we will need to update the oai_xml driver so that it's aware of that namespace?

https://github.com/sul-dlss/dlme-airflow/blob/main/dlme_airflow/drivers/oai_xml.py#L14-L18
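
To illustrate what "aware of that namespace" means in practice, here is a minimal sketch of namespace-aware extraction from an oai_dpla record. It is not the driver's code (the linked lines are not reproduced here); it uses the standard library's `xml.etree.ElementTree`, and `NAMESPACES`, `PATHS`, and `extract` are hypothetical names. The namespace URIs and sample values are copied from the UCLA record earlier in this issue.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map; the URIs come from the UCLA record above.
# Paths using a prefix that is missing from this map cannot be resolved.
NAMESPACES = {
    "oai_dpla": "https://digital.library.ucla.edu/oai_dpla/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "edm": "http://www.europeana.eu/schemas/edm/",
}

# Trimmed copy of the UCLA record's metadata element.
SAMPLE = """<metadata>
  <oai_dpla:dpla
      xmlns:oai_dpla="https://digital.library.ucla.edu/oai_dpla/"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:edm="http://www.europeana.eu/schemas/edm/">
    <dc:title>Manuscript No. 82: Fragments of a Menologium</dc:title>
    <dc:identifier>Armenian MS 82</dc:identifier>
    <dc:identifier>ark:/21198/zz0026w3dw</dc:identifier>
    <edm:isShownAt>https://digital.library.ucla.edu/catalog/ark:/21198/zz0026w3dw</edm:isShownAt>
  </oai_dpla:dpla>
</metadata>"""

# Hypothetical field-to-path configuration, one path per output field.
PATHS = {
    "identifier": ".//oai_dpla:dpla/dc:identifier",
    "is_shown_at": ".//oai_dpla:dpla/edm:isShownAt",
}


def extract(metadata_xml: str, paths: dict) -> dict:
    """Return every text value matched by each configured path."""
    root = ET.fromstring(metadata_xml)
    return {
        field: [el.text for el in root.findall(path, NAMESPACES)]
        for field, path in paths.items()
    }


record = extract(SAMPLE, PATHS)
```

With the `oai_dpla` and `edm` prefixes registered, each path resolves against its own namespace, so `dc:identifier` and `edm:isShownAt` end up in separate fields rather than being mixed together.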

jacobthill commented 4 months ago

Good catch, Ed. I tried both but didn't notice the description was only in the oai_dpla version. I will see if I can add that namespace.

jacobthill commented 4 months ago

I got it working, thanks Ed.