sul-dlss / dlme-airflow

This is a new repository to capture the work related to the DLME ETL Pipeline and establish airflow
Apache License 2.0

Harvest info from multiple sources: iiif & oai-pmh #427

Closed jacobthill closed 4 months ago

jacobthill commented 1 year ago

A common pattern with data providers is to have minimal data in iiif manifests and more thorough data in oai-pmh. However, the oai-pmh data lacks the iiif manifest url, forcing us to choose between better discovery (oai-pmh) and a better user experience (iiif). We need a method for harvesting the metadata from oai-pmh and mapping the iiif thumbnail url back to the iiif manifest url. This ticket can be used to track notes on how to approach this.

One possible solution is to run a pre_harvest task in Airflow that hits the iiif collection manifest, opens every manifest, and builds a dictionary (yaml file) that maps thumbnail urls back to iiif manifests. This yaml file can be written to dlme-transform/lib/translation_maps for access during transform. Since it's run as a pre_harvest task, it will be updated every time the DAG runs. This can be reused for every data provider with a similar issue. It's possible that the pre_harvest task can make use of existing drivers so that it is configurable, e.g. by adding something like this to the catalog:

pre_harvest:
  driver: iiif_json
  fields:
    thumbnail:
    manifest:
  yaml_output:
    thumbnail: manifest

This would allow us to construct different kinds of yaml files from any data we harvest in the pre_harvest source. Another pre_harvest use case is harvesting a list of identifiers, e.g. when there is no collection-level manifest, so the catalog field choices should also be compatible with this use case.
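As a rough sketch of the thumbnail-to-manifest map described above (not the dlme-airflow implementation), the following assumes IIIF Presentation 2.x shapes: a collection with a `manifests` list of `@id` entries, and a manifest-level `thumbnail` that is either a URI string or an object with `@id`. The function names and the injected `fetch` callable are hypothetical. The output is written as quoted `key: value` lines, which parse as YAML, so no YAML library is needed for the illustration.

```python
import json
from typing import Callable, Dict, Optional


def thumbnail_url(manifest: dict) -> Optional[str]:
    """Return the manifest-level thumbnail URL, or None if there is none.

    Assumes IIIF Presentation 2.x, where `thumbnail` is either a URI
    string or an object carrying the URI in `@id`.
    """
    thumb = manifest.get("thumbnail")
    if isinstance(thumb, str):
        return thumb
    if isinstance(thumb, dict):
        return thumb.get("@id")
    return None


def build_thumbnail_map(
    collection: dict, fetch: Callable[[str], dict]
) -> Dict[str, str]:
    """Map each thumbnail URL to the URL of the manifest it came from.

    `fetch` is a hypothetical callable that takes a manifest URL and
    returns the parsed manifest JSON (e.g. a thin wrapper around an
    HTTP client, or an existing driver).
    """
    mapping: Dict[str, str] = {}
    for entry in collection.get("manifests", []):
        manifest_url = entry["@id"]
        thumb = thumbnail_url(fetch(manifest_url))
        if thumb:  # skip manifests that expose no thumbnail
            mapping[thumb] = manifest_url
    return mapping


def to_yaml(mapping: Dict[str, str]) -> str:
    """Render the mapping as one quoted `key: value` line per entry.

    JSON-style double-quoted strings are valid YAML scalars, so the
    result can be loaded as a YAML mapping.
    """
    return "".join(
        f"{json.dumps(key)}: {json.dumps(value)}\n"
        for key, value in sorted(mapping.items())
    )
```

Passing `fetch` in as a parameter keeps the mapping logic independent of how manifests are retrieved, which fits the idea of reusing existing drivers for the pre_harvest step.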

Another strategy could be to list the same collection twice in the catalog (once as iiif and once as oai_pmh) and then merge the two dataframes on the thumbnail. This would work similarly to how qnl works now, except it's merging two dataframes instead of two records from one dataframe.
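A minimal sketch of the dataframe merge, assuming pandas and assuming both harvests expose the thumbnail url in a column named `thumbnail` and the iiif harvest has a `manifest` column (these column names and the function name are hypothetical, not taken from the existing catalog):

```python
import pandas as pd


def merge_on_thumbnail(oai_df: pd.DataFrame, iiif_df: pd.DataFrame) -> pd.DataFrame:
    """Attach the iiif manifest url to each oai-pmh record via the thumbnail url.

    Column names `thumbnail` and `manifest` are assumptions for this sketch.
    """
    # Keep only the join key and the manifest url, one row per thumbnail,
    # so the merge cannot multiply oai-pmh records.
    manifests = iiif_df[["thumbnail", "manifest"]].drop_duplicates(subset="thumbnail")
    # A left merge keeps every oai-pmh record; records with no matching
    # thumbnail get a missing value in the `manifest` column.
    return oai_df.merge(manifests, on="thumbnail", how="left", validate="many_to_one")
```

The left join is a design choice here: it keeps the richer oai-pmh metadata for every record, so records without a matching manifest remain discoverable.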

Similar Collections: