A common pattern with data providers is to have minimal data in IIIF manifests and more thorough data in OAI-PMH. However, the OAI-PMH data lacks the IIIF manifest, forcing us to choose between better discovery (OAI-PMH) and a better user experience (IIIF). We need a method for harvesting the metadata from OAI-PMH and mapping the IIIF thumbnail URL back to the IIIF manifest URL. This ticket can be used to track notes on how to approach this.
One possible solution is to run a pre_harvest task in Airflow that hits the IIIF collection manifest, opens every manifest, and builds a dictionary (YAML file) that maps thumbnail URLs back to IIIF manifests. This YAML file can be written to dlme-transform/lib/translation_maps for access during transform. Since it's run as a pre_harvest task, it will be updated every time the DAG runs. This can be reused for every data provider with a similar issue. It's possible that the pre_harvest task can make use of existing drivers so that it can be configurable, e.g. add something like this to the catalog:
This would allow us to construct different kinds of YAML files from any data we harvest in the pre-harvest source. Another pre_harvest task use case is to harvest a list of identifiers, e.g. when there is no collection-level manifest, so the catalog field choices should also be compatible with this use case.
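The pre_harvest mapping step described above could be sketched roughly as follows. This is a minimal sketch, not the actual task: the function names, the injected `fetch_json` callable, and the output file name are all hypothetical, and it assumes manifests expose `thumbnail` either as a plain string, as an object with `@id`/`id`, or as a list of such objects. Nested collections are skipped rather than crawled.

```python
import json


def manifest_urls(collection):
    """Return manifest URLs listed in a IIIF collection document.

    Looks at 'manifests' (Presentation API 2) or 'items' (Presentation API 3).
    Nested collections are skipped in this sketch.
    """
    entries = collection.get("manifests") or collection.get("items") or []
    urls = []
    for entry in entries:
        kind = entry.get("type") or entry.get("@type")
        if kind in ("Collection", "sc:Collection"):
            continue
        url = entry.get("@id") or entry.get("id")
        if url:
            urls.append(url)
    return urls


def thumbnail_url(manifest):
    """Return the manifest-level thumbnail URL, or None if there is none."""
    thumb = manifest.get("thumbnail")
    if isinstance(thumb, list):
        thumb = thumb[0] if thumb else None
    if isinstance(thumb, dict):
        return thumb.get("@id") or thumb.get("id")
    return thumb  # already a string, or None


def build_thumbnail_map(collection_url, fetch_json):
    """Map thumbnail URL -> manifest URL for every manifest in the collection.

    fetch_json is any callable that takes a URL and returns parsed JSON
    (hypothetical hook, so the HTTP client can be swapped or mocked).
    """
    mapping = {}
    for url in manifest_urls(fetch_json(collection_url)):
        thumb = thumbnail_url(fetch_json(url))
        if thumb:
            mapping[thumb] = url
    return mapping


def write_translation_map(mapping, path):
    """Write the mapping as flat 'key: value' YAML lines.

    json.dumps is used only to get safely quoted strings; a JSON string
    is also a valid double-quoted YAML scalar.
    """
    with open(path, "w", encoding="utf-8") as fh:
        for thumb, manifest in sorted(mapping.items()):
            fh.write(f"{json.dumps(thumb)}: {json.dumps(manifest)}\n")
```

In the Airflow task, `fetch_json` could be something like `lambda url: requests.get(url, timeout=30).json()`, and the output path would point at a file under dlme-transform/lib/translation_maps. Injecting the fetcher keeps the mapping logic testable without network access.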
Another strategy could be to list the same collection twice in the catalog (one IIIF and one oai_pmh) and then merge the two dataframes together on the thumbnail. This would work similarly to how qnl works now, except it's merging two dataframes instead of two records from one dataframe.
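A rough sketch of the dataframe merge, assuming pandas and assuming both harvests end up with a shared thumbnail column. The column names `thumbnail` and `manifest` and the function name are hypothetical, not the existing schema:

```python
import pandas as pd


def merge_on_thumbnail(oai_df, iiif_df, thumb_col="thumbnail", manifest_col="manifest"):
    """Attach the IIIF manifest URL to each OAI-PMH record by matching thumbnails.

    A left join keeps every OAI-PMH record, including ones with no matching
    manifest (those get a missing value in manifest_col).
    """
    # Keep only the lookup columns, one row per thumbnail, so the join
    # cannot multiply OAI-PMH rows.
    lookup = iiif_df[[thumb_col, manifest_col]].drop_duplicates(subset=thumb_col)
    return oai_df.merge(lookup, on=thumb_col, how="left", validate="many_to_one")
```

The left join is a deliberate choice here: the OAI-PMH records carry the richer metadata, so a record without a matching manifest is kept rather than dropped.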
Similar Collections:
AUB (ACO) metadata is harvested through AUB's OAI-PMH server, but NYU has an IIIF manifest on their ACO site. Need to investigate, but I assume this pattern will work.