ucldc / rikolti

calisphere harvester 2.0
BSD 3-Clause "New" or "Revised" License

Spec out CDL ETL’ing of Solr records to ElasticSearch #618

Closed christinklez closed 7 months ago

amywieliczka commented 8 months ago

This would need to:

1) run a paginated Solr query for all objects in a particular collection, effectively the same query that powers this page: https://calisphere.org/collections/26233/
2) stash these "fetched" pages to s3?
3) for each page of Solr objects, map the fields to the same fields in OpenSearch (for the most part, this is a direct mapping)
4) stash these "mapped" pages to s3?
5) migrate thumbnails from the legacy s3 location (in the dsc acct) to the new s3 thumbnail location (in the pad acct)
6) for each page of mapped metadata, bulk index to OpenSearch
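Step 1 above could be sketched roughly as follows. This is a hedged sketch, not Rikolti code: `query_solr` is a hypothetical stand-in for however we issue Solr requests, and the `collection_url` query field is an assumption about the Solr schema.

```python
from typing import Callable, Iterator

def fetch_collection_pages(
    collection_id: str,
    query_solr: Callable[[dict], dict],
    rows: int = 100,
) -> Iterator[list]:
    """Yield pages of Solr docs for one collection using start/rows paging.

    `query_solr` is a stand-in for however we issue Solr requests; it takes
    Solr query params and returns the parsed JSON response body.
    """
    start = 0
    while True:
        resp = query_solr({
            # assumed query field; the real schema would drive this
            "q": f'collection_url:"https://registry.cdlib.org/api/v1/collection/{collection_id}/"',
            "start": start,
            "rows": rows,
            "wt": "json",
        })
        docs = resp["response"]["docs"]
        if not docs:
            break
        yield docs  # each yielded page could then be stashed to s3 (step 2)
        start += rows
        if start >= resp["response"]["numFound"]:
            break
```

For deep collections, Solr's cursorMark pagination would be more efficient than start/rows, but the shape of the loop is the same.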

We could actually reuse many components of the Rikolti pipeline for this:

- Update the collection's harvest type field to a value of "Rikolti ETL" (no need to update the other fetching-related fields).
- Write a CalisphereSolrFetcher that queries our Solr index given a collection ID from the registry.
- Update the collection's enrichment items to include exactly one item: /dpla-mapper?mapper_type=rikolti_etl.
- Write a CalisphereSolrMapper that implements the straightforward mapping.
- Write a new MigrateThumbnails task that reads every page of mapped metadata, finds each thumbnail in our s3 bucket at the md5 listed in the metadata, and copies it over to the new s3 bucket.
- Finally, use the existing create-new-index task to load records into OpenSearch.
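Since the mapping is mostly direct, the per-record core of a CalisphereSolrMapper might be little more than a field-rename table. The field names below are hypothetical; the real Solr and OpenSearch schemas would drive this.

```python
# Hypothetical field names — the real Solr and OpenSearch schemas would
# supply this table; most fields carry over one-to-one.
SOLR_TO_OPENSEARCH = {
    "title_ss": "title",
    "date_ss": "date",
    "reference_image_md5": "thumbnail_md5",
}

def map_solr_record(solr_doc: dict) -> dict:
    """Sketch of what a CalisphereSolrMapper's per-record mapping might do:
    rename each known Solr field to its OpenSearch counterpart, dropping
    fields that aren't in the table."""
    return {
        dest: solr_doc[src]
        for src, dest in SOLR_TO_OPENSEARCH.items()
        if src in solr_doc
    }
```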

We already have machinery for running the fetching and mapping by mapper type (rather than by collection). We could create a new DAG that fetches, maps, migrates thumbnails, and loads by mapper type (rather than by collection) to bulk-process all of these collections, or perhaps 10-20 collections at a time.
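The "10-20 collections at a time" batching could be a simple chunking step feeding the DAG, along these lines (a sketch; how the DAG consumes batches is up to us):

```python
from itertools import islice
from typing import Iterable, Iterator

def batch_collections(
    collection_ids: Iterable[str], size: int = 20
) -> Iterator[list]:
    """Split the full set of collection IDs into batches of `size` so a
    by-mapper-type DAG run can process e.g. 10-20 collections at a time."""
    it = iter(collection_ids)
    # islice pulls the next `size` IDs; an empty list ends the loop
    while batch := list(islice(it, size)):
        yield batch
```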

Airflow has access to Solr, to the Rikolti s3 buckets, and to the Rikolti OpenSearch instance. The thumbnail s3 bucket is actually open to the public, so Airflow should be able to access it as well.
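The per-thumbnail work in the MigrateThumbnails task could be a server-side s3 copy keyed by the md5 from the mapped metadata. A minimal sketch, assuming a boto3 S3 client; both bucket names here are hypothetical placeholders:

```python
def migrate_thumbnail(
    s3_client,
    md5: str,
    src_bucket: str = "legacy-thumbnails",    # hypothetical (dsc acct)
    dest_bucket: str = "rikolti-thumbnails",  # hypothetical (pad acct)
) -> None:
    """Copy one thumbnail, keyed by its md5, from the legacy bucket to the
    new bucket. `s3_client` is a boto3 S3 client; copy_object performs a
    server-side copy, so the bytes never pass through Airflow."""
    s3_client.copy_object(
        CopySource={"Bucket": src_bucket, "Key": md5},
        Bucket=dest_bucket,
        Key=md5,
    )
```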

Let's discuss this high-level approach in 2024, maybe after we get the publish-to-prod DAG complete.