sul-dlss / dlme-airflow

This is a new repository to capture the work related to the DLME ETL Pipeline and establish airflow
Apache License 2.0

Airflow DAG for creating empty serial records #578

Open jacobthill opened 1 week ago

jacobthill commented 1 week ago

We want to build a unified serials catalog. One use case it needs to support is letting us (and data providers) know which issues were printed but are not digitally available. This will help providers prioritize digitization work: digitizing serials is complicated, and they don't want to digitize an issue that another data provider has already digitized. I'm open to ideas on how to implement this, but it needs to be flexible since our knowledge of which issues were printed will change as our partners do more research.

One thought on how to implement this is to create a new field in the IR record, e.g. placeholder. We could then read from a file listing which issues were printed and build a placeholder record for each issue. The placeholder record could have the placeholder field, title, record id (we will build these in a predictable way), issue number, and a note indicating that the digital record is not present and asking readers to let us know if it exists online somewhere or if they have a physical copy they would be willing to digitize. When we load an actual digital copy of one of the issues, it will overwrite the placeholder record because it will have the same id.
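To make the idea concrete, here is a minimal sketch of building a placeholder record with a predictable id. Everything named here is an assumption for illustration only: the id scheme, the field names other than placeholder, the wording of the note, and the "example-serial" data are hypothetical, and a plain dict stands in for the app's index.

```python
import re

# Hypothetical wording; the real note text is still to be decided.
PLACEHOLDER_NOTE = (
    "No digital copy of this issue is available yet. Please let us know if it "
    "exists online or if you have a physical copy you would be willing to digitize."
)


def build_record_id(serial_slug: str, issue_number: str) -> str:
    """Build a predictable id from the serial and issue number (assumed scheme)."""
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    normalized = re.sub(r"[^a-z0-9]+", "-", issue_number.lower()).strip("-")
    return f"{serial_slug}_issue-{normalized}"


def build_placeholder_record(serial_slug: str, title: str, issue_number: str) -> dict:
    """Build one placeholder record for a printed issue with no digital copy."""
    return {
        "id": build_record_id(serial_slug, issue_number),
        "title": title,
        "issue_number": issue_number,
        "placeholder": True,
        "note": PLACEHOLDER_NOTE,
    }


# A dict keyed by record id stands in for the app's index (assumption).
index = {}

placeholder = build_placeholder_record("example-serial", "Example Serial", "12")
index[placeholder["id"]] = placeholder

# A real digital record built with the same id scheme replaces the placeholder.
digital = {
    "id": build_record_id("example-serial", "12"),
    "title": "Example Serial",
    "issue_number": "12",
    "url": "https://example.org/example-serial/12",
}
index[digital["id"]] = digital
```

Because both records are keyed by the same id, loading the digital record leaves exactly one entry for that issue, and that entry no longer carries the placeholder field.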

We will need a DAG for loading these placeholder records that would look something like:

delete_all_placeholder_records_in_app >> refresh_model_of_expected_placeholder_records >> build_placeholder_records >> index_placeholder_records

delete_all_placeholder_records_in_app: this would clear existing placeholder records from the app e.g. find all records with the placeholder field and delete them. This is important because our knowledge of what issues were printed will change over time and we need the app to reflect that changing knowledge.

refresh_model_of_expected_placeholder_records: this will update our model of which issues were printed.

build_placeholder_records: this will build new placeholder records from the refreshed model.

index_placeholder_records: this will index the new placeholder records in the app.
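The four steps above can be sketched as plain Python functions run against an in-memory stand-in for the app's index. This is only an illustration under stated assumptions: the CSV layout (serial, title, issue), the id scheme, and the dict-based index are hypothetical. It also assumes that build_placeholder_records skips any id that already has a digital record, since otherwise re-indexing placeholders would overwrite real records that share the same id; the issue does not specify this behavior.

```python
import csv
import io


def delete_all_placeholder_records(index: dict) -> dict:
    """Drop every record that carries the placeholder field."""
    return {rid: rec for rid, rec in index.items() if not rec.get("placeholder")}


def refresh_expected_issues(source_text: str) -> list:
    """Re-read the list of printed issues (assumed CSV: serial,title,issue)."""
    return list(csv.DictReader(io.StringIO(source_text)))


def build_placeholder_records(expected: list, index: dict) -> list:
    """Build placeholders only for issues without a digital record (assumption)."""
    records = []
    for row in expected:
        record_id = f"{row['serial']}_{row['issue']}"  # hypothetical id scheme
        if record_id in index:
            continue  # a digital copy is already indexed under this id
        records.append(
            {
                "id": record_id,
                "title": row["title"],
                "issue": row["issue"],
                "placeholder": True,
            }
        )
    return records


def index_placeholder_records(index: dict, records: list) -> dict:
    """Add the new placeholder records to the index."""
    updated = dict(index)
    for record in records:
        updated[record["id"]] = record
    return updated


def run_pipeline(index: dict, source_text: str) -> dict:
    """Run the four steps in the order the DAG would run them."""
    index = delete_all_placeholder_records(index)
    expected = refresh_expected_issues(source_text)
    placeholders = build_placeholder_records(expected, index)
    return index_placeholder_records(index, placeholders)


# Issue 1 has a digital copy; issue 9 is a stale placeholder we no longer expect.
index = {
    "example-serial_1": {"id": "example-serial_1", "title": "Example Serial", "issue": "1"},
    "example-serial_9": {"id": "example-serial_9", "title": "Example Serial", "issue": "9", "placeholder": True},
}
source = "serial,title,issue\nexample-serial,Example Serial,1\nexample-serial,Example Serial,2\n"

index = run_pipeline(index, source)
```

After the run, the stale placeholder for issue 9 is gone, the digital record for issue 1 is untouched, and issue 2 has a new placeholder. In the actual DAG, each function would presumably become one task, with the tasks chained in the order shown in the pipeline line.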