sul-dlss / dlme-airflow

This is a new repository to capture the work related to the DLME ETL Pipeline and establish airflow
Apache License 2.0

Automate harvesting of OPenn collections #526

Open jacobthill opened 2 months ago

jacobthill commented 2 months ago

OPenn is one of the data providers that has not yet been automated to run in Airflow. None of our current drivers will work with it; see the technical documentation: https://openn.library.upenn.edu/TechnicalReadMe.html

Essentially, we need our driver to navigate this HTML page and grab the links to the TEI documents for each collection. As far as I can tell, the technical documentation above does not provide a way to list all of the records in each of the Manuscripts of the Muslim World collections. Without that, the only way I can see to keep DLME synced with OPenn is to parse the above HTML page: split on the `<h3>` elements, grab the `<ul>` that follows each one, and then take the URL from each `<li>`. From there, it is just XML, so maybe we can reuse some or all of our XML driver code?
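As a rough illustration of the parsing approach described above, here is a minimal sketch using only the Python standard library. It assumes each collection is introduced by an `<h3>` heading and that each `<li>` wraps its URL in an `<a href="...">`; the real OPenn markup may differ. The names `CollectionLinkParser` and `links_by_collection` are hypothetical and are not part of the existing dlme-airflow drivers.

```python
from collections import defaultdict
from html.parser import HTMLParser


class CollectionLinkParser(HTMLParser):
    """Group link URLs found in <li> elements under the preceding <h3> text.

    Assumption: the page lists each collection as an <h3> followed by a
    <ul> whose <li> items contain <a href="..."> links to TEI documents.
    """

    def __init__(self):
        super().__init__()
        self.links = defaultdict(list)  # heading text -> list of hrefs
        self._heading_parts = []        # text fragments of the open <h3>
        self._current = None            # most recently closed <h3> text
        self._in_h3 = False
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._in_h3 = True
            self._heading_parts = []
        elif tag == "li":
            self._in_li = True
        elif tag == "a" and self._in_li and self._current is not None:
            href = dict(attrs).get("href")
            if href:
                self.links[self._current].append(href)

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False
            self._current = "".join(self._heading_parts).strip()
        elif tag == "li":
            self._in_li = False

    def handle_data(self, data):
        # Only heading text is collected; link text is ignored.
        if self._in_h3:
            self._heading_parts.append(data)


def links_by_collection(html):
    """Return {heading text: [href, ...]} for an HTML string."""
    parser = CollectionLinkParser()
    parser.feed(html)
    parser.close()
    return dict(parser.links)
```

Calling `links_by_collection(page_html)` on the fetched page would give one list of URLs per heading, which could then be handed to whatever XML handling we already have. Relative links would still need to be resolved against the page URL, for example with `urllib.parse.urljoin`.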

We should have a discussion to see if Airflow is a good solution for this. If not, I can come up with a manual solution that transforms from JSON so we can remove the need to support XML in traject.