OPenn is one of the data providers that has not yet been automated to run in Airflow. None of our current drivers will work: https://openn.library.upenn.edu/TechnicalReadMe.html
Essentially, we need our driver to navigate this HTML page and grab all of the links to the TEI documents for each collection. As far as I can tell, the technical documentation above does not provide a way for us to list all of the records in each of the Manuscripts of the Muslim World collections. Without this, the only way I can see to keep DLME synced with OPenn is to parse the above HTML page, splitting on the `<h3>` element, grabbing the `<ul>`, and then taking each URL from each `<li>`. From there, it is just XML, so maybe we can reuse some or all of our XML driver code?

We should have a discussion to see if Airflow is a good solution for this. If not, I can come up with a manual solution that transforms from JSON so we can remove the need to support XML in traject.
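
To make the parsing idea above concrete, here is a rough sketch using only the Python standard library. It assumes the page is laid out as an `<h3>` heading per collection followed by a `<ul>` whose `<li>` items each contain an `<a href>` pointing at a TEI file; I have not verified that against the real markup. The names `CollectionLinkParser` and `extract_tei_links` and the sample HTML are made up for illustration and are not part of any existing driver.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CollectionLinkParser(HTMLParser):
    """Group links found inside <li> elements under the preceding <h3> text."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = {}        # heading text -> list of URLs
        self._heading = None   # text of the most recent <h3>
        self._in_h3 = False
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            # A new heading starts a new collection group.
            self._in_h3 = True
            self._heading = ""
        elif tag == "li":
            self._in_li = True
        elif tag == "a" and self._in_li and self._heading is not None:
            href = dict(attrs).get("href")
            if href:
                # Resolve relative hrefs against the page URL.
                url = urljoin(self.base_url, href)
                self.links.setdefault(self._heading, []).append(url)

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False
            if self._heading is not None:
                self._heading = self._heading.strip()
        elif tag == "li":
            self._in_li = False

    def handle_data(self, data):
        if self._in_h3:
            self._heading += data


def extract_tei_links(html, base_url=""):
    """Return a dict mapping each <h3> heading to the URLs listed beneath it."""
    parser = CollectionLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    # Hypothetical markup; the real page structure may differ.
    sample = """
    <h3>Collection A</h3>
    <ul>
      <li><a href="Data/a1_TEI.xml">a1</a></li>
      <li><a href="Data/a2_TEI.xml">a2</a></li>
    </ul>
    """
    for heading, urls in extract_tei_links(sample, "https://example.org/index.html").items():
        print(heading, urls)
```

If the real page nests things differently (for example, links outside of `<li>` elements or headings at another level), the tag checks would need adjusting. Each returned URL could then be handed to whatever XML handling we already have.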