sul-dlss / dlme-airflow

This is a new repository to capture the work related to the DLME ETL Pipeline and establish airflow
Apache License 2.0

Driver to harvest from hathitrust #575

Open jacobthill opened 1 week ago

jacobthill commented 1 week ago

HathiTrust has a significant amount of relevant content that we would like to ingest via Airflow. Harvesting it takes several steps, however.

For example:

There are thousands of other items. We can create our own collections and harvest them with the same workflow as above. For example, I created this collection of Arabic manuscripts at McGill.

So essentially, what we need is a way of passing this list of XML URLs to our XML driver.

aaron-collier commented 1 week ago

@jacobthill do you know whether HathiTrust offers the "download json" at a reusable URL? If so, that would be very helpful.

jacobthill commented 1 week ago

Poking around in their HTML, I see we can do something like this:

https://babel.hathitrust.org/cgi/mb?c=466077623;source=hathifiles;a=download

If we provide the collection ID in the catalog, we can build that URL.
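
A minimal sketch of building that URL from a collection ID, assuming the URL pattern above holds for every collection. The function and constant names are hypothetical, not part of dlme-airflow, and the check that IDs are purely numeric is an assumption based on the single example above:

```python
# Hypothetical helper; names are illustrative, not existing dlme-airflow code.
# The template mirrors the URL found in the HathiTrust page HTML above.
HATHITRUST_DOWNLOAD_TEMPLATE = (
    "https://babel.hathitrust.org/cgi/mb"
    "?c={collection_id};source=hathifiles;a=download"
)


def build_collection_download_url(collection_id):
    """Return the collection download URL for a HathiTrust collection ID."""
    collection_id = str(collection_id).strip()
    # Assumption: collection IDs are numeric, as in the one known example.
    if not collection_id.isdigit():
        raise ValueError(f"Unexpected collection id: {collection_id!r}")
    return HATHITRUST_DOWNLOAD_TEMPLATE.format(collection_id=collection_id)


if __name__ == "__main__":
    print(build_collection_download_url(466077623))
```

With something like this, the catalog entry would only need to carry the collection ID, and the driver could derive the download URL at harvest time.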