Open jacobthill opened 1 week ago
@jacobthill do you know if HathiTrust offers the "download json" at a reusable URL? If so, that would be very helpful.
Poking around in their HTML, I see we can do something like this:
https://babel.hathitrust.org/cgi/mb?c=466077623;source=hathifiles;a=download
If we provide the collection ID in the catalog, we can build that URL.
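A minimal sketch of building that URL from a collection ID, assuming the pattern above holds for other collections (I have only seen it for this one). The function name `build_collection_download_url` is made up for illustration, and the query string keeps the semicolon separators exactly as they appear in the URL above.

```python
# Base endpoint taken from the example URL above.
HATHITRUST_MB_BASE = "https://babel.hathitrust.org/cgi/mb"


def build_collection_download_url(collection_id: str) -> str:
    """Return the download URL for a HathiTrust collection.

    Hypothetical helper: assumes every collection follows the same
    `c=<id>;source=hathifiles;a=download` pattern as the example.
    """
    collection_id = str(collection_id).strip()
    if not collection_id.isdigit():
        # The only ID seen so far is numeric; reject anything else early.
        raise ValueError(f"unexpected collection id: {collection_id!r}")
    return f"{HATHITRUST_MB_BASE}?c={collection_id};source=hathifiles;a=download"


if __name__ == "__main__":
    print(build_collection_download_url("466077623"))
```

With this, the catalog entry would only need to carry the collection ID, and the harvest task could derive the URL at run time.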
HathiTrust has a significant amount of relevant content that we would like to ingest via Airflow. Harvesting it takes several steps, however.
For example:
.xml
to get an XML metadata file.
There are thousands of other items. We can create our own collections to harvest them in the same workflow as above. For example, I created this collection of Arabic manuscripts at McGill.
So essentially, what we need is a way of passing this list of XML URLs to our XML driver.
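A rough sketch of what "passing a list of XML URLs" could look like, independent of how our actual XML driver is written. Everything here is an assumption for illustration: `harvest_xml_urls` and `fetch_xml` are hypothetical names, not existing functions in our codebase, and the fetch step is pluggable so it can be swapped for whatever the driver really uses.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterable
from urllib.request import urlopen


def fetch_xml(url: str) -> bytes:
    """Download one XML document (hypothetical default fetcher)."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def harvest_xml_urls(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = fetch_xml,
) -> Dict[str, ET.Element]:
    """Fetch and parse each URL once, returning parsed roots keyed by URL."""
    records: Dict[str, ET.Element] = {}
    for url in urls:
        if url in records:
            # Skip duplicates so the same item is not harvested twice.
            continue
        records[url] = ET.fromstring(fetch(url))
    return records
```

In an Airflow task, the list of URLs would come from the downloaded collection file, and the parsed records would be handed on to the existing mapping step.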