Open: melaniewalsh opened this issue 1 year ago
Are specific IDs always broken, or just sometimes? That one works fine for me in a colab notebook.
!pip install htrc-feature-reader
from htrc_features import Volume

# Load a single volume by its HathiTrust ID and pull its token counts
v = Volume('mdp.39015054033520')
v.tokenlist()
Also, could you include the end of the error trace? It's hard to see what the HTTP error code is here.
Hmm, odd. I second Ben's question: when you say 'this doesn't happen with all HathiTrust IDs, only some of them', do the same IDs succeed or fail consistently, or will the same ID sometimes fail and sometimes succeed?
That uses rsync with a subprocess, which is why the error catching is so poor. I suspect the file is failing to download but Python isn't catching it and still trying to open the volume.
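The failure mode described above can be illustrated generically (this is a sketch, not the feature reader's actual download code): `subprocess.run` does not raise when the child command fails, so unless the return code is checked, Python carries on and later tries to open a file that never arrived.

```python
import subprocess
import sys

# A failing child process does NOT raise by default; the stand-in
# command below simulates an rsync transfer that exits with an error.
proc = subprocess.run(
    [sys.executable, "-c", "import sys; sys.exit(23)"],
    capture_output=True, text=True,
)
assert proc.returncode == 23  # no exception was raised above

# With check=True the failure surfaces immediately instead:
try:
    subprocess.run([sys.executable, "-c", "import sys; sys.exit(23)"],
                   check=True, capture_output=True)
except subprocess.CalledProcessError as err:
    print("transfer failed with code", err.returncode)
```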
By the way, if you're just loading metadata, there are also the Hathifiles (https://www.hathitrust.org/hathifiles) and the HathiTrust Bib API (https://www.hathitrust.org/bib_api).
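For a single ID, hitting the Bib API is straightforward. A minimal sketch, assuming the URL pattern documented on the Bib API page (catalog.hathitrust.org/api/volumes/{brief|full}/{idtype}/{id}.json):

```python
import json
from urllib.request import urlopen

# The Bib API takes an ID type (htid, oclc, isbn, lccn, recordnumber)
# and returns JSON with 'records' and 'items' keys.
def bib_api_url(id_type, identifier, level="brief"):
    """Build a Bib API URL for one identifier ('brief' or 'full')."""
    return (f"https://catalog.hathitrust.org/api/volumes/"
            f"{level}/{id_type}/{identifier}.json")

url = bib_api_url("htid", "mdp.39015054033520")
# data = json.load(urlopen(url))  # uncomment to actually fetch over the network
```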
Oh yeah, that's right: this is kind of a painful way to get metadata. There is some data that Hathi only distributes through here, not the Hathifiles (e.g. LC classification). @melaniewalsh, send me an e-mail if this is what you're looking for; I believe I have some notes about parsing this sitting in my e-mail somewhere.
Thanks @bmschmidt @organisciak! It's good to know about the HathiTrust Bib API.
There are a few reasons that I'm trying to get metadata from the Hathi IDs. We specifically included Hathi IDs with all book data in the Post45 Data Collective (e.g. NYT bestsellers) to enable people to work with the full texts/bags of words in HathiTrust. But I recently realized that the Hathi IDs are basically also our only consistent unique identifier for books, so now I'm trying to retroactively add ISBN and OCLC numbers, so we can make the datasets interoperable with other data about the same books. Similarly, I want to add ISBN/OCLC numbers to some of the Hathi derived datasets, like the Geographic Locations data, to make them interoperable with data like the Seattle Public Library's collection or circulation data.
Anyway, that's a long-winded way of saying that the HathiTrust Bib API sounds like it might be better for my metadata needs. But I would still like to create some notebooks and resources that demonstrate how you can take the Post45 Data Collective data and connect it with HathiTrust text data.
I'm including the full error message below (it's long). I'm calling Volume() on about 5,000 rows in a spreadsheet by applying a function to a column (I also tried looping through the data and calling Volume()), so I was wondering if it's happening too quickly, or maybe the timing is the problem?
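If the same ID sometimes works and sometimes doesn't, a retry-with-backoff wrapper is a reasonable workaround. A minimal sketch, where `fetch` is a hypothetical stand-in for whatever per-ID call you make (e.g. constructing a Volume and calling tokenlist()):

```python
import logging
import time

def fetch_with_retries(fetch, htid, retries=3, delay=2.0):
    """Call fetch(htid), retrying on any exception with linear backoff."""
    for attempt in range(retries):
        try:
            return fetch(htid)
        except Exception as exc:
            logging.warning("attempt %d failed for %s: %s",
                            attempt + 1, htid, exc)
            time.sleep(delay * (attempt + 1))  # wait longer each retry
    return None  # give up; caller can record the failure and move on
```

Returning None for exhausted retries lets you collect the failed IDs and re-run just those later, instead of aborting the whole 5,000-row pass.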
For adding ISBN/OCLC/LCCN identifiers I would probably use the Hathifiles: you can just download the files and parse the data in. The Bib API can be slow, IIRC (link). The Hathifiles include those columns.
But 5k isn't that much, so the bibAPI is fine.
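If you do go the Hathifiles route, they can be parsed with the standard library. A sketch assuming the files are gzipped, header-less TSVs and that the column order below matches the current file description (verify both against HathiTrust's Hathifiles docs before relying on this):

```python
import csv
import gzip

# Assumed column order (truncated; the real files have more columns).
HATHIFILE_COLS = ["htid", "access", "rights", "ht_bib_key", "description",
                  "source", "source_bib_num", "oclc_num", "isbn", "issn",
                  "lccn", "title", "imprint"]

def load_identifiers(path, wanted_htids):
    """Map each wanted HathiTrust ID to its (OCLC, ISBN) fields."""
    out = {}
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            rec = dict(zip(HATHIFILE_COLS, row))
            if rec.get("htid") in wanted_htids:
                out[rec["htid"]] = (rec.get("oclc_num"), rec.get("isbn"))
    return out
```

Streaming row by row like this keeps memory flat even though the full Hathifiles run to millions of lines; you only keep the ~5,000 IDs you care about.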
I'd also just write ht-help. I don't know if anyone there monitors this repo, but when I've had this kind of issue it tends to be because some of their servers are on the blink; I think there's some load-balancing across several servers, or something like that.
Thanks @bmschmidt. That's a good call about reaching out to ht-help (edit: I'm not actually getting the same error with the Bib API; I'm getting a different error). But I will try out the Hathifiles. Thanks for the tip!
I'm trying to fetch HathiTrust metadata for books in a spreadsheet via their HathiTrust IDs and Volume().
But I'm getting a lot of HTTP Errors like so, even though this URL does exist and contains HathiTrust data:
ERROR:root:HTTP Error accessing http://data.analytics.hathitrust.org/features-2020.03/mdp/31532/mdp.39015054033520.json.bz2
This issue seems similar to issue #45, but I'm using a Mac, not a Windows computer. Also this doesn't happen with all HathiTrust IDs, only some of them.
Any thoughts about what might be going wrong?