htrc / htrc-feature-reader

Tools for working with HTRC Feature Extraction files

HTTP Error with Volume() #48

Open · melaniewalsh opened this issue 1 year ago

melaniewalsh commented 1 year ago

I'm trying to fetch HathiTrust metadata for books in a spreadsheet via their HathiTrust IDs and Volume().

[screenshot]

But I'm getting a lot of HTTP Errors like so, even though this URL does exist and contains HathiTrust data:

```
ERROR:root:HTTP Error accessing http://data.analytics.hathitrust.org/features-2020.03/mdp/31532/mdp.39015054033520.json.bz2
```

[screenshot]

This issue seems similar to issue #45, but I'm using a Mac, not a Windows computer. Also, this doesn't happen with all HathiTrust IDs, only some of them.

Any thoughts about what might be going wrong?

bmschmidt commented 1 year ago

Are specific IDs always broken, or just sometimes? That one works fine for me in a colab notebook.

```
!pip install htrc-feature-reader

from htrc_features import Volume
v = Volume('mdp.39015054033520')
v.tokenlist()
```

Could you also include the end of the error trace? It's hard to see what the HTTP error code is here.

organisciak commented 1 year ago

Hmm, odd. I second Ben's question: when you say 'this doesn't happen with all HathiTrust IDs, only some of them', do the ones that succeed or fail do so consistently, or will the same ID sometimes fail and sometimes succeed?

That downloader uses rsync via a subprocess, which is why the error catching is so poor. I suspect the file is failing to download, but Python isn't catching the failure and still tries to open the volume.
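Until the error handling improves, one way to answer the "consistent vs. intermittent" question is to retry the same ID a few times and record each outcome. This is just a sketch: `check_id` is a hypothetical helper, and you would pass the real `Volume` constructor (or any fetch function) in yourself.

```python
import time


def check_id(htid, fetch, attempts=3, pause=0):
    """Call fetch(htid) several times and record whether each attempt
    succeeds, to distinguish consistent failures from flaky ones."""
    outcomes = []
    for _ in range(attempts):
        try:
            fetch(htid)
            outcomes.append("ok")
        except Exception as exc:  # the reader surfaces HTTP errors loosely
            outcomes.append(type(exc).__name__)
        if pause:
            time.sleep(pause)
    return outcomes


# With the feature reader installed you would pass, e.g.:
#   from htrc_features import Volume
#   check_id("mdp.39015054033520", Volume, pause=2)
```

If an ID alternates between "ok" and an error name, that points at the servers rather than the file.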

By the way, if you're just loading metadata, there are also the Hathifiles (https://www.hathitrust.org/hathifiles) and the HathiTrust Bib API (https://www.hathitrust.org/bib_api).
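For metadata-only lookups, a minimal Bib API call might look like the sketch below. The URL pattern follows my reading of the Bib API page (single-volume brief records keyed by HathiTrust ID); treat both the URL shape and the response-shape comment as assumptions to verify against the docs.

```python
import json
from urllib.request import urlopen


def bib_api_url(htid):
    """Brief-record Bib API URL for one HathiTrust volume ID
    (pattern assumed from hathitrust.org/bib_api)."""
    return f"https://catalog.hathitrust.org/api/volumes/brief/htid/{htid}.json"


def fetch_record(htid):
    """Fetch and decode the JSON bibliographic record for one volume."""
    with urlopen(bib_api_url(htid)) as resp:
        return json.load(resp)


# fetch_record("mdp.39015054033520") should return a dict whose
# "records"/"items" entries carry title, OCLC, ISBN, and LCCN fields
# (my recollection of the response shape; check the docs).
```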

bmschmidt commented 1 year ago

Oh yeah, that's right, this is kind of a painful way to get metadata. There is some data that Hathi only distributes through here, not the Hathifiles (e.g. LC classification). @melaniewalsh, send me an e-mail if this is what you're looking for; I believe I have some stuff about parsing this sitting in my e-mail somewhere.

melaniewalsh commented 1 year ago

Thanks @bmschmidt @organisciak! It's good to know about the HathiTrust Bib API.

There are a few reasons I'm trying to get metadata from the Hathi IDs. We specifically included Hathi IDs with all book data in the Post45 Data Collective (e.g. NYT bestsellers) to enable people to work with the full texts/bags of words in HathiTrust. But I recently realized that the Hathi IDs are also our only consistent unique identifier for books, so now I'm trying to retroactively add ISBN and OCLC numbers to make the datasets interoperable with other data about the same books. Similarly, I want to add ISBN/OCLC numbers to some of the Hathi-derived datasets, like the Geographic Locations data, to make them interoperable with data like the Seattle Public Library's collection or circulation data.

Anyway, that's a long-winded way of saying that the HathiTrust Bib API sounds like it might be better for my metadata needs. But I would still like to create some notebooks and resources that demonstrate how you can take the Post45 Data Collective data and connect it with HathiTrust text data.

I'm including the full error message below (it's long). I'm calling Volume() on roughly 5,000 rows in a spreadsheet by applying a function to a column (I also tried looping through the data with Volume()), so I was wondering if the requests are happening too quickly and the timing is the problem?

Error message 👇

```
ERROR:root:HTTP Error accessing http://data.analytics.hathitrust.org/features-2020.03/mdp/31132/mdp.39015018932429.json.bz2
Traceback (most recent call last):
  File "/Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages/htrc_features/caching.py", line 73, in open
    fout = super().open(id, **kwargs)
  File "/Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages/htrc_features/resolvers.py", line 121, in open
    uncompressed = self._open(id = id, suffix = suffix, mode = mode, format = format,
  File "/Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages/htrc_features/resolvers.py", line 203, in _open
    return Path(dir, filename).open(mode = mode)
  File "/Users/melwalsh/opt/anaconda3/lib/python3.8/pathlib.py", line 1221, in open
    return io.open(self, mode, buffering, encoding, errors, newline,
  File "/Users/melwalsh/opt/anaconda3/lib/python3.8/pathlib.py", line 1077, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/var/folders/t1/1xbnlp5j163cd9mt_ht253cw0000gp/T/mdp.39015018932429.json.bz2'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/melwalsh/opt/anaconda3/lib/python3.8/site-packages/htrc_features/resolvers.py", line 182, in _open
    byt = _urlopen(path_or_url).read()
  File "/Users/melwalsh/opt/anaconda3/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/Users/melwalsh/opt/anaconda3/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/Users/melwalsh/opt/anaconda3/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/Users/melwalsh/opt/anaconda3/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/Users/melwalsh/opt/anaconda3/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/Users/melwalsh/opt/anaconda3/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/caching.py in open(self, id, fallback_kwargs, **kwargs)
     72         try:
---> 73             fout = super().open(id, **kwargs)
     74             logging.debug("Successfully returning from cache")

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/resolvers.py in open(self, id, suffix, format, mode, skip_compression, compression, **kwargs)
    120
--> 121         uncompressed = self._open(id = id, suffix = suffix, mode = mode, format = format,
    122                                   compression=compression, **kwargs)

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/resolvers.py in _open(self, id, format, mode, compression, dir, suffix, **kwargs)
    202         filename = self.fname(id, format = format, compression = compression, suffix = suffix)
--> 203         return Path(dir, filename).open(mode = mode)
    204

~/opt/anaconda3/lib/python3.8/pathlib.py in open(self, mode, buffering, encoding, errors, newline)
   1220             self._raise_closed()
-> 1221         return io.open(self, mode, buffering, encoding, errors, newline,
   1222                        opener=self._opener)

~/opt/anaconda3/lib/python3.8/pathlib.py in _opener(self, name, flags, mode)
   1076         # A stub for the opener argument to built-in open()
-> 1077         return self._accessor.open(self, flags, mode)
   1078

FileNotFoundError: [Errno 2] No such file or directory: '/var/folders/t1/1xbnlp5j163cd9mt_ht253cw0000gp/T/mdp.39015018932429.json.bz2'

During handling of the above exception, another exception occurred:

HTTPError                                 Traceback (most recent call last)
 in
----> 1 hathi_df[["title", "isbn", "oclc", "lccn", "pub_date", "title", "pub_place", "publisher"]] = hathi_df[["hathi_id"]].apply(add_metadata_from_hathi, axis = "columns", result_type = "expand" )

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py in apply(self, func, axis, raw, result_type, args, **kwargs)
   8734             kwargs=kwargs,
   8735         )
-> 8736         return op.apply()
   8737
   8738     def applymap(

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/apply.py in apply(self)
    686             return self.apply_raw()
    687
--> 688         return self.apply_standard()
    689
    690     def agg(self):

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/apply.py in apply_standard(self)
    810
    811     def apply_standard(self):
--> 812         results, res_index = self.apply_series_generator()
    813
    814         # wrap results

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/apply.py in apply_series_generator(self)
    826         for i, v in enumerate(series_gen):
    827             # ignore SettingWithCopy here in case the user mutates
--> 828             results[i] = self.f(v)
    829             if isinstance(results[i], ABCSeries):
    830                 # If we have a view on v, we need to make a copy because

 in add_metadata_from_hathi(row)
      3     input_id = row["hathi_id"]
      4
----> 5     volume = Volume(input_id)
      6
      7     return volume.title, volume.isbn, volume.oclc, volume.lccn, volume.pub_date, volume.title, volume.pub_place, volume.publisher

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/feature_reader.py in __init__(self, id, format, id_resolver, default_page_section, path, compression, dir, file_handler, **kwargs)
    462                                  "but requested {} files".format(id_resolver.format, format))
    463
--> 464         self.parser = retrieve_parser(id, format, id_resolver, compression, dir,
    465                                       file_handler=file_handler, **kwargs)
    466

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/feature_reader.py in retrieve_parser(id, format, id_resolver, compression, dir, file_handler, **kwargs)
    338         raise NotImplementedError("Must pass a format. Currently 'json' and 'parquet' are supported.")
    339
--> 340     return Handler(id, id_resolver = id_resolver, dir = dir,
    341                    compression = compression, **kwargs)
    342

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/parsers.py in __init__(self, id, id_resolver, compression, **kwargs)
    188
    189         # parsing and reading are called here.
--> 190         super().__init__(id, id_resolver = id_resolver, compression = compression, **kwargs)
    191
    192     def parse(self, **kwargs):

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/parsers.py in __init__(self, id, id_resolver, mode, **kwargs)
     77             return
     78
---> 79         self.parse(**kwargs)
     80
     81     def __init_resolver(self, id_resolver, format=None, **kwargs):

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/parsers.py in parse(self, **kwargs)
    192     def parse(self, **kwargs):
    193
--> 194         obj = self._parse_json()
    195
    196         self._schema = obj['features']['schemaVersion']

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/parsers.py in _parse_json(self, compression, **kwargs)
    289             if not k in kwargs:
    290                 kwargs[k] = self.args[k]
--> 291         with resolver.open(id, **kwargs) as fin:
    292             rawjson = fin.read()
    293

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/caching.py in open(self, id, fallback_kwargs, **kwargs)
     82                 return input.parser.open(id, **fallback_kwargs)
     83             else:
---> 84                 copy_between_resolvers(id, self.fallback, self.super)
     85                 fout = super().open(id, **kwargs)
     86                 logging.debug("Successfully returning from cache")

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/caching.py in copy_between_resolvers(id, resolver1, resolver2)
      8 def copy_between_resolvers(id, resolver1, resolver2):
      9     # print (resolver1, "--->", resolver2)
---> 10     input = Volume(id, id_resolver=resolver1)
     11     output = Volume(id, id_resolver=resolver2, mode = 'wb')
     12     output.write(input)

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/feature_reader.py in __init__(self, id, format, id_resolver, default_page_section, path, compression, dir, file_handler, **kwargs)
    462                                  "but requested {} files".format(id_resolver.format, format))
    463
--> 464         self.parser = retrieve_parser(id, format, id_resolver, compression, dir,
    465                                       file_handler=file_handler, **kwargs)
    466

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/feature_reader.py in retrieve_parser(id, format, id_resolver, compression, dir, file_handler, **kwargs)
    338         raise NotImplementedError("Must pass a format. Currently 'json' and 'parquet' are supported.")
    339
--> 340     return Handler(id, id_resolver = id_resolver, dir = dir,
    341                    compression = compression, **kwargs)
    342

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/parsers.py in __init__(self, id, id_resolver, compression, **kwargs)
    188
    189         # parsing and reading are called here.
--> 190         super().__init__(id, id_resolver = id_resolver, compression = compression, **kwargs)
    191
    192     def parse(self, **kwargs):

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/parsers.py in __init__(self, id, id_resolver, mode, **kwargs)
     77             return
     78
---> 79         self.parse(**kwargs)
     80
     81     def __init_resolver(self, id_resolver, format=None, **kwargs):

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/parsers.py in parse(self, **kwargs)
    192     def parse(self, **kwargs):
    193
--> 194         obj = self._parse_json()
    195
    196         self._schema = obj['features']['schemaVersion']

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/parsers.py in _parse_json(self, compression, **kwargs)
    289             if not k in kwargs:
    290                 kwargs[k] = self.args[k]
--> 291         with resolver.open(id, **kwargs) as fin:
    292             rawjson = fin.read()
    293

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/resolvers.py in open(self, id, suffix, format, mode, skip_compression, compression, **kwargs)
    119             skip_compression = True
    120
--> 121         uncompressed = self._open(id = id, suffix = suffix, mode = mode, format = format,
    122                                   compression=compression, **kwargs)
    123

~/opt/anaconda3/lib/python3.8/site-packages/htrc_features/resolvers.py in _open(self, id, mode, compression, **kwargs)
    180
    181         try:
--> 182             byt = _urlopen(path_or_url).read()
    183             req = BytesIO(byt)
    184         except HTTPError:

~/opt/anaconda3/lib/python3.8/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    220     else:
    221         opener = _opener
--> 222     return opener.open(url, data, timeout)
    223
    224 def install_opener(opener):

~/opt/anaconda3/lib/python3.8/urllib/request.py in open(self, fullurl, data, timeout)
    529         for processor in self.process_response.get(protocol, []):
    530             meth = getattr(processor, meth_name)
--> 531             response = meth(req, response)
    532
    533         return response

~/opt/anaconda3/lib/python3.8/urllib/request.py in http_response(self, request, response)
    638         # request was successfully received, understood, and accepted.
    639         if not (200 <= code < 300):
--> 640             response = self.parent.error(
    641                 'http', request, response, code, msg, hdrs)
    642

~/opt/anaconda3/lib/python3.8/urllib/request.py in error(self, proto, *args)
    567         if http_err:
    568             args = (dict, 'default', 'http_error_default') + orig_args
--> 569             return self._call_chain(*args)
    570
    571     # XXX probably also want an abstract factory that knows when it makes

~/opt/anaconda3/lib/python3.8/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
    500         for handler in handlers:
    501             func = getattr(handler, meth_name)
--> 502             result = func(*args)
    503         if result is not None:
    504             return result

~/opt/anaconda3/lib/python3.8/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
    647 class HTTPDefaultErrorHandler(BaseHandler):
    648     def http_error_default(self, req, fp, code, msg, hdrs):
--> 649         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    650
    651 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 404: Not Found
```
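One way to rule timing in or out, and to keep a single 404 from killing the whole run, is a throttled loop that records failures instead of raising. This is a sketch: `collect_metadata` is a hypothetical helper, and you would pass `Volume` (or a wrapper that pulls out the fields you need) as `fetch`.

```python
import time


def collect_metadata(htids, fetch, pause=1.0):
    """Fetch each volume with a delay between requests, recording
    failing IDs instead of letting one error halt the whole run."""
    rows, failed = [], []
    for htid in htids:
        try:
            rows.append(fetch(htid))
        except Exception:
            failed.append(htid)
        if pause:
            time.sleep(pause)
    return rows, failed


# e.g. rows, failed = collect_metadata(hathi_df["hathi_id"], Volume)
# and then inspect `failed` to see which IDs consistently 404.
```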
bmschmidt commented 1 year ago

For adding ISBN/OCLC/LCCN identifiers I would probably use the Hathifiles: you can just download them and parse the data in, and they have these columns (link). The Bib API can be slow, IIRC.

But 5k isn't that much, so the Bib API is fine too.
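A Hathifile can be joined to the spreadsheet with pandas. A sketch, assuming the tab-separated layout and the field names described on the Hathifiles page (verify the exact current field list before relying on it):

```python
import csv

import pandas as pd

# Field names per the Hathifiles description page (an assumption to
# verify against the current spec before use).
HATHIFILE_COLS = [
    "htid", "access", "rights", "ht_bib_key", "description", "source",
    "source_bib_num", "oclc_num", "isbn", "issn", "lccn", "title",
    "imprint", "rights_reason_code", "rights_timestamp", "us_gov_doc_flag",
    "rights_date_used", "pub_place", "lang", "bib_fmt", "collection_code",
    "content_provider_code", "responsible_entity_code",
    "digitization_agent_code", "access_profile_code", "author",
]


def load_identifiers(path):
    """Read a hathifile (pandas decompresses .gz paths automatically)
    and keep just the identifier columns for merging."""
    df = pd.read_csv(path, sep="\t", names=HATHIFILE_COLS, header=None,
                     dtype=str, quoting=csv.QUOTE_NONE)
    return df[["htid", "oclc_num", "isbn", "lccn", "title"]]


# Hypothetical join against a spreadsheet with a "hathi_id" column:
#   merged = my_books.merge(load_identifiers("hathi_full_20230401.txt.gz"),
#                           left_on="hathi_id", right_on="htid", how="left")
```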

I'd also just write ht-help. I don't know if anyone there monitors this repo, but when I've had this kind of issue it tends to be because some of their servers are on the blink; I think there's load-balancing across several servers or something like that.

melaniewalsh commented 1 year ago

Thanks @bmschmidt. That's a good call about reaching out to ht-help (edit: I'm not actually getting the same error with the BibAPI — I'm getting a different error). But I will try out the Hathifiles — thanks for the tip!