htrc / htrc-feature-reader

Tools for working with HTRC Feature Extraction files

Access denied via Volume() #45

Open Ori-Pixel opened 2 years ago

Ori-Pixel commented 2 years ago
from htrc_features import Volume

print(Volume('mdp.39015028036104').tokenlist())

ERROR:root:HTTP Error accessing http://data.analytics.hathitrust.org/features-2020.03/mdp\31230\mdp.39015028036104.json.bz2

I can open the URL in a browser, but it shows a 'no online access' page. I can download the JSON file, but I cannot use it with Volume().

If I try path = r""

I get: OSError: Invalid data stream
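
One possible cause of `OSError: Invalid data stream`, not confirmed anywhere in this thread, is that the file on disk is not actually bz2-compressed (for example, an HTML page or an already-decompressed JSON saved under a `.json.bz2` name). The sketch below checks the bz2 magic bytes before choosing how to open the file. The helper names `is_bz2` and `load_feature_json` are hypothetical and are not part of `htrc_features`.

```python
import bz2
import json

# Every bz2 stream starts with these three bytes.
BZ2_MAGIC = b"BZh"


def is_bz2(path):
    """Return True if the file at `path` starts with the bz2 magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(len(BZ2_MAGIC)) == BZ2_MAGIC


def load_feature_json(path):
    """Hypothetical helper: parse JSON from `path`, compressed or not."""
    # Pick the opener from the file contents, not from the file extension.
    opener = bz2.open if is_bz2(path) else open
    with opener(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)
```

If `is_bz2` returns False for a downloaded `.json.bz2` file, the download itself is likely the problem rather than the reader.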

bmschmidt commented 2 years ago

Thank you for the clear report. What is your system setup (especially OS and character encoding)? This code works fine on my Mac and on a clean Google Colab instance, so I suspect it must be some kind of system-specific escaping problem.

bmschmidt commented 2 years ago

Also, is this a problem with all htids, or just the one here?

Given the backslashes where there should be slashes in that URL, it seems possible that this is pathlib overcompensating for being on Windows, but I'm not sure why we wouldn't have seen this before.

"mdp\31230\mdp"

Ori-Pixel commented 2 years ago

Setup:

Windows 10 Pro
IsSingleByte      : True
BodyName          : iso-8859-1
EncodingName      : Western European (Windows)
HeaderName        : Windows-1252
WebName           : Windows-1252
WindowsCodePage   : 1252
IsBrowserDisplay  : True
IsBrowserSave     : True
IsMailNewsDisplay : True
IsMailNewsSave    : True
EncoderFallback   : System.Text.InternalEncoderBestFitFallback
DecoderFallback   : System.Text.InternalDecoderBestFitFallback
IsReadOnly        : True
CodePage          : 1252

The first/top error when passing only Volume('mdp.39015029970129').tokenlist():

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\<name>\\AppData\\Local\\Temp\\mdp.39015029970129.json.bz2'

It doesn't work for any IDs for me. It was working on Mac/Manjaro.

bmschmidt commented 2 years ago

Thanks. Let me tag in @organisciak since I think he develops this on Windows.

Ori-Pixel commented 2 years ago

@bmschmidt The error seems to only exist in 2.0.7. If you downgrade the package to 2.0.6, it works.

bmschmidt commented 2 years ago

Thanks for looking into this further. It looks like 2.0.7 is a version number that only appears on the MassiveTexts branch. My best guess is that this problem was introduced here. It calls os.path.join here to build a URL, which leaves the slashes facing the wrong way on Windows when the result is put into a URL. The most elegant fix would probably be to switch from strings to pathlib.Path instances for the return values there, but I'm not sure if it would work.

@organisciak, do you still have Windows to fix/test this? I've proposed https://github.com/massivetexts/htrc-feature-reader/pull/29 as a really stupid way to fix what appears to be the problem, but I don't know if it's a full fix.
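
To illustrate the suspected cause, here is a minimal sketch of how a Windows-style path join puts backslashes into a URL, and how a POSIX-style join avoids that. It is not the library's actual code or the change in the pull request. The function names `build_url_buggy` and `build_url_fixed` are hypothetical, and `ntpath` stands in for `os.path` on Windows so the behavior can be reproduced on any OS.

```python
import ntpath     # Windows flavour of os.path; importable on every OS
import posixpath  # POSIX flavour; always joins with forward slashes

# Base URL taken from the error message earlier in this thread.
BASE_URL = "http://data.analytics.hathitrust.org/features-2020.03"


def build_url_buggy(base, *parts):
    """Hypothetical: mimics joining URL segments with os.path.join on Windows."""
    return base + "/" + ntpath.join(*parts)


def build_url_fixed(base, *parts):
    """Hypothetical: joins URL segments with forward slashes on every OS."""
    return posixpath.join(base, *parts)


parts = ("mdp", "31230", "mdp.39015028036104.json.bz2")
print(build_url_buggy(BASE_URL, *parts))  # contains backslashes
print(build_url_fixed(BASE_URL, *parts))  # forward slashes only
```

Because URL separators do not depend on the host OS, joining URL segments with `posixpath.join` (or a plain `"/".join`) keeps them independent of the platform, whereas `os.path.join` follows the local filesystem convention.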