Via unable to successfully load PDF

mattdricker commented 3 years ago

Per user report https://app.hubspot.com/contacts/6291320/ticket/427633584/

When a user tries to open a certain PDF through Via, the loading bar stops partway and the Hypothesis sidebar never appears.

Example: https://via.hypothes.is/https://bleakhouse.wordpress.ncsu.edu/files/2020/01/Bleak-House-Working-Notes-Transcriptions.pdf

Opening the PDF without Via and activating the Hypothesis extension manually works fine.

Notes from @robertknight on 2022-09-26:

The problem happens when Via fetches the URL to determine the file type (HTML vs PDF). The original URL (https://bleakhouse.wordpress.ncsu.edu/files/2020/01/Bleak-House-Working-Notes-Transcriptions.pdf) returns a 403 response when fetched using the Python requests library's default User-Agent header. Because of this response, Via tries to serve the document using ViaHTML (as HTML) rather than Via (as a PDF). When ViaHTML serves the PDF, it gets loaded in the browser's built-in PDF viewer instead of PDF.js.

I believe we can fix the problem by making Via forward the browser's User-Agent header when it fetches the URL to determine the content type.

mattdricker commented 3 years ago

@robertknight has investigated, as documented here: https://hypothes-is.slack.com/archives/C2BLQDKHA/p1622042949086500

So it looks like we need to do two things here:

1. Handle the case where a non-200 response is returned and be prepared for the content type being incorrect (because a request that should return a PDF instead returns an HTML error page)

2. Change the User-Agent header so that it is different from the default one used by the Python requests library, to avoid scenarios where that user agent has been blocked due to abuse from scripts
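
For illustration, here is a minimal sketch of how the first point might look. It is not Via's actual code: `classify_response` and `url_looks_like_pdf` are hypothetical helpers, and both the PDF signature check and the URL-extension fallback are assumptions about one reasonable way to avoid trusting the Content-Type of an error page.

```python
from typing import Optional
from urllib.parse import urlparse

# Every PDF file begins with this signature.
PDF_MAGIC = b"%PDF-"


def classify_response(
    status_code: int, content_type: Optional[str], body_prefix: bytes = b""
) -> str:
    """Return "pdf", "html", or "unknown" for an upstream response.

    Hypothetical helper: the names and return values are illustrative only.
    """
    # Drop parameters such as "; charset=UTF-8" before comparing.
    mime = (content_type or "").split(";", 1)[0].strip().lower()

    # The first bytes of the body are stronger evidence than any header.
    if body_prefix.startswith(PDF_MAGIC):
        return "pdf"

    # A non-2xx response is probably an error page, so its Content-Type
    # describes the error page rather than the requested document.
    if not 200 <= status_code < 300:
        return "unknown"

    if mime == "application/pdf":
        return "pdf"
    if mime in ("text/html", "application/xhtml+xml"):
        return "html"
    return "unknown"


def url_looks_like_pdf(url: str) -> bool:
    """Fallback hint for the "unknown" case: does the URL path end in .pdf?"""
    # urlparse separates the query string and fragment from the path.
    return urlparse(url).path.lower().endswith(".pdf")
```

With this sketch, a 403 response carrying `text/html; charset=UTF-8` is classified as "unknown" rather than "html", and the caller can then consult `url_looks_like_pdf` before deciding how to serve the document.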

mkdir-washington-edu commented 3 years ago

Of the two points above, Rob says the second (changing the User-Agent header) is easier and we should do that now.

mattdricker commented 3 years ago

Another PDF example, superficially similar: https://via.hypothes.is/https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4370913/pdf/ndt-11-715.pdf

Reported by user at https://app.hubspot.com/contacts/6291320/ticket/437006428/

robertknight commented 3 years ago

@mattdricker found another example of a site rejecting requests that come from the python/requests library:

[I]  ~/h/client (vi)> python
Python 3.9.4 (default, May  6 2021, 07:36:22)
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> rsp = requests.get('https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4370913/pdf/ndt-11-715.pdf#annotations:group:__world__')
>>> rsp
<Response [403]>
>>> rsp.headers
{'Date': 'Thu, 03 Jun 2021 08:46:39 GMT', 'Server': 'Apache', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'Referrer-Policy': 'origin-when-cross-origin', 'Content-Security-Policy': 'upgrade-insecure-requests', 'Cache-Control': 'private', 'NCBI-PHID': '8A1AE7F00B895B0100000000006E006E.m_6', 'NCBI-SID': '8A1AE7F00B896EF1_0110SID', 'Content-Type': 'text/html; charset=UTF-8', 'Set-Cookie': 'ncbi_sid=8A1AE7F00B896EF1_0110SID; domain=.nih.gov; path=/; expires=Fri, 03 Jun 2022 08:46:39 GMT, WebEnv=14aIe4%408A1AE7F00B896EF1_0110SID; domain=.nlm.nih.gov; path=/; expires=Thu, 03 Jun 2021 16:46:39 GMT', 'X-UA-Compatible': 'IE=Edge', 'X-XSS-Protection': '1; mode=block', 'Keep-Alive': 'timeout=1, max=10', 'Connection': 'Keep-Alive', 'Transfer-Encoding': 'chunked'}
>>> rsp.headers['Content-Type']
'text/html; charset=UTF-8'
>>>
>>> rsp = requests.get('https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4370913/pdf/ndt-11-715.pdf#annotations:group:__world__', headers={'User-Agent': 'Definitely-Not-The-Python-Requests-Library'})
>>> rsp
<Response [200]>
>>>

robertknight commented 3 years ago

Legacy Via used to forward the User-Agent of the end-user's browser and append a Hypothesis-Via token. See https://github.com/hypothesis/legacyvia/blob/285a00676f9ec7deb64e9a91ea0c1d008ad86ee8/via/app.py#L34. Since this is a proven approach, it might be wise to do the same.
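
A rough sketch of that approach, under stated assumptions: the helper names, the fallback string, and the exact way the token is appended are hypothetical and are not copied from legacy Via. Only the idea of forwarding the browser's User-Agent with a `Hypothesis-Via` token comes from the comment above.

```python
from typing import Dict, Mapping, Optional

VIA_TOKEN = "Hypothesis-Via"  # token name from the comment above
FALLBACK_UA = "Mozilla/5.0 (compatible)"  # hypothetical fallback value


def upstream_user_agent(browser_ua: Optional[str]) -> str:
    """Build the User-Agent to send upstream from the browser's own value."""
    # Use the fallback if the browser sent no User-Agent or an empty one.
    base = (browser_ua or "").strip() or FALLBACK_UA
    # Avoid appending the token twice if the request already passed through Via.
    if VIA_TOKEN in base:
        return base
    return f"{base} {VIA_TOKEN}"


def upstream_headers(incoming: Mapping[str, str]) -> Dict[str, str]:
    """Pick the User-Agent out of the incoming request headers.

    Header names are case-insensitive, so match on the lowercased name.
    """
    ua = next(
        (value for name, value in incoming.items() if name.lower() == "user-agent"),
        None,
    )
    return {"User-Agent": upstream_user_agent(ua)}
```

The resulting dictionary could be passed as the `headers=` argument to `requests.get`, in the same way as the custom User-Agent in the session above.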

mattdricker commented 2 years ago

The user who originally reported this recently pinged us to ask whether there had been any update: https://app.hubspot.com/contacts/6291320/ticket/427633584/

robertknight commented 2 years ago

I don't think the status has changed here. It might be fairly easy to implement the original Via's behavior, as described in https://github.com/hypothesis/support/issues/207#issuecomment-853919231.

mattdricker commented 1 year ago

The original user is asking again whether this can be looked at: https://app.hubspot.com/contacts/6291320/ticket/1125544356

hypothesis / product-backlog

Via unable to successfully load PDF #1389