Open mattdricker opened 3 years ago
@robertknight has investigated as documented here https://hypothes-is.slack.com/archives/C2BLQDKHA/p1622042949086500
So it looks like we need to do two things here:
- Handle the case where a non-200 response is returned and be prepared for the content type being incorrect (because a request that should return a PDF instead returns an HTML error page)
- Change the User-Agent header so that it is different from the default one used by the Python requests library, to avoid scenarios where that user agent has been blocked due to abuse from scripts
Of the two bullet points, Rob says the second is easier and we should do that now.
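A minimal sketch of the first bullet, assuming the decision can be made from the upstream status code and `Content-Type` header alone. The function name `classify_response` and its return values are hypothetical and not taken from Via's code:

```python
from typing import Optional


def classify_response(status_code: int, content_type: Optional[str]) -> str:
    """Classify an upstream response as 'pdf', 'html' or 'error'.

    Hypothetical helper: a non-2xx response is never trusted for type
    detection, because its Content-Type describes the error page rather
    than the document that was requested.
    """
    if not 200 <= status_code < 300:
        return "error"

    # Drop parameters such as "; charset=UTF-8" before comparing.
    mime_type = (content_type or "").split(";", 1)[0].strip().lower()

    if mime_type == "application/pdf":
        return "pdf"
    return "html"
```

With this shape, a 403 carrying `text/html; charset=UTF-8` is reported as an error instead of being treated as an HTML page.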
Another PDF example, superficially similar: https://via.hypothes.is/https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4370913/pdf/ndt-11-715.pdf
Reported by user at https://app.hubspot.com/contacts/6291320/ticket/437006428/
@mattdricker found another example of a site rejecting requests that come from the python/requests library:
[I] ~/h/client (vi)> python
Python 3.9.4 (default, May 6 2021, 07:36:22)
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> rsp = requests.get('https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4370913/pdf/ndt-11-715.pdf#annotations:group:__world__')
>>> rsp
<Response [403]>
>>> rsp.headers
{'Date': 'Thu, 03 Jun 2021 08:46:39 GMT', 'Server': 'Apache', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'Referrer-Policy': 'origin-when-cross-origin', 'Content-Security-Policy': 'upgrade-insecure-requests', 'Cache-Control': 'private', 'NCBI-PHID': '8A1AE7F00B895B0100000000006E006E.m_6', 'NCBI-SID': '8A1AE7F00B896EF1_0110SID', 'Content-Type': 'text/html; charset=UTF-8', 'Set-Cookie': 'ncbi_sid=8A1AE7F00B896EF1_0110SID; domain=.nih.gov; path=/; expires=Fri, 03 Jun 2022 08:46:39 GMT, WebEnv=14aIe4%408A1AE7F00B896EF1_0110SID; domain=.nlm.nih.gov; path=/; expires=Thu, 03 Jun 2021 16:46:39 GMT', 'X-UA-Compatible': 'IE=Edge', 'X-XSS-Protection': '1; mode=block', 'Keep-Alive': 'timeout=1, max=10', 'Connection': 'Keep-Alive', 'Transfer-Encoding': 'chunked'}
>>> rsp.headers['Content-Type']
'text/html; charset=UTF-8'
>>>
>>> rsp = requests.get('https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4370913/pdf/ndt-11-715.pdf#annotations:group:__world__', headers={'User-Agent': 'Definitely-Not-The-Python-Requests-Library'})
>>> rsp
<Response [200]>
>>>
Legacy Via used to forward the User-Agent of the end-user's browser and append a Hypothesis-Via token. See https://github.com/hypothesis/legacyvia/blob/285a00676f9ec7deb64e9a91ea0c1d008ad86ee8/via/app.py#L34. Since this is a proven approach it might be wise to do the same.
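As a rough illustration of that approach, the outgoing header could be built like this. The helper name `build_user_agent` and the exact token format are assumptions; the linked legacy code is the authority on what was actually sent:

```python
from typing import Optional


def build_user_agent(browser_user_agent: Optional[str],
                     token: str = "Hypothesis-Via") -> str:
    """Return the browser's User-Agent with a Via token appended.

    Hypothetical helper: the separator and token spelling are assumed
    for illustration, not copied from legacy Via.
    """
    user_agent = (browser_user_agent or "").strip()
    if not user_agent:
        # No browser User-Agent was supplied; send the token by itself.
        return token
    return f"{user_agent} {token}"
```

For example, `build_user_agent("Mozilla/5.0")` gives `"Mozilla/5.0 Hypothesis-Via"`.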
The user who originally reported this recently pinged us to see if there had been any update: https://app.hubspot.com/contacts/6291320/ticket/427633584/
I don't think the status has changed here. It might be fairly easy to implement the original Via's behavior as described in https://github.com/hypothesis/support/issues/207#issuecomment-853919231.
The original user is asking again if this can be looked at: https://app.hubspot.com/contacts/6291320/ticket/1125544356
Per user report https://app.hubspot.com/contacts/6291320/ticket/427633584/
When a certain PDF is opened through Via, the loading bar stops partway and the Hypothesis sidebar never appears.
Example: https://via.hypothes.is/https://bleakhouse.wordpress.ncsu.edu/files/2020/01/Bleak-House-Working-Notes-Transcriptions.pdf
Opening the PDF without Via and activating the Hypothesis extension manually works fine.
Notes from @robertknight on 2022-09-26:
The problem happens when Via fetches the URL to determine the file type (HTML vs PDF). The original URL (https://bleakhouse.wordpress.ncsu.edu/files/2020/01/Bleak-House-Working-Notes-Transcriptions.pdf) returns a 403 response when fetched using the Python requests library's default User-Agent header. This response causes Via to try to serve the response using ViaHTML (as HTML) rather than Via (as a PDF). When ViaHTML serves the PDF, it gets loaded in the browser's built-in PDF viewer instead of PDF.js.
I believe we can fix the problem by making Via forward the browser's User-Agent header when it fetches the URL to determine the content type.
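A hedged sketch of what that fetch might look like with the requests library. The function name `fetch_status_and_type`, the injectable `get` parameter and the 10-second timeout are all assumptions made for illustration; this is not Via's actual code:

```python
from typing import Callable, Optional, Tuple


def fetch_status_and_type(url: str,
                          browser_user_agent: Optional[str],
                          get: Optional[Callable] = None) -> Tuple[int, str]:
    """Fetch `url` with the browser's User-Agent; return (status, MIME type).

    `get` defaults to `requests.get`; it is injectable so the function
    can be exercised without network access.
    """
    if get is None:
        import requests  # third-party library, imported lazily
        get = requests.get

    headers = {}
    if browser_user_agent:
        # Overrides the default "python-requests/x.y.z" User-Agent.
        headers["User-Agent"] = browser_user_agent

    # stream=True avoids downloading the body just to read the headers.
    response = get(url, headers=headers, stream=True, timeout=10)
    try:
        content_type = response.headers.get("Content-Type", "")
        mime_type = content_type.split(";", 1)[0].strip().lower()
        return response.status_code, mime_type
    finally:
        response.close()
```

The caller would pass in the `User-Agent` header from the incoming browser request, and should still check the returned status code before trusting the MIME type.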