palewire / savepagenow

A simple Python wrapper and command-line interface for archive.org’s "Save Page Now" capturing service
https://palewi.re/docs/savepagenow/
MIT License

error when calling capture with a specific url #23

Closed dportabella closed 4 years ago

dportabella commented 4 years ago

Using the latest version of this repo, capture_or_cache works on "http://www.example.com/" but fails on "http://worldtimeapi.org/api/timezone/Europe/London.txt":

>>> import savepagenow

# works ok, as expected:
>>> savepagenow.capture_or_cache("http://www.example.com/")
('https://web.archive.org/web/20200630123047/http://www.example.com/', True)

# error:
>>> savepagenow.capture_or_cache("http://worldtimeapi.org/api/timezone/Europe/London.txt")
Traceback (most recent call last):
  File "/private/tmp/savepagenow/savepagenow/api.py", line 50, in capture
    archive_id = response.headers['Content-Location']
  File "/usr/local/lib/python3.7/site-packages/requests/structures.py", line 54, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'content-location'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/private/tmp/savepagenow/savepagenow/api.py", line 84, in capture_or_cache
    return capture(target_url, user_agent=user_agent, accept_cache=False), True
  File "/private/tmp/savepagenow/savepagenow/api.py", line 53, in capture
    raise WaybackRuntimeError(dict(status_code=response.status_code, headers=response.headers))
savepagenow.api.WaybackRuntimeError: {'status_code': 200, 'headers': {'Server': 'nginx/1.15.8', 'Date': 'Tue, 30 Jun 2020 12:31:00 GMT', 'Content-Type': 'text/plain; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'X-Archive-Orig-Connection': 'close', 'X-Archive-Orig-Access-Control-Allow-Credentials': 'true', 'X-Archive-Orig-Access-Control-Allow-Origin': '*', 'X-Archive-Orig-Access-Control-Expose-Headers': '', 'X-Archive-Orig-Cache-Control': 'max-age=0, private, must-revalidate', 'X-Archive-Orig-Content-Length': '360', 'X-Archive-Orig-Cross-Origin-Window-Policy': 'deny', 'X-Archive-Orig-Date': 'Tue, 30 Jun 2020 12:30:56 GMT', 'X-Archive-Orig-Server': 'Cowboy', 'X-Archive-Orig-X-Content-Type-Options': 'nosniff', 'X-Archive-Orig-X-Download-Options': 'noopen', 'X-Archive-Orig-X-Frame-Options': 'SAMEORIGIN', 'X-Archive-Orig-X-Permitted-Cross-Domain-Policies': 'none', 'X-Archive-Orig-X-Request-Id': '7674e540-6b1f-4058-8330-3b0f42d9dad3', 'X-Archive-Orig-X-Runtime': '2ms', 'X-Archive-Orig-X-Xss-Protection': '1; mode=block', 'X-Archive-Orig-Via': '1.1 vegur', 'Cache-Control': 'max-age=1800', 'X-Archive-Guessed-Content-Type': 'text/plain', 'X-Archive-Guessed-Encoding': 'utf-8', 'Memento-Datetime': 'Tue, 30 Jun 2020 12:30:56 GMT', 'Link': '<http://worldtimeapi.org/api/timezone/Europe/London.txt>; rel="original", <https://web.archive.org/web/timemap/link/http://worldtimeapi.org/api/timezone/Europe/London.txt>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/http://worldtimeapi.org/api/timezone/Europe/London.txt>; rel="timegate", <https://web.archive.org/web/20181114003721/http://worldtimeapi.org:80/api/timezone/Europe/london.txt>; rel="first memento"; datetime="Wed, 14 Nov 2018 00:37:21 GMT", <https://web.archive.org/web/20200630123009/http://worldtimeapi.org/api/timezone/Europe/London.txt>; rel="prev memento"; datetime="Tue, 30 Jun 2020 12:30:09 GMT", 
<https://web.archive.org/web/20200630123056/http://worldtimeapi.org/api/timezone/Europe/London.txt>; rel="memento"; datetime="Tue, 30 Jun 2020 12:30:56 GMT", <https://web.archive.org/web/20200630123056/http://worldtimeapi.org/api/timezone/Europe/London.txt>; rel="last memento"; datetime="Tue, 30 Jun 2020 12:30:56 GMT"', 'Content-Security-Policy': "default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org", 'X-Archive-Src': 'live-20200630122811-wwwb-app15.us.archive.org.warc.gz', 'Server-Timing': 'exclusion.robots.policy;dur=0.274602, PetaboxLoader3.resolve;dur=173.598303, esindex;dur=0.020509, CDXLines.iter;dur=19.062754, exclusion.robots;dur=0.294044, captures_list;dur=370.117709, RedisCDXSource;dur=8.558711, PetaboxLoader3.datanode;dur=133.546836, LoadShardBlock;dur=335.557387, load_resource;dur=59.129336', 'X-App-Server': 'wwwb-app29', 'X-ts': '200', 'X-location': 'All', 'X-Cache-Key': 'httpsweb.archive.org/web/20200630123056/http://worldtimeapi.org/api/timezone/Europe/London.txtBR', 'X-Page-Cache': 'MISS'}}

palewire commented 4 years ago

I think something has changed in the headers archive.org is providing, but I'm not sure what yet.

palewire commented 4 years ago

This was tackled in #25 and, I hope, fixed with the new version shipped today. https://pypi.org/project/savepagenow/1.0.0/
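
For anyone still on an older version: the response in the traceback above has no Content-Location header, but its Link header does include a rel="memento" entry pointing at the capture that was just made. Below is a minimal sketch of reading the archive URL from that header instead. It is not necessarily what #25 does; the names parse_link_header and memento_url are hypothetical helpers made up for this example, and it assumes the Link header keeps the shape shown in the traceback.

```python
import re

# One link-value: <URL> followed by zero or more ;name=value parameters.
# Quoted values may contain commas (e.g. datetime="Tue, 30 Jun 2020 ..."),
# so the header cannot simply be split on ",".
_LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^\s;,]+))*)')
_PARAM_RE = re.compile(r'([\w-]+)\s*=\s*(?:"([^"]*)"|([^\s;,]+))')


def parse_link_header(value):
    """Return one dict per link, holding 'url' plus the link's parameters."""
    links = []
    for url, params in _LINK_RE.findall(value):
        link = {"url": url}
        for name, quoted, bare in _PARAM_RE.findall(params):
            link[name.lower()] = quoted or bare
        links.append(link)
    return links


def memento_url(headers):
    """Hypothetical helper: return the rel="memento" URL, or None if absent.

    `headers` is any mapping with a .get() method. With a plain dict the
    "Link" key is case-sensitive; requests' response.headers is not.
    """
    for link in parse_link_header(headers.get("Link", "")):
        # Exact match only, so "first memento", "prev memento" and
        # "last memento" entries are skipped.
        if link.get("rel") == "memento":
            return link["url"]
    return None
```

With the headers from the traceback, memento_url(response.headers) would pick out the https://web.archive.org/web/20200630123056/... entry. The exact match on rel is a deliberate simplification: it targets the capture itself rather than the first, previous, or last one in the list.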