palewire / savepagenow

A simple Python wrapper and command-line interface for the Wayback Machine's "Save Page Now" capturing service
MIT License

KeyError: 'link' in capture #26

Closed eggplants closed 4 years ago

eggplants commented 4 years ago

When I was running this code:

$ python3.8
{'Server': 'nginx/1.15.8', 'Date': 'Wed, 15 Jul 2020 11:59:50 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Cache-Control': 'no-cache'}
capture: 42.963274240493774 sec.
{'Server': 'nginx/1.15.8', 'Date': 'Wed, 15 Jul 2020 12:01:37 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Cache-Control': 'no-cache'}
capture_or_cache: 97.4388906955719 sec.
Traceback (most recent call last):
  File "", line 28, in <module>
  File "", line 24, in main
    measure(fun, url)
  File "", line 8, in measure                                                               
  File "/home/eggplants/.pyenv/versions/3.8.0/lib/python3.8/site-packages/savepagenow/", line 55, in capture
    header_links = parse_header_links(response.headers['Link'])
  File "/home/eggplants/.pyenv/versions/3.8.0/lib/python3.8/site-packages/requests/", line 54, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'link'
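
A note on why the error names 'link' in lowercase although the failing line asks for 'Link': the last frame of the traceback looks the key up as self._store[key.lower()][1], so the key is lowercased before the lookup and the lowercased form is what ends up in the KeyError. The class below is a simplified stand-in written only to illustrate that behaviour (the name HeaderStore is hypothetical, and this is not the actual requests implementation). It also shows that a .get() lookup returns a default instead of raising.

```python
class HeaderStore:
    """Simplified stand-in for a case-insensitive header mapping (hypothetical name)."""

    def __init__(self, headers):
        # Keep lowercased key -> (original key, value), mirroring the
        # self._store[key.lower()][1] access shown in the traceback.
        self._store = {key.lower(): (key, value) for key, value in headers.items()}

    def __getitem__(self, key):
        # A missing header raises KeyError carrying the lowercased key.
        return self._store[key.lower()][1]

    def get(self, key, default=None):
        try:
            return self[key]
        except KeyError:
            return default


headers = HeaderStore({"Server": "nginx/1.15.8", "Content-Type": "text/html"})

try:
    headers["Link"]
except KeyError as exc:
    missing_key = exc.args[0]  # the lowercased key, not the one that was asked for

link_or_none = headers.get("Link")  # no exception when the header is absent
```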

tlcaputi commented 4 years ago

I needed to figure out a quick fix for this same problem, and I ended up writing this. It's not the most exact or beautifully written piece of code in the world, but it works for my purposes. Maybe it'll work for yours.

# MIT License

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import datetime
from time import sleep

def archive_url(
    url,      # restored from the docstring below; missing from the signature as posted
    timeout,  # restored from the docstring below; any original default is unknown
    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
):
    """Submits a URL to WebArchive's Save Page Now Feature (working as of 2020-07-31 on Python 3.6.10)

    Keyword arguments:
    url -- The url you want to archive
    timeout -- Max number of seconds you're willing to wait
    user_agent -- You can pass a custom user agent here
    """

    # The endpoint URLs did not survive in this post. These two placeholder
    # names are not the author's; fill them in before running.
    save_endpoint = ''    # prefix the target URL is appended to when submitting
    status_endpoint = ''  # prefix the job id is appended to when polling

    # POST Request
    headers = {
        'authority': '',
        'cache-control': 'max-age=0',
        'upgrade-insecure-requests': '1',
        'origin': '',
        'content-type': 'application/x-www-form-urlencoded',
        'user-agent': user_agent,
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': '',
        'accept-language': 'en-US,en;q=0.9,de;q=0.8',
    }

    data = {
        'url': url,
        'capture_all': 'on'
    }

    r = requests.post(f'{save_endpoint}{url}', headers=headers, data=data)

    # BS4 get SCRIPTS and find watchJob arguments
    soup = BeautifulSoup(r.content, 'html.parser')
    scripts = soup.find_all("script")

    job_id = None
    for script in scripts:
        string = script.string
        if string and "watchJob" in string:
            # The job id is the first double-quoted argument in the script
            args_string_list = string.strip().split('"')
            job_id = args_string_list[1]

    assert job_id is not None, "Couldn't find job_id in html"

    # Request status of the job
    out_url = None
    was_pending = False
    wait_time = 0
    while wait_time < timeout:

        # The _t value was lost from the post; a current Unix timestamp is assumed here.
        cache_buster = int(datetime.datetime.now().timestamp())
        r = requests.get(f"{status_endpoint}{job_id}?_t={cache_buster}", headers=headers)
        rj = r.json()

        if rj.get('status', 'none') == "pending":
            was_pending = True

        if rj.get('status', 'none') == "success":
            original_url = rj.get('original_url', 'none')
            ext_url = f"/web/{rj['timestamp']}/{rj['original_url']}"
            out_url = urljoin('', ext_url)
            break  # stop polling once the capture has succeeded

        seconds_to_wait = int(r.headers.get("Retry-After", 5))
        print(f"[{wait_time} seconds elapsed] Waiting for archive to complete...")
        sleep(seconds_to_wait)  # actually wait before polling again
        wait_time += seconds_to_wait

    assert out_url is not None, f"Process did not complete after {timeout} seconds"

    out = {
        "original_url": original_url,
        "archive_url": out_url,
        "from_cache": not was_pending
    }

    return out

if __name__ == "__main__":
    url = ""

eggplants commented 4 years ago

I'll try it soon...

palewire commented 4 years ago

Is there a way we can reliably recreate this error?

tlcaputi commented 4 years ago

I can reproduce with savepagenow -c -ua "spn test" with savepagenow==1.0.1 running on Python 3.8.2 and Ubuntu 20.04 LTS.

zakyum commented 4 years ago

Not sure, but it seems like the Wayback Machine API changed? At least, it happens to me every time, for every URL I tried...

Anyway, a quick and dirty fix to accommodate the change (if it is indeed an API change) is to comment out lines 55-62 and insert the line archive_url = domain + response.headers['Content-Location'] after line 53 in file savepagenow/
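
As a rough, standalone sketch of that idea (not the library's code): build the snapshot URL from the Content-Location header instead of from Link. The function name and the WAYBACK_DOMAIN constant are assumptions made for illustration, the sample path is the truncated one shown later in this thread, and urljoin is used here in place of plain string concatenation.

```python
from urllib.parse import urljoin

# Assumption: the Wayback Machine's public domain. The value of `domain`
# used inside savepagenow is not shown in this thread.
WAYBACK_DOMAIN = "https://web.archive.org"


def archive_url_from_headers(headers, domain=WAYBACK_DOMAIN):
    """Build the snapshot URL from the Content-Location response header."""
    location = headers.get("Content-Location")
    if location is None:
        # Fail with a clear message rather than a bare KeyError.
        raise ValueError("response has no Content-Location header")
    return urljoin(domain, location)


# A plain dict is case-sensitive, so the key must match exactly here;
# response.headers from requests accepts any casing.
sample_headers = {"Content-Location": "/web/20200908001519/"}
snapshot = archive_url_from_headers(sample_headers)
```

With the sample header above, snapshot is the domain followed by /web/20200908001519/.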

savepagenow/ should then look like this (highlighted are the changes):


dannguyen commented 4 years ago

FWIW, this curl HEAD request seems to succeed:

$ curl --location --head --compressed --max-time 60 --user-agent 'just-curl'

And here is the (derived) URL and the headers returned:

HTTP/2 200 
server: nginx/1.15.8
date: Tue, 08 Sep 2020 00:15:20 GMT
content-type: text/html;charset=utf-8
content-location: /web/20200908001519/
set-cookie: JSESSIONID=1A57B723E35BF79022A710B60D7BE7A9; Path=/; HttpOnly
x-archive-orig-strict-transport-security: max-age=31536000 ; includeSubDomains ; preload
x-archive-orig-vary: Accept-Encoding
x-archive-guessed-charset: UTF-8
x-archive-orig-x-frame-options: SAMEORIGIN
x-archive-orig-connection: close
x-archive-orig-content-type: text/html; charset=UTF-8
x-archive-orig-date: Tue, 08 Sep 2020 00:15:19 GMT
x-archive-orig-accept-ranges: bytes
x-archive-orig-server-timing: origin; dur=279
x-archive-orig-x-akamai-transformed: 9 - 0 pmb=mRUM,1
x-app-server: wwwb-app52
x-ts: 200
x-location: save-sync
x-page-cache: MISS
content-encoding: gzip

I'm not familiar with what the endpoint returned previously, but it looks like the current iteration does not return Link among its headers.

And no matter how recent its most current snapshot of a URL is, making a request to the endpoint will always result in a response in which content-location contains a snapshot URL corresponding to the timestamp of when the request was made.

For example, I repeated the above curl request and got the same set of headers, with content-location corresponding to the timestamp of when I made the request (a couple minutes after my initial request):

HTTP/2 200 
server: nginx/1.15.8
date: Tue, 08 Sep 2020 00:16:40 GMT
content-type: text/html;charset=utf-8
content-location: /web/20200908001640/
set-cookie: JSESSIONID=BAC7D0EFD44943C953D4C3C; Path=/; HttpOnly
x-archive-orig-strict-transport-security: max-age=31536000 ; includeSubDomains ; preload
x-archive-orig-vary: Accept-Encoding
x-archive-guessed-charset: UTF-8
x-archive-orig-x-frame-options: SAMEORIGIN
x-archive-orig-connection: close
x-archive-orig-content-type: text/html; charset=UTF-8
x-archive-orig-date: Tue, 08 Sep 2020 00:16:40 GMT
x-archive-orig-accept-ranges: bytes
x-archive-orig-server-timing: edge; dur=2
x-archive-orig-x-akamai-transformed: 9 - 0 pmb=mRUM,1
x-app-server: wwwb-app52
x-ts: 200
x-location: save-sync
x-page-cache: MISS
content-encoding: gzip
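
To make the "timestamp of when the request was made" observation concrete, here is a small sketch that compares each response's date header with the 14-digit timestamp in its content-location. It uses only the two header pairs quoted above; the helper name snapshot_time is hypothetical, and it assumes the timestamp is always the second path segment, as in these samples.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def snapshot_time(content_location):
    """Extract the capture time from a /web/<YYYYMMDDhhmmss>/... path."""
    stamp = content_location.split("/")[2]
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


# (date header, content-location header) pairs from the two responses above
responses = [
    ("Tue, 08 Sep 2020 00:15:20 GMT", "/web/20200908001519/"),
    ("Tue, 08 Sep 2020 00:16:40 GMT", "/web/20200908001640/"),
]

lags = []
for date_header, content_location in responses:
    served = parsedate_to_datetime(date_header)
    captured = snapshot_time(content_location)
    # Seconds between the capture timestamp and the response being served
    lags.append((served - captured).total_seconds())
```

For these two responses the lag comes out at one second and zero seconds, which fits a fresh capture being made on each request.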

However, if I do a curl --head request for the snapshot URL given in that latest content-location (only a couple of minutes old), I get a response that includes Link among its headers:

$ curl --head
HTTP/2 200 

server: nginx/1.15.8
date: Tue, 08 Sep 2020 00:19:57 GMT
content-type: text/html; charset=UTF-8
content-length: 94237
x-archive-orig-accept-ranges: bytes
x-archive-orig-x-akamai-transformed: 9 - 0 pmb=mRUM,1
x-archive-orig-vary: Accept-Encoding
x-archive-orig-date: Tue, 08 Sep 2020 00:16:40 GMT
x-archive-orig-connection: close
x-archive-orig-server-timing: cdn-cache; desc=HIT
x-archive-orig-server-timing: edge; dur=2
x-archive-orig-strict-transport-security: max-age=31536000 ; includeSubDomains ; preload
x-archive-orig-x-frame-options: SAMEORIGIN
x-archive-orig-content-length: 86892
x-archive-guessed-content-type: text/html
x-archive-guessed-charset: utf-8
memento-datetime: Tue, 08 Sep 2020 00:16:40 GMT
link: <>; rel="original", <>; rel="timemap"; type="application/link-format", <>; rel="timegate", <>; rel="first memento"; datetime="Mon, 09 Dec 2019 23:00:29 GMT", <>; rel="prev memento"; datetime="Tue, 08 Sep 2020 00:15:19 GMT", <>; rel="memento"; datetime="Tue, 08 Sep 2020 00:16:40 GMT", <>; rel="last memento"; datetime="Tue, 08 Sep 2020 00:16:40 GMT"
content-security-policy: default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob:
server-timing: PetaboxLoader3.resolve;dur=46.564307, CDXLines.iter;dur=33.142667, load_resource;dur=50.314292, captures_list;dur=274.855653, exclusion.robots.policy;dur=0.310041, esindex;dur=0.018585, exclusion.robots;dur=0.331817, PetaboxLoader3.datanode;dur=161.893453, LoadShardBlock;dur=217.201810, RedisCDXSource;dur=17.708182
x-app-server: wwwb-app41
x-ts: 200
x-location: All
x-page-cache: HIT

So it seems the breakage is happening because savepagenow expects Link/memento to be included on the first response from the endpoint. However, that header info only seems to appear when doing a request on an actual snapshot URL, e.g.
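
For illustration, a minimal sketch of picking the rel="memento" entry out of a Link header like the one above, returning None when the header is absent instead of raising. The parser and both function names are hypothetical and are not what savepagenow or requests actually use; the URLs in the sample are made-up placeholders on reserved example domains, because the real ones are not preserved in this thread.

```python
import re

# One link entry: <target> followed by any number of ; name="value" parameters
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_link_header(value):
    """Split a Link header into dicts holding 'url' plus each quoted parameter."""
    links = []
    for url, params in LINK_RE.findall(value):
        entry = {"url": url}
        entry.update(PARAM_RE.findall(params))
        links.append(entry)
    return links


def find_memento(headers):
    """Return the rel="memento" entry, or None if there is no Link header."""
    value = headers.get("link")
    if value is None:
        return None
    for entry in parse_link_header(value):
        if entry.get("rel") == "memento":
            return entry
    return None


# Placeholder URLs; the datetimes are the ones from the header shown above.
snapshot_headers = {
    "link": (
        '<https://example.org/>; rel="original", '
        '<https://archive.example/web/20191209230029/https://example.org/>; '
        'rel="first memento"; datetime="Mon, 09 Dec 2019 23:00:29 GMT", '
        '<https://archive.example/web/20200908001640/https://example.org/>; '
        'rel="memento"; datetime="Tue, 08 Sep 2020 00:16:40 GMT"'
    )
}
save_headers = {"content-location": "/web/20200908001640/"}  # no Link header

memento = find_memento(snapshot_headers)
missing = find_memento(save_headers)
```

Commas inside the quoted datetime values are the reason the sketch matches whole entries with a regular expression rather than splitting the header on commas.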

Suggested fix

Maybe it's not worth checking for the 'Link' header, i.e.

    header_links = parse_header_links(response.headers['Link'])

palewire commented 4 years ago

Hmm. Sounds like we've seen a series of small changes to the responses over the last couple of months, which has been frustrating our efforts here. Thank you all for your patience. I am going to inquire about this.

palewire commented 4 years ago

Here are the headers I get back this morning.

{
  'Server': 'nginx/1.15.8',
  'Date': 'Tue, 08 Sep 2020 15:20:24 GMT',
  'Content-Type': 'text/html;charset=utf-8',
  'Transfer-Encoding': 'chunked',
  'Connection': 'keep-alive',
  'Content-Location': '/web/20200908152024/',
  'Set-Cookie': 'JSESSIONID=08602978E98325B0A00FCDB374EF52F3; Path=/; HttpOnly',
  'X-Archive-Orig-X-Amz-Cf-Id': 'w8k0ZEcdhaJB2rLggI7_r-Hb2MSbmzwNwl49WJOeJw2M1udibE1-3g==',
  'X-Archive-Orig-Vary': 'Accept-Encoding',
  'X-Archive-Guessed-Charset': 'UTF-8',
  'X-Archive-Orig-Server': 'Apache Tomcat',
  'X-Archive-Orig-Connection': 'close',
  'X-Archive-Orig-X-Powered-By': 'Brightspot',
  'X-Archive-Orig-Content-Security-Policy': 'upgrade-insecure-requests',
  'X-Archive-Orig-Age': '111',
  'X-Archive-Orig-Date': 'Tue, 08 Sep 2020 15:18:33 GMT',
  'X-Archive-Orig-X-Cache': 'Hit from cloudfront',
  'X-Archive-Orig-Content-Type': 'text/html;charset=UTF-8',
  'X-Archive-Orig-Via': '1.1 (CloudFront)',
  'X-Archive-Orig-X-Amz-Cf-Pop': 'SFO20-C1',
  'X-Archive-Orig-Cache-Control': 'max-age=180',
  'X-Archive-Orig-Transfer-Encoding': 'chunked',
  'X-App-Server': 'wwwb-app14',
  'X-ts': '200',
  'X-location': 'save-sync',
  'X-Cache-Key': '',
  'X-Page-Cache': 'MISS',
  'Content-Encoding': 'gzip'
}

palewire commented 4 years ago

I've pushed a change as proposed here live in version 1.1.0. @dannguyen and @eggplants, tell me if it fixes things for you.

dannguyen commented 4 years ago

The standard use-case (saving a page that actually exists) works for me, thanks!

Also tried an obvious error situation, e.g. $ savepagenow and got a WaybackRuntimeError, as expected:

 File "/Users/dan/.pyenv/versions/3.8.5/lib/python3.8/site-packages/savepagenow/", line 47, in capture
     raise WaybackRuntimeError(error_header)
savepagenow.exceptions.WaybackRuntimeError: WaybackException: Name or service not known