palewire / savepagenow

A simple Python wrapper and command-line interface for archive.org’s "Save Page Now" capturing service
https://palewi.re/docs/savepagenow/
MIT License

KeyError: 'link' in capture #26

Closed eggplants closed 3 years ago

eggplants commented 4 years ago

When I was running this script (test_save.py), I got the following output and traceback:

$ python3.8 test_save.py
{'Server': 'nginx/1.15.8', 'Date': 'Wed, 15 Jul 2020 11:59:50 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Cache-Control': 'no-cache'}
capture: 42.963274240493774 sec.
{'Server': 'nginx/1.15.8', 'Date': 'Wed, 15 Jul 2020 12:01:37 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Cache-Control': 'no-cache'}
capture_or_cache: 97.4388906955719 sec.
Traceback (most recent call last):
  File "test_save.py", line 28, in <module>
    main()
  File "test_save.py", line 24, in main
    measure(fun, url)
  File "test_save.py", line 8, in measure                                                               
    print(f(*arg))
  File "/home/eggplants/.pyenv/versions/3.8.0/lib/python3.8/site-packages/savepagenow/api.py", line 55, in capture
    header_links = parse_header_links(response.headers['Link'])
  File "/home/eggplants/.pyenv/versions/3.8.0/lib/python3.8/site-packages/requests/structures.py", line 54, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'link'
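
(test_save.py itself isn't pasted above; a minimal sketch that hits the same code path, using an arbitrary URL, would be something like this:)

import savepagenow

# savepagenow.capture() is the call that raises KeyError: 'link' in the traceback above;
# capture_or_cache() wraps the same call.
print(savepagenow.capture("https://www.google.com"))
print(savepagenow.capture_or_cache("https://www.google.com"))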
tlcaputi commented 3 years ago

I needed to figure out a quick fix for this same problem, and I ended up writing this. It's not the most exact or beautifully written piece of code in the world, but it works for my purposes. Maybe it'll work for yours.


# MIT License

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import datetime
from time import sleep

def archive_url(
    url, 
    timeout=100, 
    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
    ):

    """Submits a URL to WebArchive's Save Page Now Feature (working as of 2020-07-31 on Python 3.6.10)

    Keyword arguments:
    url -- The url you want to archive
    timeout -- Max number of seconds you're willing to wait
    user_agent -- You can pass a custom user agent here

    """

    # POST Request
    headers = {
        'authority': 'web.archive.org',
        'cache-control': 'max-age=0',
        'upgrade-insecure-requests': '1',
        'origin': 'https://web.archive.org',
        'content-type': 'application/x-www-form-urlencoded',
        'user-agent': user_agent,
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://web.archive.org/save',
        'accept-language': 'en-US,en;q=0.9,de;q=0.8',
    }

    data = {
        'url': url,
        'capture_all': 'on'
    }

    r = requests.post(f'https://web.archive.org/save/{url}', headers=headers, data=data)

    # BS4 get SCRIPTS and find watchJob arguments
    soup = BeautifulSoup(r.content, 'html.parser')
    scripts = soup.find_all("script")

    job_id = None
    for script in scripts:
        string = script.string
        if string and "watchJob" in string:
            args_string_list = string.strip().split('"')
            job_id = args_string_list[1]
            break

    assert job_id is not None, "Couldn't find job_id in html"

    # Request status of the job
    out_url = None
    was_pending = False
    wait_time = 0
    while wait_time < timeout:

        r = requests.get(f"https://web.archive.org/save/status/{job_id}?_t={datetime.datetime.now().timestamp()}", headers=headers)
        rj = r.json()

        if rj.get('status', 'none') == "pending":
            was_pending = True

        if rj.get('status', 'none') == "success":
            original_url = rj.get('original_url', 'none')
            ext_url = f"/web/{rj['timestamp']}/{rj['original_url']}"
            out_url = urljoin('https://web.archive.org', ext_url)
            break

        seconds_to_wait = int(r.headers.get("Retry-After", 5))
        print(f"[{wait_time} seconds elapsed] Waiting for archive to complete...")
        wait_time += seconds_to_wait
        sleep(seconds_to_wait)

    assert out_url is not None, f"Process did not complete after {timeout} seconds"

    out = {
        "original_url": original_url,
        "archive_url": out_url,
        "from_cache": was_pending == False
    }

    return out

if __name__ == "__main__":
    url = "https://ultimateframedata.com/"
    print(archive_url(url))
eggplants commented 3 years ago

I'll try it soon...

palewire commented 3 years ago

Is there a way we can reliably recreate this error?

tlcaputi commented 3 years ago

I can reproduce with savepagenow -c https://www.google.com -ua "spn test" with savepagenow==1.0.1 running on Python 3.8.2 and Ubuntu 20.04 LTS.

zakyum commented 3 years ago

Not sure, but it seems like the Wayback Machine API changed? At least it happens to me every time, for every URL I've tried...

Anyway, a quick and dirty fix to accommodate the change (if it is indeed an API change) is to comment out lines 55-62 of savepagenow/api.py and insert the line archive_url = domain + response.headers['Content-Location'] after line 53.

savepagenow/api.py should then look like this (changes highlighted):

[screenshot of the modified api.py]
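
As a rough sketch of the same patch (domain and response are names taken from the description above and the traceback, not verbatim library code):

# around the old line 53, right after the request to the /save endpoint:
archive_url = domain + response.headers['Content-Location']

# with the old lines 55-62 commented out, starting with:
# header_links = parse_header_links(response.headers['Link'])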

dannguyen commented 3 years ago

FWIW, this curl HEAD request seems to succeed:

$ curl --location --head --compressed --max-time 60 --user-agent 'just-curl' https://web.archive.org/save/https://www.whitehouse.gov/news/page/700/

And here is the (derived) URL and the headers returned:

https://web.archive.org/web/20200908001519/https://www.whitehouse.gov/news/page/700/

HTTP/2 200 
server: nginx/1.15.8
date: Tue, 08 Sep 2020 00:15:20 GMT
content-type: text/html;charset=utf-8
content-location: /web/20200908001519/https://www.whitehouse.gov/news/page/700/
set-cookie: JSESSIONID=1A57B723E35BF79022A710B60D7BE7A9; Path=/; HttpOnly
x-archive-orig-strict-transport-security: max-age=31536000 ; includeSubDomains ; preload
x-archive-orig-vary: Accept-Encoding
x-archive-guessed-charset: UTF-8
x-archive-orig-x-frame-options: SAMEORIGIN
x-archive-orig-connection: close
x-archive-orig-content-type: text/html; charset=UTF-8
x-archive-orig-date: Tue, 08 Sep 2020 00:15:19 GMT
x-archive-orig-accept-ranges: bytes
x-archive-orig-server-timing: origin; dur=279
x-archive-orig-x-akamai-transformed: 9 - 0 pmb=mRUM,1
x-app-server: wwwb-app52
x-ts: 200
x-location: save-sync
x-cache-key: httpsweb.archive.org/save/https://www.whitehouse.gov/news/page/700/US
x-page-cache: MISS
content-encoding: gzip

I'm not familiar with what the web.archive.org/save endpoint returned previously, but it looks like the current iteration does not include Link among its response headers.

And no matter how recent the latest snapshot of a URL is, a request to the web.archive.org/save endpoint always returns a response whose content-location contains a snapshot URL corresponding to the time the request was made.

For example, I repeated the above curl request and got the same set of headers, with content-location corresponding to the timestamp of when I made the request (a couple minutes after my initial request):

https://web.archive.org/web/20200908001640/https://www.whitehouse.gov/news/page/700/

HTTP/2 200 
server: nginx/1.15.8
date: Tue, 08 Sep 2020 00:16:40 GMT
content-type: text/html;charset=utf-8
content-location: /web/20200908001640/https://www.whitehouse.gov/news/page/700/
set-cookie: JSESSIONID=BAC7D0EFD44943C953D4C3C; Path=/; HttpOnly
x-archive-orig-strict-transport-security: max-age=31536000 ; includeSubDomains ; preload
x-archive-orig-vary: Accept-Encoding
x-archive-guessed-charset: UTF-8
x-archive-orig-x-frame-options: SAMEORIGIN
x-archive-orig-connection: close
x-archive-orig-content-type: text/html; charset=UTF-8
x-archive-orig-date: Tue, 08 Sep 2020 00:16:40 GMT
x-archive-orig-accept-ranges: bytes
x-archive-orig-server-timing: edge; dur=2
x-archive-orig-x-akamai-transformed: 9 - 0 pmb=mRUM,1
x-app-server: wwwb-app52
x-ts: 200
x-location: save-sync
x-cache-key: httpsweb.archive.org/save/https://www.whitehouse.gov/news/page/700/US
x-page-cache: MISS
content-encoding: gzip

However, if I do a curl --head request against that just-created snapshot URL, I get a response that does include Link among its headers:

$ curl --head https://web.archive.org/web/20200908001640/https://www.whitehouse.gov/news/page/700/
HTTP/2 200 
server: nginx/1.15.8
date: Tue, 08 Sep 2020 00:19:57 GMT
content-type: text/html; charset=UTF-8
content-length: 94237
x-archive-orig-accept-ranges: bytes
x-archive-orig-x-akamai-transformed: 9 - 0 pmb=mRUM,1
x-archive-orig-vary: Accept-Encoding
x-archive-orig-date: Tue, 08 Sep 2020 00:16:40 GMT
x-archive-orig-connection: close
x-archive-orig-server-timing: cdn-cache; desc=HIT
x-archive-orig-server-timing: edge; dur=2
x-archive-orig-strict-transport-security: max-age=31536000 ; includeSubDomains ; preload
x-archive-orig-x-frame-options: SAMEORIGIN
x-archive-orig-content-length: 86892
x-archive-guessed-content-type: text/html
x-archive-guessed-charset: utf-8
memento-datetime: Tue, 08 Sep 2020 00:16:40 GMT
link: <https://www.whitehouse.gov/news/page/700/>; rel="original", <https://web.archive.org/web/timemap/link/https://www.whitehouse.gov/news/page/700/>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/https://www.whitehouse.gov/news/page/700/>; rel="timegate", <https://web.archive.org/web/20191209230029/https://www.whitehouse.gov/news/page/700/>; rel="first memento"; datetime="Mon, 09 Dec 2019 23:00:29 GMT", <https://web.archive.org/web/20200908001519/https://www.whitehouse.gov/news/page/700/>; rel="prev memento"; datetime="Tue, 08 Sep 2020 00:15:19 GMT", <https://web.archive.org/web/20200908001640/https://www.whitehouse.gov/news/page/700/>; rel="memento"; datetime="Tue, 08 Sep 2020 00:16:40 GMT", <https://web.archive.org/web/20200908001640/https://www.whitehouse.gov/news/page/700/>; rel="last memento"; datetime="Tue, 08 Sep 2020 00:16:40 GMT"
content-security-policy: default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org
x-archive-src: live-20200907233222-wwwb-app52.us.archive.org.warc.gz
server-timing: PetaboxLoader3.resolve;dur=46.564307, CDXLines.iter;dur=33.142667, load_resource;dur=50.314292, captures_list;dur=274.855653, exclusion.robots.policy;dur=0.310041, esindex;dur=0.018585, exclusion.robots;dur=0.331817, PetaboxLoader3.datanode;dur=161.893453, LoadShardBlock;dur=217.201810, RedisCDXSource;dur=17.708182
x-app-server: wwwb-app41
x-ts: 200
x-location: All
x-cache-key: httpsweb.archive.org/web/20200908001640/https://www.whitehouse.gov/news/page/700/US
x-page-cache: HIT

So it seems the breakage is happening because savepagenow expects Link/memento to be included on the first response from web.archive.org/save. However, that header info only seems to appear when doing a request on an actual snapshot URL, e.g. https://web.archive.org/web/20200908001640/https://www.whitehouse.gov/news/page/700/.

Suggested fix

Maybe it's not worth relying on the 'Link' header at all, i.e. dropping this line:

    header_links = parse_header_links(response.headers['Link'])
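
A minimal sketch of what a fallback could look like inside capture() (domain, response, and the surrounding names are assumptions inferred from the thread and the traceback, not the library's actual code):

# Prefer Content-Location, which the /save endpoint still returns,
# and only fall back to the Link/memento headers when they are present.
# (assumes parse_header_links and WaybackRuntimeError are available as in api.py)
if 'Content-Location' in response.headers:
    archive_url = domain + response.headers['Content-Location']
elif 'Link' in response.headers:
    header_links = parse_header_links(response.headers['Link'])
    # ... existing memento-parsing logic ...
else:
    raise WaybackRuntimeError("No Content-Location or Link header in the Save Page Now response")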
palewire commented 3 years ago

Hmm. Sounds like we've seen a series of small changes to the responses over the last couple of months that are frustrating our efforts here. Thank you for your patience, all. I am going to inquire with archive.org about this.

palewire commented 3 years ago

Here are the headers I get back this morning.

{
  'Server': 'nginx/1.15.8',
  'Date': 'Tue, 08 Sep 2020 15:20:24 GMT',
  'Content-Type': 'text/html;charset=utf-8',
  'Transfer-Encoding': 'chunked',
  'Connection': 'keep-alive',
  'Content-Location': '/web/20200908152024/https://www.latimes.com/',
  'Set-Cookie': 'JSESSIONID=08602978E98325B0A00FCDB374EF52F3; Path=/; HttpOnly',
  'X-Archive-Orig-X-Amz-Cf-Id': 'w8k0ZEcdhaJB2rLggI7_r-Hb2MSbmzwNwl49WJOeJw2M1udibE1-3g==',
  'X-Archive-Orig-Vary': 'Accept-Encoding',
  'X-Archive-Guessed-Charset': 'UTF-8',
  'X-Archive-Orig-Server': 'Apache Tomcat',
  'X-Archive-Orig-Connection': 'close',
  'X-Archive-Orig-X-Powered-By': 'Brightspot',
  'X-Archive-Orig-Content-Security-Policy': 'upgrade-insecure-requests',
  'X-Archive-Orig-Age': '111',
  'X-Archive-Orig-Date': 'Tue, 08 Sep 2020 15:18:33 GMT',
  'X-Archive-Orig-X-Cache': 'Hit from cloudfront',
  'X-Archive-Orig-Content-Type': 'text/html;charset=UTF-8',
  'X-Archive-Orig-Via': '1.1 58b53da3f7d231b76d30fcffbf4945a1.cloudfront.net (CloudFront)',
  'X-Archive-Orig-X-Amz-Cf-Pop': 'SFO20-C1',
  'X-Archive-Orig-Cache-Control': 'max-age=180',
  'X-Archive-Orig-Transfer-Encoding': 'chunked',
  'X-App-Server': 'wwwb-app14',
  'X-ts': '200',
  'X-location': 'save-sync',
  'X-Cache-Key': 'httpsweb.archive.org/save/https://www.latimes.com/US',
  'X-Page-Cache': 'MISS',
  'Content-Encoding': 'gzip'
}
palewire commented 3 years ago

I've pushed a change along the lines proposed here; it's live in version 1.1.0. @dannguyen and @eggplants, tell me if it fixes things for you.

https://pypi.org/project/savepagenow/1.1.0/
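
For anyone testing, upgrading from PyPI and re-running a capture from the command line should exercise the changed code path, e.g.:

pip install --upgrade savepagenow
savepagenow https://www.latimes.com/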

dannguyen commented 3 years ago

The standard use-case (saving a page that actually exists) works for me, thanks!

Also tried an obvious error situation, e.g. $ savepagenow https://myksdlfasdklfjj.com and got a WaybackRuntimeError, as expected:

 File "/Users/dan/.pyenv/versions/3.8.5/lib/python3.8/site-packages/savepagenow/api.py", line 47, in capture
     raise WaybackRuntimeError(error_header)
savepagenow.exceptions.WaybackRuntimeError: WaybackException: java.net.UnknownHostException: myksdlfasdklfjj.com: Name or service not known