I needed to figure out a quick fix for this same problem, and I ended up writing this. It's not the most exact or beautifully written piece of code in the world, but it works for my purposes. Maybe it'll work for yours.
# MIT License
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import datetime
from time import sleep


def archive_url(
    url,
    timeout=100,
    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
):
    """Submits a URL to WebArchive's Save Page Now feature (working as of 2020-07-31 on Python 3.6.10).

    Keyword arguments:
    url -- the URL you want to archive
    timeout -- max number of seconds you're willing to wait
    user_agent -- you can pass a custom user agent here
    """
    # POST request to the Save Page Now form endpoint
    headers = {
        'authority': 'web.archive.org',
        'cache-control': 'max-age=0',
        'upgrade-insecure-requests': '1',
        'origin': 'https://web.archive.org',
        'content-type': 'application/x-www-form-urlencoded',
        'user-agent': user_agent,
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://web.archive.org/save',
        'accept-language': 'en-US,en;q=0.9,de;q=0.8',
    }
    data = {
        'url': url,
        'capture_all': 'on'
    }
    r = requests.post(f'https://web.archive.org/save/{url}', headers=headers, data=data)

    # Parse the <script> tags with BS4 and pull the job id out of the watchJob() arguments
    soup = BeautifulSoup(r.content, 'html.parser')
    scripts = soup.find_all("script")
    job_id = None
    for script in scripts:
        string = script.string
        if string and "watchJob" in string:
            args_string_list = string.strip().split('"')
            job_id = args_string_list[1]
            break
    assert job_id is not None, "Couldn't find job_id in html"

    # Poll the status endpoint for the job until it succeeds or the timeout is reached
    out_url = None
    original_url = None
    was_pending = False
    wait_time = 0
    while wait_time < timeout:
        r = requests.get(f"https://web.archive.org/save/status/{job_id}?_t={datetime.datetime.now().timestamp()}", headers=headers)
        rj = r.json()
        if rj.get('status', 'none') == "pending":
            was_pending = True
        if rj.get('status', 'none') == "success":
            original_url = rj.get('original_url', 'none')
            ext_url = f"/web/{rj['timestamp']}/{rj['original_url']}"
            out_url = urljoin('https://web.archive.org', ext_url)
            break
        seconds_to_wait = int(r.headers.get("Retry-After", 5))
        print(f"[{wait_time} seconds elapsed] Waiting for archive to complete...")
        wait_time += seconds_to_wait
        sleep(seconds_to_wait)
    assert out_url is not None, f"Process did not complete after {timeout} seconds"

    out = {
        "original_url": original_url,
        "archive_url": out_url,
        # If the job never went through a "pending" state, the snapshot was served from cache
        "from_cache": not was_pending,
    }
    return out


if __name__ == "__main__":
    url = "https://ultimateframedata.com/"
    print(archive_url(url))
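For reference, a successful run returns a dict shaped like this (the timestamp below is made up):

{
    "original_url": "https://ultimateframedata.com/",
    "archive_url": "https://web.archive.org/web/20200731123456/https://ultimateframedata.com/",
    "from_cache": False
}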
I'll try it soon...
Is there a way we can reliably recreate this error?
I can reproduce with savepagenow -c https://www.google.com -ua "spn test" using savepagenow==1.0.1, running on Python 3.8.2 and Ubuntu 20.04 LTS.
Not sure, but it seems like the Wayback Machine API changed? At least it happens to me every time, for every URL I've tried...
Anyway, a quick and dirty fix to accommodate the change (if it is indeed an API change) is to comment out lines 55-62 of savepagenow/api.py and insert the line archive_url = domain + response.headers['Content-Location'] after line 53, so that the archive URL is built directly from the Content-Location header instead of being parsed out of the Link header.
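If it helps, here is the same idea as a self-contained snippet. This is not the actual api.py code; the function name and headers below are just for illustration:

import requests

def capture_via_content_location(target_url, user_agent="savepagenow-issue test"):
    """Submit target_url to Save Page Now and build the snapshot URL from the
    Content-Location header of the response (quick-and-dirty illustration)."""
    domain = "https://web.archive.org"
    response = requests.get(
        domain + "/save/" + target_url,
        headers={"User-Agent": user_agent},
    )
    response.raise_for_status()
    # e.g. Content-Location: /web/20200908001519/https://www.whitehouse.gov/news/page/700/
    return domain + response.headers["Content-Location"]

print(capture_via_content_location("https://www.whitehouse.gov/news/page/700/"))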
FWIW, this curl HEAD request seems to succeed:
$ curl --location --head --compressed --max-time 60 --user-agent 'just-curl' https://web.archive.org/save/https://www.whitehouse.gov/news/page/700/
And here is the (derived) URL and the headers returned:
https://web.archive.org/web/20200908001519/https://www.whitehouse.gov/news/page/700/
HTTP/2 200
server: nginx/1.15.8
date: Tue, 08 Sep 2020 00:15:20 GMT
content-type: text/html;charset=utf-8
content-location: /web/20200908001519/https://www.whitehouse.gov/news/page/700/
set-cookie: JSESSIONID=1A57B723E35BF79022A710B60D7BE7A9; Path=/; HttpOnly
x-archive-orig-strict-transport-security: max-age=31536000 ; includeSubDomains ; preload
x-archive-orig-vary: Accept-Encoding
x-archive-guessed-charset: UTF-8
x-archive-orig-x-frame-options: SAMEORIGIN
x-archive-orig-connection: close
x-archive-orig-content-type: text/html; charset=UTF-8
x-archive-orig-date: Tue, 08 Sep 2020 00:15:19 GMT
x-archive-orig-accept-ranges: bytes
x-archive-orig-server-timing: origin; dur=279
x-archive-orig-x-akamai-transformed: 9 - 0 pmb=mRUM,1
x-app-server: wwwb-app52
x-ts: 200
x-location: save-sync
x-cache-key: httpsweb.archive.org/save/https://www.whitehouse.gov/news/page/700/US
x-page-cache: MISS
content-encoding: gzip
I'm not familiar with what the web.archive.org/save endpoint returned previously, but it looks like the current iteration does not return Link among its headers. And no matter how recent its most current snapshot of a URL is, making a request to the web.archive.org/save endpoint will always result in a response whose content-location contains a snapshot URL corresponding to the timestamp of when the request was made.
For example, I repeated the above curl request and got the same set of headers, with content-location corresponding to the timestamp of when I made the request (a couple minutes after my initial request):
https://web.archive.org/web/20200908001640/https://www.whitehouse.gov/news/page/700/
HTTP/2 200
server: nginx/1.15.8
date: Tue, 08 Sep 2020 00:16:40 GMT
content-type: text/html;charset=utf-8
content-location: /web/20200908001640/https://www.whitehouse.gov/news/page/700/
set-cookie: JSESSIONID=BAC7D0EFD44943C953D4C3C; Path=/; HttpOnly
x-archive-orig-strict-transport-security: max-age=31536000 ; includeSubDomains ; preload
x-archive-orig-vary: Accept-Encoding
x-archive-guessed-charset: UTF-8
x-archive-orig-x-frame-options: SAMEORIGIN
x-archive-orig-connection: close
x-archive-orig-content-type: text/html; charset=UTF-8
x-archive-orig-date: Tue, 08 Sep 2020 00:16:40 GMT
x-archive-orig-accept-ranges: bytes
x-archive-orig-server-timing: edge; dur=2
x-archive-orig-x-akamai-transformed: 9 - 0 pmb=mRUM,1
x-app-server: wwwb-app52
x-ts: 200
x-location: save-sync
x-cache-key: httpsweb.archive.org/save/https://www.whitehouse.gov/news/page/700/US
x-page-cache: MISS
content-encoding: gzip
However, if I do a curl --head request for that most-recent-by-just-a-couple-minutes snapshot URL, I get a response that does include Link among its headers:
$ curl --head https://web.archive.org/web/20200908001640/https://www.whitehouse.gov/news/page/700/
HTTP/2 200
server: nginx/1.15.8
date: Tue, 08 Sep 2020 00:19:57 GMT
content-type: text/html; charset=UTF-8
content-length: 94237
x-archive-orig-accept-ranges: bytes
x-archive-orig-x-akamai-transformed: 9 - 0 pmb=mRUM,1
x-archive-orig-vary: Accept-Encoding
x-archive-orig-date: Tue, 08 Sep 2020 00:16:40 GMT
x-archive-orig-connection: close
x-archive-orig-server-timing: cdn-cache; desc=HIT
x-archive-orig-server-timing: edge; dur=2
x-archive-orig-strict-transport-security: max-age=31536000 ; includeSubDomains ; preload
x-archive-orig-x-frame-options: SAMEORIGIN
x-archive-orig-content-length: 86892
x-archive-guessed-content-type: text/html
x-archive-guessed-charset: utf-8
memento-datetime: Tue, 08 Sep 2020 00:16:40 GMT
link: <https://www.whitehouse.gov/news/page/700/>; rel="original", <https://web.archive.org/web/timemap/link/https://www.whitehouse.gov/news/page/700/>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/https://www.whitehouse.gov/news/page/700/>; rel="timegate", <https://web.archive.org/web/20191209230029/https://www.whitehouse.gov/news/page/700/>; rel="first memento"; datetime="Mon, 09 Dec 2019 23:00:29 GMT", <https://web.archive.org/web/20200908001519/https://www.whitehouse.gov/news/page/700/>; rel="prev memento"; datetime="Tue, 08 Sep 2020 00:15:19 GMT", <https://web.archive.org/web/20200908001640/https://www.whitehouse.gov/news/page/700/>; rel="memento"; datetime="Tue, 08 Sep 2020 00:16:40 GMT", <https://web.archive.org/web/20200908001640/https://www.whitehouse.gov/news/page/700/>; rel="last memento"; datetime="Tue, 08 Sep 2020 00:16:40 GMT"
content-security-policy: default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org
x-archive-src: live-20200907233222-wwwb-app52.us.archive.org.warc.gz
server-timing: PetaboxLoader3.resolve;dur=46.564307, CDXLines.iter;dur=33.142667, load_resource;dur=50.314292, captures_list;dur=274.855653, exclusion.robots.policy;dur=0.310041, esindex;dur=0.018585, exclusion.robots;dur=0.331817, PetaboxLoader3.datanode;dur=161.893453, LoadShardBlock;dur=217.201810, RedisCDXSource;dur=17.708182
x-app-server: wwwb-app41
x-ts: 200
x-location: All
x-cache-key: httpsweb.archive.org/web/20200908001640/https://www.whitehouse.gov/news/page/700/US
x-page-cache: HIT
So it seems the breakage is happening because savepagenow expects Link/memento headers to be included in the first response from web.archive.org/save. However, that header info only seems to appear when requesting an actual snapshot URL, e.g. https://web.archive.org/web/20200908001640/https://www.whitehouse.gov/news/page/700/.
Maybe it's not worth checking for the 'Link' header at all, i.e. the line header_links = parse_header_links(response.headers['Link']).
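Something along these lines might be more defensive: prefer Content-Location and only fall back to the Link header if it's missing. This is just a sketch, not savepagenow's actual code, and the function name is made up; parse_header_links is the helper from requests.utils:

from requests.utils import parse_header_links

def snapshot_url_from_save_response(response, domain="https://web.archive.org"):
    """Derive the snapshot URL from a web.archive.org/save response (sketch)."""
    # Current behaviour: /save responses carry a relative Content-Location header
    content_location = response.headers.get("Content-Location")
    if content_location:
        return domain + content_location
    # Older behaviour: the snapshot URL was advertised in the Link header as a memento
    for link in parse_header_links(response.headers.get("Link", "")):
        if link.get("rel") == "memento":
            return link["url"]
    raise ValueError("No snapshot URL found in Content-Location or Link headers")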
Hmm. Sounds like we've seen a series of small changes to the responses over the last couple of months that have frustrated our efforts here. Thank you for the patience, all. I am going to inquire with archive.org about this.
Here are the headers I get back this morning.
{
    'Server': 'nginx/1.15.8',
    'Date': 'Tue, 08 Sep 2020 15:20:24 GMT',
    'Content-Type': 'text/html;charset=utf-8',
    'Transfer-Encoding': 'chunked',
    'Connection': 'keep-alive',
    'Content-Location': '/web/20200908152024/https://www.latimes.com/',
    'Set-Cookie': 'JSESSIONID=08602978E98325B0A00FCDB374EF52F3; Path=/; HttpOnly',
    'X-Archive-Orig-X-Amz-Cf-Id': 'w8k0ZEcdhaJB2rLggI7_r-Hb2MSbmzwNwl49WJOeJw2M1udibE1-3g==',
    'X-Archive-Orig-Vary': 'Accept-Encoding',
    'X-Archive-Guessed-Charset': 'UTF-8',
    'X-Archive-Orig-Server': 'Apache Tomcat',
    'X-Archive-Orig-Connection': 'close',
    'X-Archive-Orig-X-Powered-By': 'Brightspot',
    'X-Archive-Orig-Content-Security-Policy': 'upgrade-insecure-requests',
    'X-Archive-Orig-Age': '111',
    'X-Archive-Orig-Date': 'Tue, 08 Sep 2020 15:18:33 GMT',
    'X-Archive-Orig-X-Cache': 'Hit from cloudfront',
    'X-Archive-Orig-Content-Type': 'text/html;charset=UTF-8',
    'X-Archive-Orig-Via': '1.1 58b53da3f7d231b76d30fcffbf4945a1.cloudfront.net (CloudFront)',
    'X-Archive-Orig-X-Amz-Cf-Pop': 'SFO20-C1',
    'X-Archive-Orig-Cache-Control': 'max-age=180',
    'X-Archive-Orig-Transfer-Encoding': 'chunked',
    'X-App-Server': 'wwwb-app14',
    'X-ts': '200',
    'X-location': 'save-sync',
    'X-Cache-Key': 'httpsweb.archive.org/save/https://www.latimes.com/US',
    'X-Page-Cache': 'MISS',
    'Content-Encoding': 'gzip'
}
I've pushed the change proposed here live in version 1.1.0. @dannguyen and @eggplants, tell me if it fixes things for you.
The standard use-case (saving a page that actually exists) works for me, thanks!
Also tried an obvious error situation, e.g. $ savepagenow https://myksdlfasdklfjj.com, and got a WaybackRuntimeError, as expected:
  File "/Users/dan/.pyenv/versions/3.8.5/lib/python3.8/site-packages/savepagenow/api.py", line 47, in capture
    raise WaybackRuntimeError(error_header)
savepagenow.exceptions.WaybackRuntimeError: WaybackException: java.net.UnknownHostException: myksdlfasdklfjj.com: Name or service not known
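For anyone calling the library from Python instead of the CLI, the same error case can be handled by catching the exception around capture(). A minimal sketch, assuming the 1.1.0 API where capture() returns the snapshot URL as a string:

import savepagenow
from savepagenow.exceptions import WaybackRuntimeError

try:
    archive_url = savepagenow.capture("https://myksdlfasdklfjj.com")
    print(archive_url)
except WaybackRuntimeError as e:
    # e.g. WaybackException: java.net.UnknownHostException: ... Name or service not known
    print("Save Page Now could not archive the URL:", e)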
When I was running this code: