palewire / savepagenow

A simple Python wrapper and command-line interface for archive.org’s "Save Page Now" capturing service
https://palewi.re/docs/savepagenow/
MIT License

2020-12-13 RuntimeError when submitting an URL #30

Closed HeuristicPerson closed 3 years ago

HeuristicPerson commented 3 years ago

I have a script that uses this module to automatically back up two of my websites. It recently stopped working.

Now I'm just getting this error. Any idea what the problem could be?

savepagenow.api.WaybackRuntimeError: {'headers': {'X-Archive-Orig-Connection': 'keep-alive', 'Content-Security-Policy': "default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org", 'Transfer-Encoding': 'chunked', 'X-Archive-Guessed-Charset': 'utf-8', 'X-Archive-Orig-Link': '</series/los-verdaderos-cazafantasmas>; rel="canonical",</node/1697>; rel="shortlink"', 'X-Archive-Orig-Content-Length': '3834', 'X-Archive-Orig-Date': 'Sun, 13 Dec 2020 18:33:18 GMT', 'X-Archive-Orig-Etag': '"1607884395-1"', 'X-Archive-Orig-X-Drupal-Cache': 'HIT', 'X-ts': '200', 'X-Archive-Orig-Content-Language': 'es', 'X-Archive-Screenname': '0', 'X-Archive-Orig-X-Generator': 'Drupal 7 (http://drupal.org)', 'Link': '<http://mediadb.geeklogger.com/series/los-verdaderos-cazafantasmas>; rel="original", <https://web.archive.org/web/timemap/link/http://mediadb.geeklogger.com/series/los-verdaderos-cazafantasmas>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/http://mediadb.geeklogger.com/series/los-verdaderos-cazafantasmas>; rel="timegate", <https://web.archive.org/web/20190921100349/http://mediadb.geeklogger.com/series/los-verdaderos-cazafantasmas>; rel="first memento"; datetime="Sat, 21 Sep 2019 10:03:49 GMT", <https://web.archive.org/web/20201103131802/http://mediadb.geeklogger.com/series/los-verdaderos-cazafantasmas>; rel="prev memento"; datetime="Tue, 03 Nov 2020 13:18:02 GMT", <https://web.archive.org/web/20201213183318/http://mediadb.geeklogger.com/series/los-verdaderos-cazafantasmas>; rel="memento"; datetime="Sun, 13 Dec 2020 18:33:18 GMT", <https://web.archive.org/web/20201213183318/http://mediadb.geeklogger.com/series/los-verdaderos-cazafantasmas>; rel="last memento"; datetime="Sun, 13 Dec 2020 18:33:18 GMT"', 'Memento-Datetime': 'Sun, 13 Dec 2020 18:33:18 GMT', 'X-Archive-Orig-Content-Encoding': 'gzip', 'Date': 'Sun, 13 Dec 2020 18:33:31 GMT', 'X-Archive-Orig-Cache-Control': 'public, 
max-age=86400', 'X-Archive-Orig-Vary': 'Cookie,Accept-Encoding', 'X-Cache-Key': 'httpsweb.archive.org/web/20201213183318/http://mediadb.geeklogger.com/series/los-verdaderos-cazafantasmasGB', 'X-location': 'All', 'X-Archive-Orig-Expires': 'Sun, 19 Nov 1978 05:00:00 GMT', 'X-Archive-Orig-Server': 'nginx/1.10.3', 'Server': 'nginx/1.15.8', 'Connection': 'keep-alive', 'X-Archive-Orig-Last-Modified': 'Sun, 13 Dec 2020 18:33:15 GMT', 'X-Archive-Orig-X-Frame-Options': 'SAMEORIGIN', 'Content-Encoding': 'gzip', 'X-Archive-Src': 'spn2-20201213171403-wwwb-spn10.us.archive.org-8002.warc.gz', 'X-Archive-Guessed-Content-Type': 'text/html', 'X-Page-Cache': 'MISS', 'X-App-Server': 'wwwb-app104', 'Server-Timing': 'PetaboxLoader3.datanode;dur=51.247653, exclusion.robots.policy;dur=0.257083, esindex;dur=0.026107, LoadShardBlock;dur=82.967475, exclusion.robots;dur=0.273482, captures_list;dur=108.252807, load_resource;dur=70.175761, CDXLines.iter;dur=19.034766, RedisCDXSource;dur=1.460667', 'Content-Type': 'text/html; charset=utf-8'}, 'status_code': 200}

palewire commented 3 years ago

We've been wrestling with some unpredictable changes to the Wayback Machine in the last few months. That might be the case here. Are you still having the issue?

HeuristicPerson commented 3 years ago

Yes, same result today. I tried replacing savepagenow with waybackpy to check whether the API has changed or archive.org's service has stopped working. It works, but it's much slower than savepagenow used to be, roughly ten times slower.

HeuristicPerson commented 3 years ago

Weird. I'm still getting the same error, but the response contains an archive.org URL for the saved page. One hour ago that URL opened the cached version of the page without any archive.org decoration (the usual top bar showing the date of the capture, how many times the page has changed, and so on). Now the URL shows the cached page with the full decoration.

As you said, it looks like they are making changes. If this situation persists, it's easy to catch the exception and extract the cached page URL from it. I'll keep an eye on it.

Regards
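For anyone wanting to try the workaround described above, here is a minimal sketch of pulling the capture URL out of the exception. It assumes the exception's first argument is the dict printed in the error above (with `headers` and `status_code` keys) and that the capture is the `Link` entry whose `rel` is exactly `"memento"`. The helper name `archive_url_from_error` and the shortened sample header are made up for illustration.

```python
import re

# Matches the Link entry whose rel is exactly "memento", for example:
# <https://web.archive.org/web/2020.../http://example.com/>; rel="memento"; ...
# Entries such as rel="first memento" or rel="prev memento" are skipped.
MEMENTO_RE = re.compile(r'<([^>]+)>;\s*rel="memento"')


def archive_url_from_error(exc):
    """Return the memento URL carried by the exception, or None."""
    # Assumption: exc.args[0] is the dict shown in the error message,
    # i.e. {'headers': {...}, 'status_code': 200}.
    payload = exc.args[0] if exc.args else None
    if not isinstance(payload, dict):
        return None
    link = payload.get("headers", {}).get("Link", "")
    match = MEMENTO_RE.search(link)
    return match.group(1) if match else None


# Stand-in for the real exception so the sketch runs without network access.
sample = RuntimeError({
    "status_code": 200,
    "headers": {
        "Link": (
            '<http://example.com/>; rel="original", '
            '<https://web.archive.org/web/20190921100349/http://example.com/>; '
            'rel="first memento"; datetime="Sat, 21 Sep 2019 10:03:49 GMT", '
            '<https://web.archive.org/web/20201213183318/http://example.com/>; '
            'rel="memento"; datetime="Sun, 13 Dec 2020 18:33:18 GMT"'
        )
    },
})

print(archive_url_from_error(sample))
```

In a real script the call to `savepagenow.capture(url)` would sit in a `try` block, with `except savepagenow.api.WaybackRuntimeError as exc:` handing `exc` to the helper. If the helper returns `None`, the error is probably a genuine failure and should be re-raised.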

palewire commented 3 years ago

Over the last few months, I've observed two different types of response headers. Sometimes the archive URL is in the Content-Location header; other times it is in the Link header in Memento format. I don't know why we are seeing these varying results. What I've done is add a fallback to our parser that tries both before it gives up. I will release a new version today with that change.
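The fallback idea can be sketched roughly as follows. This is not the code that shipped in the release, only an illustration of trying one header and then the other. It assumes that `Content-Location` holds a path relative to `https://web.archive.org` and that the capture in the `Link` header is the entry with `rel="memento"`. All function names here are hypothetical.

```python
import re

# Assumption: Content-Location values are paths relative to this root.
WAYBACK_ROOT = "https://web.archive.org"

# Link entry whose rel is exactly "memento" (not "first memento", etc.).
MEMENTO_RE = re.compile(r'<([^>]+)>;\s*rel="memento"')


def from_content_location(headers):
    """First strategy: read the archive path from Content-Location."""
    path = headers.get("Content-Location")
    if not path:
        return None
    # Prefix the root only when the value is a relative path.
    return WAYBACK_ROOT + path if path.startswith("/") else path


def from_link_header(headers):
    """Second strategy: read the memento URL from the Link header."""
    match = MEMENTO_RE.search(headers.get("Link", ""))
    return match.group(1) if match else None


def find_archive_url(headers):
    """Try each strategy in order and return the first URL found, else None."""
    for parser in (from_content_location, from_link_header):
        url = parser(headers)
        if url:
            return url
    return None


# Plain dicts are case-sensitive; the headers object returned by the
# requests library is case-insensitive, so real lookups are more forgiving.
print(find_archive_url({"Content-Location": "/web/20201213183318/http://example.com/"}))
```

Keeping each strategy in its own function makes it straightforward to add a third one if the Wayback Machine changes its response format again.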