palewire / savepagenow

A simple Python wrapper and command-line interface for archive.org’s "Save Page Now" capturing service
https://palewi.re/docs/savepagenow/
MIT License

Both capture and capture_or_cache fail and return errors #25

Closed eggplants closed 4 years ago

eggplants commented 4 years ago

Both capture and capture_or_cache fail and return errors.

# savepagenow 0.0.13
$ eval "python3 -c 'from savepagenow import*;"{capture,capture_or_cache}"(\"http://www.example.com/\")';"|&cat>>log
$ cat log

capture:

Traceback (most recent call last):
  File "/home/eggplants/.local/lib/python3.6/site-packages/savepagenow/api.py", line 50, in capture
    archive_id = response.headers['Content-Location']
  File "/home/eggplants/.local/lib/python3.6/site-packages/requests/structures.py", line 54, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'content-location'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/eggplants/.local/lib/python3.6/site-packages/savepagenow/api.py", line 53, in capture
    raise WaybackRuntimeError(dict(status_code=response.status_code, headers=response.headers))
savepagenow.api.WaybackRuntimeError:
{
  "status_code":200,
  "headers":{
    "Server":"nginx/1.15.8",
    "Date":"Sat, 11 Jul 2020 19:04:09 GMT",
    "Content-Type":"text/html; charset=UTF-8",
    "Transfer-Encoding":"chunked",
    "Connection":"keep-alive",
    "X-Archive-Orig-Content-Encoding":"gzip",
    "X-Archive-Orig-Age":"439714",
    "X-Archive-Orig-Cache-Control":"max-age=604800",
    "X-Archive-Orig-Date":"Sat, 11 Jul 2020 18:58:06 GMT",
    "X-Archive-Orig-Etag":"\"3147526947+gzip\"",
    "X-Archive-Orig-Expires":"Sat, 18 Jul 2020 18:58:06 GMT",
    "X-Archive-Orig-Last-Modified":"Thu, 17 Oct 2019 07:18:26 GMT",
    "X-Archive-Orig-Server":"ECS (sjc/4E76)",
    "X-Archive-Orig-Vary":"Accept-Encoding",
    "X-Archive-Orig-X-Cache":"HIT",
    "X-Archive-Orig-Content-Length":"648",
    "X-Archive-Guessed-Content-Type":"text/html",
    "X-Archive-Guessed-Charset":"utf-8",
    "Memento-Datetime":"Sat, 11 Jul 2020 18:58:06 GMT",
    "Link":"<http://www.example.com/>; rel=\"original\", <https://web.archive.org/web/timemap/link/http://www.example.com/>; rel=\"timemap\"; type=\"application/link-format\", <https://web.archive.org/web/http://www.example.com/>; rel=\"timegate\", <https://web.archive.org/web/20020120142510/http://example.com:80/>; rel=\"first memento\"; datetime=\"Sun, 20 Jan 2002 14:25:10 GMT\", <https://web.archive.org/web/20200711183038/http://example.com/>; rel=\"prev memento\"; datetime=\"Sat, 11 Jul 2020 18:30:38 GMT\", <https://web.archive.org/web/20200711185806/http://www.example.com/>; rel=\"memento\"; datetime=\"Sat, 11 Jul 2020 18:58:06 GMT\", <https://web.archive.org/web/20200711185806/http://www.example.com/>; rel=\"last memento\"; datetime=\"Sat, 11 Jul 2020 18:58:06 GMT\"",
    "Content-Security-Policy":"default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org",
    "X-Archive-Src":"spn2-20200711181600-wwwb-spn14.us.archive.org-8002.warc.gz",
    "Server-Timing":"load_resource;dur=89.916878, exclusion.robots;dur=1.158545, exclusion.robots.policy;dur=1.071748, CDXLines.iter;dur=150.374492, RedisCDXSource;dur=59.431015, LoadShardBlock;dur=617.530887, esindex;dur=0.083691, PetaboxLoader3.resolve;dur=56.383628, PetaboxLoader3.datanode;dur=482.086999",
    "X-App-Server":"wwwb-app29",
    "X-ts":"200",
    "X-location":"All",
    "X-Cache-Key":"httpsweb.archive.org/web/20200711185806/http://www.example.com/JP",
    "X-Page-Cache":"HIT",
    "Content-Encoding":"gzip"
  }
}

capture_or_cache:

Traceback (most recent call last):
  File "/home/eggplants/.local/lib/python3.6/site-packages/savepagenow/api.py", line 50, in capture
    archive_id = response.headers['Content-Location']
  File "/home/eggplants/.local/lib/python3.6/site-packages/requests/structures.py", line 54, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'content-location'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/eggplants/.local/lib/python3.6/site-packages/savepagenow/api.py", line 84, in capture_or_cache
    return capture(target_url, user_agent=user_agent, accept_cache=False), True
  File "/home/eggplants/.local/lib/python3.6/site-packages/savepagenow/api.py", line 53, in capture
    raise WaybackRuntimeError(dict(status_code=response.status_code, headers=response.headers))
savepagenow.api.WaybackRuntimeError:
{
  "status_code":200,
  "headers":{
    "Server":"nginx/1.15.8",
    "Date":"Sat, 11 Jul 2020 19:04:32 GMT",
    "Content-Type":"text/html; charset=UTF-8",
    "Transfer-Encoding":"chunked",
    "Connection":"keep-alive",
    "X-Archive-Orig-Content-Encoding":"gzip",
    "X-Archive-Orig-Age":"439714",
    "X-Archive-Orig-Cache-Control":"max-age=604800",
    "X-Archive-Orig-Date":"Sat, 11 Jul 2020 18:58:06 GMT",
    "X-Archive-Orig-Etag":"\"3147526947+gzip\"",
    "X-Archive-Orig-Expires":"Sat, 18 Jul 2020 18:58:06 GMT",
    "X-Archive-Orig-Last-Modified":"Thu, 17 Oct 2019 07:18:26 GMT",
    "X-Archive-Orig-Server":"ECS (sjc/4E76)",
    "X-Archive-Orig-Vary":"Accept-Encoding",
    "X-Archive-Orig-X-Cache":"HIT",
    "X-Archive-Orig-Content-Length":"648",
    "X-Archive-Guessed-Content-Type":"text/html",
    "X-Archive-Guessed-Charset":"utf-8",
    "Memento-Datetime":"Sat, 11 Jul 2020 18:58:06 GMT",
    "Link":"<http://www.example.com/>; rel=\"original\", <https://web.archive.org/web/timemap/link/http://www.example.com/>; rel=\"timemap\"; type=\"application/link-format\", <https://web.archive.org/web/http://www.example.com/>; rel=\"timegate\", <https://web.archive.org/web/20020120142510/http://example.com:80/>; rel=\"first memento\"; datetime=\"Sun, 20 Jan 2002 14:25:10 GMT\", <https://web.archive.org/web/20200711183038/http://example.com/>; rel=\"prev memento\"; datetime=\"Sat, 11 Jul 2020 18:30:38 GMT\", <https://web.archive.org/web/20200711185806/http://www.example.com/>; rel=\"memento\"; datetime=\"Sat, 11 Jul 2020 18:58:06 GMT\", <https://web.archive.org/web/20200711185806/http://www.example.com/>; rel=\"last memento\"; datetime=\"Sat, 11 Jul 2020 18:58:06 GMT\"",
    "Content-Security-Policy":"default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org",
    "X-Archive-Src":"spn2-20200711181600-wwwb-spn14.us.archive.org-8002.warc.gz",
    "Server-Timing":"load_resource;dur=89.916878, exclusion.robots;dur=1.158545, exclusion.robots.policy;dur=1.071748, CDXLines.iter;dur=150.374492, RedisCDXSource;dur=59.431015, LoadShardBlock;dur=617.530887, esindex;dur=0.083691, PetaboxLoader3.resolve;dur=56.383628, PetaboxLoader3.datanode;dur=482.086999",
    "X-App-Server":"wwwb-app29",
    "X-ts":"200",
    "X-location":"All",
    "X-Cache-Key":"httpsweb.archive.org/web/20200711185806/http://www.example.com/JP",
    "X-Page-Cache":"HIT",
    "Content-Encoding":"gzip"
  }
}

eggplants commented 4 years ago

@palewire Is this repo no longer maintained?

pyprism commented 4 years ago

For the last two days I have been seeing these errors too. Another service of mine that does not use this library is also failing, so I think the Wayback Machine changed something on their end; it's not a library issue.

eggplants commented 4 years ago

Should be fixed.

eggplants commented 4 years ago

I think archive_id originally received a URL like '/web/20070810113028/http://www.knams.wikimedia.org/wikimania/highquality/Wikimania05-AP1.avi'. The value for this link seems to have been moved to X-Cache-Key.
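
As a minimal sketch of that idea, the snippet below strips the leading `httpsweb.archive.org` from the X-Cache-Key value shown in the header dump above. The helper name `archive_id_from_cache_key` is hypothetical, and the format of this header is an assumption drawn only from the examples in this thread. Note that the result keeps the two-letter suffix (`JP` here) that trails the key.

```python
# Prefix observed on the X-Cache-Key values in this thread (an assumption,
# not a documented format).
CACHE_KEY_PREFIX = "httpsweb.archive.org"


def archive_id_from_cache_key(cache_key):
    """Hypothetical helper: return the part of the key after the host prefix."""
    if not cache_key.startswith(CACHE_KEY_PREFIX):
        raise ValueError("unexpected X-Cache-Key format: {!r}".format(cache_key))
    return cache_key[len(CACHE_KEY_PREFIX):]


key = "httpsweb.archive.org/web/20200711185806/http://www.example.com/JP"
print(archive_id_from_cache_key(key))
# prints: /web/20200711185806/http://www.example.com/JP
```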

eggplants commented 4 years ago

https://github.com/pastpages/savepagenow/blob/de2fd37f1f69f73c3b6febdd5d7ea126e23a9c6d/savepagenow/api.py#L50

Should be changed into:

archive_id = response.headers['X-Cache-Key'][20:]

palewire commented 4 years ago

Hey, I'm here. Just been a busy few months at my day job with COVID. Clearly something has changed on the archive.org end and I need to catch up on it.

palewire commented 4 years ago

Here's a guess. My module expects the content-location header to be present in responses from Wayback regardless of whether or not they have been newly archived or are, as in the case of your header dump above, pulled from the cache (as indicated by "X-Page-Cache":"HIT").

You can see that here, with the exception test for content-location coming before the cache header check. So if that first try fails, you never get around to testing whether it's cached.

https://github.com/pastpages/savepagenow/blob/ece69510e50f9058615677842179b890ecfc6875/savepagenow/api.py#L54-L67

A theory: content-location was dropped from cached responses and this bug could be fixed by putting the cache check first.
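
A minimal sketch of that reordering, assuming the headers arrive as a plain dict. The function name `archive_id_from_headers` is hypothetical and `CachedPage` here is a local stand-in, so this is not the library's actual code.

```python
class CachedPage(Exception):
    """Local stand-in for the exception the library raises on cached pages."""


def archive_id_from_headers(headers):
    """Hypothetical helper: check the cache flag before reading Content-Location."""
    # Test the cache header first, so a cached response is reported as cached
    # even when Content-Location is missing.
    if headers.get("X-Page-Cache") == "HIT":
        raise CachedPage("archive.org returned a cached version of this page")
    # Only a non-cached response reaches this lookup, which can still raise
    # KeyError if the header is absent.
    return headers["Content-Location"]
```

With the header dump above (`"X-Page-Cache": "HIT"` and no Content-Location), this raises `CachedPage` instead of `KeyError`.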

palewire commented 4 years ago

Flipping those code blocks around seems to have allowed me to retrieve cached URLs again:

>>> import savepagenow
>>> savepagenow.capture("https://www.google.com/")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/palewire/Code/savepagenow/savepagenow/api.py", line 62, in capture
    raise CachedPage("archive.org returned a cached version of this page: {}".format(
savepagenow.exceptions.CachedPage: archive.org returned a cached version of this page: https://web.archive.org/web/20200712162252/https://www.google.com/US

But when I tried my own personal blog, which is rarely cached, I saw that the content-location header no longer appears there either.

>>> savepagenow.capture("https://palewi.re/who-is-ben-welsh/")
Traceback (most recent call last):
  File "/home/palewire/Code/savepagenow/savepagenow/api.py", line 69, in capture
    archive_id = response.headers['Content-Location']
  File "/home/palewire/.local/share/virtualenvs/savepagenow-3AbMKd12/lib/python3.8/site-packages/requests/structures.py", line 54, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'content-location'

Here are the cleaned-up headers coming back from Wayback on that second request. The 'X-Page-Cache': 'MISS' header makes me think it was archived. Based on my rough reading, I suppose https://web.archive.org/web/20200617042900/https://palewi.re/who-is-ben-welsh/ is returned as the new memento URL. I'm not sure I trust that cache key for parsing. If you agree that's the right URL to grab, do you think we could parse it out of the Link header?


{
    'status_code': 200,
    'headers': {
        'Server': 'nginx/1.15.8',
        'Date': 'Sun, 12 Jul 2020 16:42:30 GMT',
        'Content-Type': 'text/html; charset=utf-8',
        'Transfer-Encoding': 'chunked',
        'Connection': 'keep-alive',
        'X-Archive-Orig-Date': 'Wed, 17 Jun 2020 04:29:00 GMT',
        'X-Archive-Orig-Connection': 'close',
        'X-Archive-Orig-Server': 'Apache/2.4.10 (Ubuntu) mod_wsgi/3.5 Python/2.7.8',
        'X-Archive-Orig-X-Varnish-TTL': '180s',
        'X-Archive-Orig-X-Varnish': '15974906',
        'X-Archive-Orig-Age': '0',
        'X-Archive-Orig-Via': '1.1 varnish-v4',
        'X-Archive-Orig-grace': 'none',
        'X-Archive-Orig-X-Varnish-Cache': 'HIT',
        'X-Archive-Orig-Accept-Ranges': 'bytes',
        'X-Archive-Guessed-Content-Type': 'text/html',
        'X-Archive-Guessed-Charset': 'utf-8',
        'Memento-Datetime': 'Wed, 17 Jun 2020 04:29:00 GMT',
        'Link': '<https://palewi.re/who-is-ben-welsh/>; rel="original", <https://web.archive.org/web/timemap/link/https://palewi.re/who-is-ben-welsh/>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/https://palewi.re/who-is-ben-welsh/>; rel="timegate", <https://web.archive.org/web/20120325023811/http://palewi.re:80/who-is-ben-welsh/>; rel="first memento"; datetime="Sun, 25 Mar 2012 02:38:11 GMT", <https://web.archive.org/web/20200616224750/http://palewi.re/who-is-ben-welsh/>; rel="prev memento"; datetime="Tue, 16 Jun 2020 22:47:50 GMT", <https://web.archive.org/web/20200617042900/https://palewi.re/who-is-ben-welsh/>; rel="memento"; datetime="Wed, 17 Jun 2020 04:29:00 GMT", <https://web.archive.org/web/20200617042900/https://palewi.re/who-is-ben-welsh/>; rel="last memento"; datetime="Wed, 17 Jun 2020 04:29:00 GMT"',
        'Content-Security-Policy': "default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org",
        'X-Archive-Src': 'WPO-20200617041908-crawl892/WPO-20200617042026-07284.warc.gz',
        'Server-Timing': 'load_resource;dur=332.145265, captures_list;dur=487.869940, esindex;dur=0.015959, exclusion.robots.policy;dur=0.119242, PetaboxLoader3.resolve;dur=297.914476, exclusion.robots;dur=0.128564, PetaboxLoader3.datanode;dur=365.613138, LoadShardBlock;dur=396.345722, CDXLines.iter;dur=11.993722, RedisCDXSource;dur=73.516157',
        'X-App-Server': 'wwwb-app56',
        'X-ts': '200',
        'X-location': 'All',
        'X-Cache-Key': 'httpsweb.archive.org/web/20200617042900/https://palewi.re/who-is-ben-welsh/US',
        'X-Page-Cache': 'MISS',
        'Content-Encoding': 'gzip'
    }
}

palewire commented 4 years ago

Okay. I've rewritten the core capture function to use the Link header and it appears to be working in my local branch.

https://github.com/pastpages/savepagenow/blob/0422475e509dbc9b254a820534d6f2c9cf77824e/savepagenow/api.py#L54-L76
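
The linked code is not reproduced here. As a rough sketch of the general idea only, the snippet below pulls the URL tagged rel="memento" out of a Link header value shaped like the ones in the dumps above. The helper name `memento_url` and the regular expression are assumptions for illustration, not the library's implementation.

```python
import re

# Matches one entry of a Link header: <url>; rel="...".
# Assumes rel is the first parameter after the URL, as in the dumps above.
LINK_ENTRY = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def memento_url(link_header):
    """Hypothetical helper: return the URL tagged rel="memento", or None."""
    for url, rel in LINK_ENTRY.findall(link_header):
        if rel == "memento":
            return url
    return None


# Shortened copy of the Link header from the palewi.re response above.
link = (
    '<https://palewi.re/who-is-ben-welsh/>; rel="original", '
    '<https://web.archive.org/web/20200616224750/http://palewi.re/who-is-ben-welsh/>; '
    'rel="prev memento"; datetime="Tue, 16 Jun 2020 22:47:50 GMT", '
    '<https://web.archive.org/web/20200617042900/https://palewi.re/who-is-ben-welsh/>; '
    'rel="memento"; datetime="Wed, 17 Jun 2020 04:29:00 GMT"'
)
print(memento_url(link))
# prints: https://web.archive.org/web/20200617042900/https://palewi.re/who-is-ben-welsh/
```

The sketch matches each `<url>; rel="..."` pair rather than splitting on commas, because the datetime values in the header contain commas themselves.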

I am going to do a little more testing, drop Python 2 support, and tidy up a couple of other things. If it all goes well I'll try to release a new version to PyPI this morning.

eggplants commented 4 years ago

LGTM. Thank you so much for your support. I'm looking forward to the new release.

eggplants commented 4 years ago

Are the other open issues (#14, #16 and #23) going to be solved as well?

palewire commented 4 years ago

New version is out with the fix. Let me know if it doesn't work for you.

https://pypi.org/project/savepagenow/1.0.0/

eggplants commented 4 years ago

OK. This module seems to work well.

$ pip list | grep savepagenow
savepagenow                        1.0.0
$ eval "python3 -c 'from savepagenow import*;"{capture,capture_or_cache}"(\"http://www.example.com/\")';"|&cat>log
$ cat log
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/eggplants/.pyenv/versions/3.8.0/lib/python3.8/site-packages/savepagenow/api.py", line 73, in capture
    raise CachedPage(msg)
savepagenow.exceptions.CachedPage: archive.org returned a cache of this page: https://web.archive.org/web/20200712165624/https://example.com/