Closed: eggplants closed this issue 4 years ago
@palewire Is this repo no longer maintained?
For the last two days I have also been facing these errors. Another service of mine that doesn't use this library is failing too, so I think the Wayback Machine changed something on their end; it's not a library issue.
Should be fixed.
I think `archive_id` originally received a URL like '/web/20070810113028/http://www.knams.wikimedia.org/wikimania/highquality/Wikimania05-AP1.avi'. The key for this link seems to have been moved to `X-Cache-Key`.
It should be changed to:
archive_id = response.headers['X-Cache-Key'][20::]
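For illustration, here is what that slice would do against the `X-Cache-Key` value that appears in the header dump later in this thread. The 20-character prefix is my own observation from that one sample, not documented behavior:

```python
# Illustration only: assumes the X-Cache-Key value always starts with the
# 20-character prefix "httpsweb.archive.org", as in the sample header dump below.
cache_key = "httpsweb.archive.org/web/20200617042900/https://palewi.re/who-is-ben-welsh/US"

archive_id = cache_key[20:]  # drop the "httpsweb.archive.org" prefix
print(archive_id)
# /web/20200617042900/https://palewi.re/who-is-ben-welsh/US
# Note the trailing "US", one reason this key may be a shaky thing to parse.
```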
Hey, I'm here. Just been a busy few months at my day job with COVID. Clearly something has changed on the archive.org end and I need to catch up on it.
Here's a guess. My module expects the `content-location` header to be present in responses from Wayback regardless of whether or not they have been newly archived or are, as in the case of your header dump above, pulled from the cache (as indicated by `"X-Page-Cache": "HIT"`).
You can see that here, with the exception test for `content-location` coming before the cache header check. So if that first try fails, you never get around to testing whether it's cached.
A theory: `content-location` was dropped from cached responses, and this bug could be fixed by putting the cache check first.
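Roughly, the reordering I have in mind looks like this (a sketch only, not the actual savepagenow source; `response` is assumed to be the requests.Response that comes back from the Wayback request):

```python
from savepagenow.exceptions import CachedPage


def extract_archive_id(response):
    """Hypothetical helper illustrating the proposed order of checks."""
    # Test the cache indicator first, so a cached response raises CachedPage
    # instead of dying with a KeyError on the missing Content-Location header.
    if response.headers.get("X-Page-Cache") == "HIT":
        raise CachedPage("archive.org returned a cached version of this page")
    # Only a freshly archived response is expected to carry Content-Location.
    return response.headers["Content-Location"]
```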
Flipping those code blocks around seems to have allowed me to retrieve cached URLs again:
```
>>> import savepagenow
>>> savepagenow.capture("https://www.google.com/")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/palewire/Code/savepagenow/savepagenow/api.py", line 62, in capture
raise CachedPage("archive.org returned a cached version of this page: {}".format(
savepagenow.exceptions.CachedPage: archive.org returned a cached version of this page: https://web.archive.org/web/20200712162252/https://www.google.com/US
```
But when I tried my own personal blog, which is rarely cached, I see that the `content-location` header no longer appears there either.
```
>>> savepagenow.capture("https://palewi.re/who-is-ben-welsh/")
Traceback (most recent call last):
File "/home/palewire/Code/savepagenow/savepagenow/api.py", line 69, in capture
archive_id = response.headers['Content-Location']
File "/home/palewire/.local/share/virtualenvs/savepagenow-3AbMKd12/lib/python3.8/site-packages/requests/structures.py", line 54, in __getitem__
return self._store[key.lower()][1]
KeyError: 'content-location'
```
Here are the cleaned-up headers coming back from Wayback on that second request. The `'X-Page-Cache': 'MISS'` header makes me think it was newly archived. And based on my rough reading, I suppose https://web.archive.org/web/20200617042900/https://palewi.re/who-is-ben-welsh/ is returned as the new memento URL. I'm not sure I trust that cache key for parsing. If you agree that's the right URL to grab, do you think we could parse it out of the `Link` header?
```
{
'status_code': 200,
'headers': {
'Server': 'nginx/1.15.8',
'Date': 'Sun, 12 Jul 2020 16:42:30 GMT',
'Content-Type': 'text/html; charset=utf-8',
'Transfer-Encoding': 'chunked',
'Connection': 'keep-alive',
'X-Archive-Orig-Date': 'Wed, 17 Jun 2020 04:29:00 GMT',
'X-Archive-Orig-Connection': 'close',
'X-Archive-Orig-Server': 'Apache/2.4.10 (Ubuntu) mod_wsgi/3.5 Python/2.7.8',
'X-Archive-Orig-X-Varnish-TTL': '180s',
'X-Archive-Orig-X-Varnish': '15974906',
'X-Archive-Orig-Age': '0',
'X-Archive-Orig-Via': '1.1 varnish-v4',
'X-Archive-Orig-grace': 'none',
'X-Archive-Orig-X-Varnish-Cache': 'HIT',
'X-Archive-Orig-Accept-Ranges': 'bytes',
'X-Archive-Guessed-Content-Type': 'text/html',
'X-Archive-Guessed-Charset': 'utf-8',
'Memento-Datetime': 'Wed, 17 Jun 2020 04:29:00 GMT',
'Link': '<https://palewi.re/who-is-ben-welsh/>; rel="original", <https://web.archive.org/web/timemap/link/https://palewi.re/who-is-ben-welsh/>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/https://palewi.re/who-is-ben-welsh/>; rel="timegate", <https://web.archive.org/web/20120325023811/http://palewi.re:80/who-is-ben-welsh/>; rel="first memento"; datetime="Sun, 25 Mar 2012 02:38:11 GMT", <https://web.archive.org/web/20200616224750/http://palewi.re/who-is-ben-welsh/>; rel="prev memento"; datetime="Tue, 16 Jun 2020 22:47:50 GMT", <https://web.archive.org/web/20200617042900/https://palewi.re/who-is-ben-welsh/>; rel="memento"; datetime="Wed, 17 Jun 2020 04:29:00 GMT", <https://web.archive.org/web/20200617042900/https://palewi.re/who-is-ben-welsh/>; rel="last memento"; datetime="Wed, 17 Jun 2020 04:29:00 GMT"',
'Content-Security-Policy': "default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org",
'X-Archive-Src': 'WPO-20200617041908-crawl892/WPO-20200617042026-07284.warc.gz',
'Server-Timing': 'load_resource;dur=332.145265, captures_list;dur=487.869940, esindex;dur=0.015959, exclusion.robots.policy;dur=0.119242, PetaboxLoader3.resolve;dur=297.914476, exclusion.robots;dur=0.128564, PetaboxLoader3.datanode;dur=365.613138, LoadShardBlock;dur=396.345722, CDXLines.iter;dur=11.993722, RedisCDXSource;dur=73.516157',
'X-App-Server': 'wwwb-app56',
'X-ts': '200',
'X-location': 'All',
'X-Cache-Key': 'httpsweb.archive.org/web/20200617042900/https://palewi.re/who-is-ben-welsh/US',
'X-Page-Cache': 'MISS',
'Content-Encoding': 'gzip'
}
}
```
Okay. I've rewritten the core capture function to use the `Link` header and it appears to be working in my local branch.
I am going to do a little more testing, drop Python 2 support, and tidy up a couple of other things. If it all goes well I'll try to release a new version to PyPI this morning.
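For anyone following along, here's a minimal sketch of the `Link`-header approach, based on the header dump above. The `parse_memento_url` helper is hypothetical and not necessarily the code that lands in the release:

```python
import re


def parse_memento_url(link_header):
    """Return the URL of the entry tagged rel="memento" in a Wayback Link header."""
    # Entries look like: <https://...>; rel="memento"; datetime="..."
    # Split on commas that are immediately followed by "<", so commas inside
    # the datetime values don't break entries apart.
    for entry in re.split(r',\s*(?=<)', link_header):
        match = re.match(r'<([^>]+)>;\s*rel="([^"]+)"', entry)
        if match and match.group(2) == "memento":
            return match.group(1)
    raise ValueError("No memento entry found in Link header")
```

Run against the `Link` value in the dump above, that returns https://web.archive.org/web/20200617042900/https://palewi.re/who-is-ben-welsh/.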
LGTM. Thank you so much for your support... I'm waiting for the new release.
Are some other issues (#14, #16 and #23) also going to be solved?
New version is out with the fix. Let me know if it doesn't work for you.
OK. This module seems to work well.
```
$ pip list | grep savepagenow
savepagenow 1.0.0
$ eval "python3 -c 'from savepagenow import*;"{capture,capture_or_cache}"(\"http://www.example.com/\")';"|&cat>log
$ cat log
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/eggplants/.pyenv/versions/3.8.0/lib/python3.8/site-packages/savepagenow/api.py", line 73, in capture
raise CachedPage(msg)
savepagenow.exceptions.CachedPage: archive.org returned a cache of this page: https://web.archive.org/web/20200712165624/https://example.com/
```
Both `capture` and `capture_or_cache` fail and return errors.

`capture`:

`capture_or_cache`: