Open mitra42 opened 6 years ago
Overview (from call with Mark 2018-08-31 on IA internals).
Process starts with a URL, looks it up in CDX to get what captures we have, and displays them in the UI. User selects dates, then we get the content from the WARC.
On command line ([ ] need to find a place I can run this, it doesn't work on dweb.me)
cdx http://www.google.com/ -p from=20180113 -p to=20180113
cdx http://www.google.com/ -p from=20180130 -p to=20180130 --fl timestamp
First finds where on Petabox, then which WARC file, then offset and range (into the compressed WARC).
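A minimal sketch of the last step (offset and range into the compressed file), assuming each record is stored as its own gzip member, as is usual for .warc.gz files. The in-memory "archive" and the function name read_record are stand-ins for illustration only, not IA code:

```python
import gzip
import io


def read_record(fileobj, offset, length):
    """Read `length` compressed bytes starting at `offset` and
    decompress them as a single gzip member."""
    fileobj.seek(offset)
    compressed = fileobj.read(length)
    return gzip.decompress(compressed)


# Stand-in for a .warc.gz: two records, each compressed separately
# and concatenated (illustrative content, not a real capture).
records = [b"WARC/1.0\r\n\r\nfirst record", b"WARC/1.0\r\n\r\nsecond record"]
members = [gzip.compress(r) for r in records]
archive = io.BytesIO(b"".join(members))

# The index would supply these two numbers for the wanted capture.
offset = len(members[0])
length = len(members[1])

second = read_record(archive, offset, length)
```

The point of the offset and length pair is that only that byte range needs to be fetched, so the rest of the WARC never has to be read or decompressed.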
Documentation … (from call with Mark 2018-08-31 on IA internals).
Also try
Notes CDX - Can be expensive for full date range … may be large for popular sites like www.google.com or www.cnn.com, but OK given the volume of traffic anyway
Possible solutions …
Will need to take a page at a date and push that into Dweb. Could either do IPFS of the whole thing, or IPFS of a new WARC made of the files needed. Latter is probably harder as it would add duplication. Solution might be to feed each file into IPFS urlstore, remember the IPFS hashes in the gateway Redis for now, and return the IPFS hash of the HTML.
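A rough sketch of that last idea, under stated assumptions: add_to_ipfs is a hypothetical callable standing in for whatever adds a URL to IPFS urlstore and returns its hash, and a plain dict stands in for the gateway Redis. Neither is an existing API of the gateway:

```python
from typing import Callable, Dict, Iterable


def push_capture(
    html_url: str,
    resource_urls: Iterable[str],
    add_to_ipfs: Callable[[str], str],  # hypothetical: URL in, IPFS hash out
    cache: Dict[str, str],              # stand-in for the gateway Redis
) -> str:
    """Add every file of a capture to IPFS once, remember the hashes,
    and return the hash of the HTML page itself."""
    for url in [html_url, *resource_urls]:
        if url not in cache:
            cache[url] = add_to_ipfs(url)
    return cache[html_url]


# Fake adder for illustration; it records calls so reuse is visible.
calls = []


def fake_add(url: str) -> str:
    calls.append(url)
    return "hash-of-" + url


cache: Dict[str, str] = {}
page = "http://web.archive.org/web/20190624190417/https://www.mitra.biz/"
first = push_capture(page, [page + "style.css"], fake_add, cache)
again = push_capture(page, [page + "style.css"], fake_add, cache)
```

Keying the cache on the capture URL means a file shared between pages is only added once, which is the duplication the new-WARC approach would introduce.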
Notes on Memento: Memento Web is federated search on top of CDX. There is a service with an API, http://timetravel.mementoweb.org, that searches IA, British Library and a few others (federated, not decentralized). API guide: http://timetravel.mementoweb.org/guide/api/ https://www.cs.odu.edu/~mln/ is the expert (Mark can intro)
Notes as I try to get my head around this beast! FROM: https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server
http://web.archive.org/cdx/search/cdx?url=mitra.biz&fastLatest=true&limit=-1&filter=statuscode:200
Gets the most recent successful capture of mitra.biz
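The same query can be built and its result parsed roughly like this. It is a sketch, not a client: the field order in CDX_FIELDS is my understanding of the CDX server's default output and should be checked against the README linked above, and the sample line is made up for illustration (the digest and length are placeholders):

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

# Assumed default output columns of the CDX server.
CDX_FIELDS = ["urlkey", "timestamp", "original", "mimetype",
              "statuscode", "digest", "length"]


def latest_ok_query(url):
    """Build the 'most recent capture with status 200' query."""
    params = {
        "url": url,
        "fastLatest": "true",
        "limit": "-1",
        "filter": "statuscode:200",
    }
    # safe=":" keeps the filter readable instead of %3A
    return CDX_ENDPOINT + "?" + urlencode(params, safe=":")


def parse_cdx_line(line):
    """Split one space-separated CDX line into a dict of named fields."""
    return dict(zip(CDX_FIELDS, line.split(" ")))


def replay_url(row):
    """Wayback replay URL for a parsed CDX row."""
    return "http://web.archive.org/web/{timestamp}/{original}".format(**row)


query = latest_ok_query("mitra.biz")

# Made-up sample line, not real CDX output.
sample = ("biz,mitra)/ 20190624190417 https://www.mitra.biz/ "
          "text/html 200 EXAMPLEDIGEST 1234")
row = parse_cdx_line(sample)
```

Fetching is left out on purpose; passing query to any HTTP client and feeding each response line to parse_cdx_line would complete it.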
Notes from meeting with Kenji today ....
https://archive.org/wayback/available?url=www.mitra.biz&statuslist=200,302
Gets URL of most recent 302/200. Today "url" is "http://web.archive.org/web/20190624190417/https://www.mitra.biz/"
Via curl (not via browser) gets header:
curl -v -o- 'http://web.archive.org/web/20190624190417/https://www.mitra.biz/'
Gets 302 to:
curl -v -o- http://web.archive.org/web/20190624190417/https://www.mitra.biz/index.html
Gets HTML with links munged
Kenji has an experimental headless browser service that returns the DOM once these are "played"; he will send me a URL
And here are some old notes from a Slack convo in Nov
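A small sketch of using the availability endpoint from code. The url and statuslist parameters are the ones from the notes above; the JSON shape (archived_snapshots, closest, available, url) is what I believe the endpoint returns and is an assumption here, and the sample response is hand-written rather than captured:

```python
import json
from urllib.parse import urlencode

AVAILABLE_ENDPOINT = "https://archive.org/wayback/available"


def available_query(url, statuslist=(200, 302)):
    """Build the availability query for the given URL and status codes."""
    params = {"url": url, "statuslist": ",".join(str(s) for s in statuslist)}
    # safe="," keeps the status list as 200,302 rather than 200%2C302
    return AVAILABLE_ENDPOINT + "?" + urlencode(params, safe=",")


def closest_snapshot_url(response_text):
    """Return the replay URL of the closest capture, or None if absent."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"]


query = available_query("www.mitra.biz")

# Hand-written sample in the assumed response shape.
sample_response = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "status": "200",
            "timestamp": "20190624190417",
            "url": "http://web.archive.org/web/20190624190417/https://www.mitra.biz/",
        }
    }
})
snapshot = closest_snapshot_url(sample_response)
```

Returning None when there is no closest capture keeps the caller from building a replay URL for a page that was never archived.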
https://waybackrebuilder.com http://waybackdownloader.com http://www.waybackmachinedownloader.com/en/ https://www.waybackdownloads.com
https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server
General idea is to add Wayback items / web archives to dweb.archive.org
Non-trivial as it uses different formats etc. from the rest of archive.org
See https://github.com/mitra42/dweb-universal/issues/2
Notes follow ... See also Notebook pg 21