internetarchive / dweb-archive

GNU Affero General Public License v3.0

Adding Web items (Wayback) #87

Open mitra42 opened 6 years ago

mitra42 commented 6 years ago

The general idea is to add Wayback items (web archives) to dweb.archive.org.

Non-trivial, as Wayback uses different formats etc. from the rest of archive.org.

See https://github.com/mitra42/dweb-universal/issues/2

Notes follow ... See also Notebook pg 21

mitra42 commented 6 years ago

Overview (from call with Mark, 2018-08-31, on IA internals).

The process starts with a URL, looks it up in CDX to get what captures we have, and displays them in the UI. The user selects dates, then we get the content from the WARC.

On the command line ([ ] need to find a place I can run this; it doesn't work on dweb.me):
cdx http://www.google.com/ -p from=20180113 -p to=20180113
cdx http://www.google.com/ -p from=20180130 -p to=20180130 --fl timestamp

First it finds where on Petabox, then which WARC file, then the offset and range (into the compressed zip).
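A minimal sketch of the last step (offset and range into the compressed file), assuming the usual WARC convention that each record is its own gzip member, so a record can be decompressed in isolation once its offset and compressed length are known. The toy file and the `read_record` helper are made up for illustration; nothing here talks to Petabox.

```python
import gzip
import io


def read_record(warc_file, offset, length):
    """Return the decompressed bytes of the gzip member at `offset`.

    `length` is the compressed size of that member, as a CDX-style index
    would report it (assumption: one gzip member per record).
    """
    warc_file.seek(offset)
    compressed = warc_file.read(length)
    return gzip.decompress(compressed)


# Toy stand-in for a .warc.gz: two independently gzipped "records".
records = [b"record one\n", b"record two\n"]
members = [gzip.compress(r) for r in records]
toy_warc = io.BytesIO(b"".join(members))

# The second record starts right after the first compressed member.
offset = len(members[0])
length = len(members[1])
second = read_record(toy_warc, offset, length)
```

Against a remote file the same offset and length would translate into an HTTP Range request instead of `seek`/`read`.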

mitra42 commented 6 years ago

Documentation … (from call with Mark, 2018-08-31, on IA internals).

Also try

mitra42 commented 6 years ago

Notes on CDX: a query can be expensive over the full date range … the result may be large for popular sites like www.google.com or www.cnn.com, but that is OK given the volume of traffic anyway.

mitra42 commented 6 years ago

Possible solutions …

Will need to take a page at a date and push that into Dweb. Could either do IPFS of the whole thing, or IPFS of a new WARC made of the files needed. The latter is probably harder, as it would add duplication. Solution might be to:
- feed each file into IPFS urlstore
- remember IPFS hashes in gateway REDIS for now
- return IPFS hash of the HTML
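A rough sketch of that flow, under heavy assumptions: `add_to_ipfs` is a hypothetical caller-supplied function standing in for an IPFS urlstore add, and a plain dict stands in for the gateway's REDIS. It only shows the "add once, remember the hash, return the HTML's hash" shape, not a real IPFS or Redis integration.

```python
def publish_capture(html_url, resource_urls, add_to_ipfs, cache):
    """Add every file of a capture and return the hash of the HTML page.

    add_to_ipfs: hypothetical callable, url -> hash string.
    cache: mapping url -> hash (a dict here; REDIS in the notes above).
    """
    for url in [html_url, *resource_urls]:
        if url not in cache:  # skip files that were already added
            cache[url] = add_to_ipfs(url)
    return cache[html_url]


# Usage with a fake adder that just records what it was asked to add.
added = []


def fake_add(url):
    added.append(url)
    return "fakehash-%d" % len(added)


cache = {}
page = "http://web.archive.org/web/20190624190417/https://www.mitra.biz/index.html"
first = publish_capture(page, [page + "#css", page + "#js"], fake_add, cache)
again = publish_capture(page, [page + "#css"], fake_add, cache)
```

Keeping the cache keyed by capture URL means a file shared between two captures is only added once, which is the duplication concern raised above.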

mitra42 commented 6 years ago

Notes on Memento: Memento Web is federated search on top of CDX. There is a service with an API, http://timetravel.mementoweb.org, which searches IA, the British Library and a few others; it is federated, not decentralized. API guide: http://timetravel.mementoweb.org/guide/api/ https://www.cs.odu.edu/~mln/ is the expert (Mark can intro).

mitra42 commented 5 years ago

Notes as I try to get my head around this beast! From: https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server

http://web.archive.org/cdx/search/cdx?url=mitra.biz&fastLatest=true&limit=-1&filter=statuscode:200

Gets the most recent successful capture of mitra.biz
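For reference, a small sketch of building that query programmatically. The parameter names are the ones in the URL above; the `latest_capture_query` helper is invented for illustration, and no request is actually sent here.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def latest_capture_query(url, status="200"):
    """Build a CDX query URL for the most recent capture with `status`."""
    params = {
        "url": url,
        "fastLatest": "true",
        "limit": "-1",  # negative limit: take results from the end
        "filter": "statuscode:" + status,
    }
    # safe=":" keeps the colon in the filter readable instead of %3A
    return CDX_ENDPOINT + "?" + urlencode(params, safe=":")


query = latest_capture_query("mitra.biz")
```

The resulting string can be handed to `urllib.request.urlopen` or curl as-is.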

mitra42 commented 5 years ago

Notes from meeting with Kenji today ....

https://archive.org/wayback/available?url=www.mitra.biz&statuslist=200,302
Gets the URL of the most recent 302/200 capture. Today "url" is "http://web.archive.org/web/20190624190417/https://www.mitra.biz/"

Via curl (not via browser) gets the header:
curl -v -o- 'http://web.archive.org/web/20190624190417/https://www.mitra.biz/'
Gets a 302 to:
curl -v -o- http://web.archive.org/web/20190624190417/https://www.mitra.biz/index.html
Gets HTML with links munged.

Kenji has an experimental headless browser service that returns the DOM once these are "played"; he will send me a URL.
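A sketch of handling the availability lookup in code, assuming the response has the `archived_snapshots` / `closest` JSON shape of the public availability API. The sample response below is hand-written around the URL quoted in these notes, not a recorded reply, and the helper names are invented.

```python
import re
from urllib.parse import urlencode

AVAILABLE_ENDPOINT = "https://archive.org/wayback/available"
# Wayback playback URL: /web/<14-digit timestamp><optional modifier>/<original URL>
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})[a-z_]*/(.+)$")


def availability_url(url, statuslist="200,302"):
    """Build the availability query used in the notes above."""
    params = {"url": url, "statuslist": statuslist}
    return AVAILABLE_ENDPOINT + "?" + urlencode(params, safe=",")


def closest_capture(response):
    """Return the playback URL of the closest capture, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"]


def split_wayback_url(wayback_url):
    """Split a playback URL into (timestamp, original URL)."""
    match = WAYBACK_RE.match(wayback_url)
    if match is None:
        raise ValueError("not a Wayback playback URL: " + wayback_url)
    return match.group(1), match.group(2)


# Hand-written sample in the assumed response shape.
sample = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "status": "200",
            "timestamp": "20190624190417",
            "url": "http://web.archive.org/web/20190624190417/https://www.mitra.biz/",
        }
    }
}
capture = closest_capture(sample)
timestamp, original = split_wayback_url(capture)
```

Splitting the playback URL this way is also one starting point for un-munging links in the returned HTML, since rewritten links carry the same prefix.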

mitra42 commented 5 years ago

And here are some old notes from a Slack convo in Nov

https://waybackrebuilder.com
http://waybackdownloader.com
http://www.waybackmachinedownloader.com/en/
https://www.waybackdownloads.com

https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server