Open lidel opened 5 years ago
It might be interesting to talk to https://archive.fo/ and https://archive.org who might have already written something very similar.
Sure, I'd be happy to talk. dweb.archive.org doesn't do it for web pages (yet), but it does mirror some of the content accessed through dweb-gateway to the IPFS HTTP API. (Not all of it, because of the combination of IPFS losing data and no error result or fallback when it can't find something.)
Note that we also use urlstore as our primary mirroring mechanism, because we have the opposite concern to you: we can't replicate 50 petabytes, so we just push the reference so that the most used items will get mirrored by IPFS. An upcoming version will also pull items via IPFS as an alternative to a direct fetch from the archive.
I also wrote dweb.mirror, a crawler specialized to crawl archive.org items (not the Wayback Machine yet) that mirrors everything to IPFS.
I'll be going to csv,conf next week. It will be another chance to talk more with @ikreymer, who is giving a talk on WARC files: https://csvconf.com/speakers/#ilya-kreymer
> It might be interesting to talk to https://archive.fo/ and https://archive.org who might have already written something very similar.
How about asking archive.org if we could help them by cooperating? I'm sure they have issues with crawling capacity.
Archive.org could provide data in IPFS when a given URL has been captured. If the capture is some days old, we could ask the user whether they would like to capture the URL (since they might be logged in, or personal information might currently be entered in a form or similar). If they agree, we share the snapshot in IPFS (somehow; I have no idea how this would technically work to make it locatable by URL and timestamp). Archive.org could pin it or download it for display on their website.
Hi, I've just recently launched https://replayweb.page/ (https://github.com/webrecorder/replayweb.page) which is a full browser-based web archive replay system ('wayback machine'), using service workers. The system can load web archives from a variety of locations, and could be expanded to support IPFS.
In fact, it can trivially work using an IPFS gateway already: https://gateway.pinata.cloud/ipfs/QmS8pUs87xvn13yymY7JFLoKfUyL2sYvFL73Mz86XFi8XX/?source=https%3A%2F%2Fgateway.pinata.cloud%2Fipfs%2FQmS8pUs87xvn13yymY7JFLoKfUyL2sYvFL73Mz86XFi8XX%2Fexamples%2Fnetpreserve-twitter.warc#view=replay&url=https%3A%2F%2Ftwitter.com%2Fnetpreserve&ts=20190603053135
It should be possible to extend it to support `ipfs://` URLs, or perhaps using the gateway could work as well (though Cloudflare specifically does not allow service workers).
ReplayWeb.page is the latest tool from Webrecorder; here's a blog post announcing it: https://webrecorder.net/2020/06/11/webrecorder-conifer-and-replayweb-page.html#introducing-replaywebpage
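To make the gateway approach concrete, here is a minimal sketch that rewrites an `ipfs://` URL to a gateway path and assembles a replay link with the same `source`, `url`, and `ts` parameters as the example link above. The function names `ipfsToGatewayUrl` and `buildReplayUrl` are hypothetical and not part of ReplayWeb.page; the URL shape is inferred only from that one example, so treat it as an assumption.

```typescript
// Hypothetical helpers; the names are illustrative and not part of ReplayWeb.page.

/** Rewrite ipfs://<cid>/<path> to <gateway>/ipfs/<cid>/<path>. */
function ipfsToGatewayUrl(ipfsUrl: string, gateway: string): string {
  const match = /^ipfs:\/\/([^/?#]+)(.*)$/.exec(ipfsUrl);
  if (match === null) {
    throw new Error(`not an ipfs:// URL: ${ipfsUrl}`);
  }
  const cid = match[1];
  const rest = match[2]; // path, query, and fragment, possibly empty
  const base = gateway.replace(/\/+$/, ""); // drop trailing slashes
  return `${base}/ipfs/${cid}${rest}`;
}

/**
 * Build a replay link shaped like the example above:
 * <player>?source=<archive URL>#view=replay&url=<page URL>&ts=<timestamp>
 */
function buildReplayUrl(
  playerUrl: string,
  archiveUrl: string,
  pageUrl: string,
  timestamp: string,
): string {
  const query = `?source=${encodeURIComponent(archiveUrl)}`;
  const fragment = `#view=replay&url=${encodeURIComponent(pageUrl)}&ts=${timestamp}`;
  return `${playerUrl}${query}${fragment}`;
}
```

Calling `ipfsToGatewayUrl` once for the player directory and once for the WARC file, then passing both to `buildReplayUrl`, yields a link of the same form as the pinata one quoted earlier.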
Relevant demo/status update of @ikreymer's work: https://www.youtube.com/watch?v=evcSETnTBf0
This proposal touches this topic:
https://discuss.ipfs.io/t/ipfs-records-for-urn-uri-resolving-via-a-dht/10456/4
This is a meta-issue tracking related work and discussions (moved from https://github.com/ipfs-shipyard/ipfs-companion/issues/96).
Feasible
More Design Work Required
Saving reproducible snapshot of entire page load
Automatic mirroring of standard websites to IPFS as you browse them (https://github.com/ipfs-shipyard/ipfs-companion/issues/535)
IMMUTABLE assets: very limited feasibility, so far only two types of immutable resources on the web exist:
Cache-Control: public, (..) immutable (mapping URL→CID)
MUTABLE assets: what if we add every page to IPFS and store a mapping between URL and CID, so that if a page disappears, we could fall back to the IPFS version?
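The two ideas above (spotting responses marked `immutable`, and keeping a URL→CID mapping to fall back on) could look roughly like the sketch below. Everything here is an assumption for illustration: `isImmutable`, `UrlCidIndex`, and the `originReachable` flag are hypothetical names, the index lives only in memory, and nothing is taken from ipfs-companion itself.

```typescript
// Hypothetical sketch; none of these names come from ipfs-companion.

/** True if a Cache-Control header value carries the `immutable` directive. */
function isImmutable(cacheControl: string | null): boolean {
  if (cacheControl === null) {
    return false;
  }
  return cacheControl
    .split(",")
    .map((directive) => directive.trim().toLowerCase())
    .some((directive) => directive === "immutable");
}

/** In-memory URL -> CID index; a real extension would need to persist this. */
class UrlCidIndex {
  private readonly entries = new Map<string, string>();

  /** Remember which CID holds the snapshot of a URL (latest one wins). */
  record(url: string, cid: string): void {
    this.entries.set(url, cid);
  }

  /**
   * Decide where to load from: the original URL while the origin is
   * reachable, otherwise the recorded IPFS copy (undefined if none exists).
   */
  resolve(url: string, originReachable: boolean): string | undefined {
    if (originReachable) {
      return url;
    }
    const cid = this.entries.get(url);
    return cid === undefined ? undefined : `ipfs://${cid}`;
  }
}
```

Keeping only the latest CID per URL sidesteps the expiration question for mutable pages; storing a history of (timestamp, CID) pairs would be the natural extension if older snapshots should stay addressable.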
Other notes
Prior art: existing browser extensions
Related Discussions
2016-03-26
IRC log about mirroring SRI2IPFS
```
165958 geir_ │ lgierth: The web sites would have to link to ipfs content for this plugin to work. What i propose is a proxy that works like a transparent proxy and puts content into ipfs if it's not already there
170124 ed_t │ anyone know anything about ipfs-boards
170141 ed_t │ it keeps telling me I am in limited mode
170202 ed_t │ a full ipfs 0.40-rc3 node is running on localhost:5001
170217 ed_t │ but it does not seem to see it using the demo link
170228 +lgierth │ geir_: ah got what you wanna do -- i'm not sure you can easily just rewrite anything
170253 +lgierth │ for completely static pages, yes, but for slightly more dynamic stuff?
170303 +lgierth │ i'll be back in a bit, getting some coffee
170422 geir_ │ lgierth: I mean only for the static stuff like images, libs and so on. Should be pretty strait forward to implement. And a big bandwidth save for big networks
171542 lidel │ geir_, we are planning to add "host to ipfs" feature to the addon
171614 lidel │ when that is done, it should be easy to add option to automatically add every visited page
171634 lidel │ not sure how addon would do lookups tho
171734 lidel │ (meaning, how do i know the multihash of the page, how do we handle ipfs-cache expiration when page gets updated, etc)
171831 geir_ │ lidel: I see, thanks for the info. I still like the idea of a transparent proxy so every user/device on the network will use the "cdn" automatically
171852 lidel │ perhaps we could start with mirroring static assets that have SRI hash (https://www.srihash.org/)
171920 lidel │ and come up with a way for doing SRI2IPFS lookups
```
2015+
2018-01-14
2018-03-08
2018-07-09
2018-07-23
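As a rough illustration of the SRI2IPFS lookup idea from the 2016 IRC log, the sketch below turns a single SRI token such as `sha256-<base64 digest>` into a hex-encoded multihash that could serve as a lookup key. The names `decodeBase64`, `multihashPrefix`, and `sriToMultihashHex` are hypothetical. The sketch handles one token only (no option suffixes, no multiple hashes), and it supports just sha256 and sha512. Note also that an SRI digest covers the raw file bytes, whereas a CID generally hashes IPFS's own block encoding, so the result is a lookup key rather than a CID; a separate mapping from key to CID is still assumed.

```typescript
// Hypothetical sketch of an "SRI2IPFS" lookup key; all names are illustrative.

const BASE64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/** Decode standard base64 (padding optional) into an array of byte values. */
function decodeBase64(text: string): number[] {
  const clean = text.replace(/=+$/, "");
  const bytes: number[] = [];
  let buffer = 0;
  let bits = 0;
  for (let i = 0; i < clean.length; i++) {
    const value = BASE64_ALPHABET.indexOf(clean.charAt(i));
    if (value < 0) {
      throw new Error(`invalid base64 character at index ${i}`);
    }
    // Keep only the low 16 bits; at most 14 are ever pending.
    buffer = ((buffer << 6) | value) & 0xffff;
    bits += 6;
    if (bits >= 8) {
      bits -= 8;
      bytes.push((buffer >> bits) & 0xff);
    }
  }
  return bytes;
}

/** Multihash prefix [function code, digest length] for an SRI algorithm name. */
function multihashPrefix(algorithm: string): number[] | undefined {
  switch (algorithm) {
    case "sha256":
      return [0x12, 0x20]; // sha2-256, 32-byte digest
    case "sha512":
      return [0x13, 0x40]; // sha2-512, 64-byte digest
    default:
      return undefined;
  }
}

/** Convert one SRI token ("sha256-<base64>") into a hex-encoded multihash. */
function sriToMultihashHex(integrity: string): string {
  const dash = integrity.indexOf("-");
  if (dash < 0) {
    throw new Error("expected '<algorithm>-<base64 digest>'");
  }
  const algorithm = integrity.slice(0, dash);
  const prefix = multihashPrefix(algorithm);
  if (prefix === undefined) {
    throw new Error(`unsupported algorithm: ${algorithm}`);
  }
  const digest = decodeBase64(integrity.slice(dash + 1));
  if (digest.length !== prefix[1]) {
    throw new Error("digest length does not match the algorithm");
  }
  return prefix
    .concat(digest)
    .map((byte) => (byte < 16 ? "0" : "") + byte.toString(16))
    .join("");
}
```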