ipfs / in-web-browsers

Tracking the endeavor towards getting web browsers to natively support IPFS and content-addressing
https://docs.ipfs.tech/how-to/address-ipfs-on-web/
MIT License

Mirroring Web to IPFS #94

Open lidel opened 5 years ago

lidel commented 5 years ago

This is a meta-issue tracking related work and discussions (moved from https://github.com/ipfs-shipyard/ipfs-companion/issues/96).

Feasible

More Design Work Required

Saving reproducible snapshot of entire page load

Automatic mirroring of standard websites to IPFS as you browse them (https://github.com/ipfs-shipyard/ipfs-companion/issues/535)


Related Discussions

2016-03-26

IRC log about mirroring SRI2IPFS

```
165958 geir_    │ lgierth: The web sites would have to link to ipfs content for this plugin to work. What i propose is a proxy that works like a transparent proxy and puts content into ipfs if it's not already there
170124 ed_t     │ anyone know anything about ipfs-boards
170141 ed_t     │ it keeps telling me I am in limited mode
170202 ed_t     │ a full ipfs 0.40-rc3 node is running on localhost:5001
170217 ed_t     │ but it does not seem to see it using the demo link
170228 +lgierth │ geir_: ah got what you wanna do -- i'm not sure you can easily just rewrite anything
170253 +lgierth │ for completely static pages, yes, but for slightly more dynamic stuff?
170303 +lgierth │ i'll be back in a bit, getting some coffee
170422 geir_    │ lgierth: I mean only for the static stuff like images, libs and so on. Should be pretty strait forward to implement. And a big bandwidth save for big networks
171542 lidel    │ geir_, we are planning to add "host to ipfs" feature to the addon
171614 lidel    │ when that is done, it should be easy to add option to automatically add every visited page
171634 lidel    │ not sure how addon would do lookups tho
171734 lidel    │ (meaning, how do i know the multihash of the page, how do we handle ipfs-cache expiration when page gets updated, etc)
171831 geir_    │ lidel: I see, thanks for the info. I still like the idea of a transparent proxy so every user/device on the network will use the "cdn" automatically
171852 lidel    │ perhaps we could start with mirroring static assets that have SRI hash (https://www.srihash.org/)
171920 lidel    │ and come up with a way for doing SRI2IPFS lookups
```
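To make the SRI2IPFS idea from the log a bit more concrete, here is a minimal sketch (not something proposed in the thread) of how an SRI value could be re-encoded as a CIDv1 with the `raw` codec. The function names `sri_to_cid` and `sri_for` are hypothetical. An important assumption: the resulting CID only matches what IPFS would produce if the asset is stored as a single raw block hashed with the same algorithm. Larger files are chunked into a DAG and get a different CID, which is part of the lookup problem mentioned above.

```python
import base64
import hashlib

# Multihash function codes from the multicodec table.
MULTIHASH_CODES = {"sha256": 0x12, "sha384": 0x20, "sha512": 0x13}
RAW_CODEC = 0x55  # multicodec code for "raw" binary blocks


def sri_to_cid(sri: str) -> str:
    """Re-encode an SRI value such as 'sha256-<base64>' as a CIDv1 (raw, base32).

    Assumption: the asset is stored in IPFS as one raw block, so the digest of
    the file content is also the digest of the block.
    """
    algo, _, b64 = sri.strip().partition("-")
    if algo not in MULTIHASH_CODES:
        raise ValueError(f"unsupported SRI algorithm: {algo!r}")
    digest = base64.b64decode(b64)
    if len(digest) != hashlib.new(algo).digest_size:
        raise ValueError("digest length does not match algorithm")
    # multihash = <hash code><digest length><digest>; both fit in one byte here.
    multihash = bytes([MULTIHASH_CODES[algo], len(digest)]) + digest
    # CIDv1 = <version 1><content codec><multihash>
    cid_bytes = bytes([0x01, RAW_CODEC]) + multihash
    # Multibase prefix "b" means lowercase base32 without padding.
    return "b" + base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")


def sri_for(data: bytes, algo: str = "sha256") -> str:
    """Helper for experimenting: compute the SRI value of some bytes."""
    digest = hashlib.new(algo, data).digest()
    return f"{algo}-" + base64.b64encode(digest).decode("ascii")


if __name__ == "__main__":
    print(sri_to_cid(sri_for(b"console.log('hello');")))
```

A mirroring addon or proxy could try such a CID against a local node or gateway and fall back to plain HTTP when nothing is found. SRI values are often sha384, while IPFS tooling typically defaults to sha2-256, so a real lookup would likely still need an index or re-hashing step.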

2015+

2018-01-14

2018-03-08

2018-07-09

2018-07-23

LoveIsGrief commented 5 years ago

It might be interesting to talk to https://archive.fo/ and https://archive.org who might have already written something very similar.

mitra42 commented 5 years ago

Sure, I'd be happy to talk. dweb.archive.org doesn't do it for web pages (yet), but it does mirror some of the content accessed through dweb-gateway to the IPFS HTTP API. (Not all of it, because of the combination of IPFS losing data and there being no error result or fallback when it can't find something.)

Note that we also use urlstore as our primary mirroring mechanism, because we have the opposite concern to you: we can't replicate 50 petabytes, so we just push the reference so that the most-used items will get mirrored by IPFS. An upcoming version will also pull items via IPFS as an alternative to a direct fetch from the archive.
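As a rough illustration of "pushing the reference" (a sketch, not the actual dweb.archive.org code), the snippet below builds a request to the experimental `urlstore/add` endpoint of a local IPFS node's HTTP API. The node address, the helper names, and the example URL are assumptions; the endpoint requires the urlstore experiment to be enabled on the node and may be deprecated or absent in newer releases.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: a local IPFS node exposes its HTTP API on the default port.
API = "http://127.0.0.1:5001/api/v0"


def urlstore_add_request(url: str, pin: bool = True) -> Request:
    """Build (but do not send) a POST request asking the node to reference `url`."""
    query = urlencode({"arg": url, "pin": str(pin).lower()})
    return Request(f"{API}/urlstore/add?{query}", method="POST")


def urlstore_add(url: str) -> dict:
    """Send the request and return the node's decoded JSON response."""
    with urlopen(urlstore_add_request(url)) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical URL, used only to show the request that would be sent.
    req = urlstore_add_request("https://example.org/files/item.pdf")
    print(req.get_method(), req.full_url)
```

With urlstore the node records where the bytes live rather than copying them, which fits the "we can't replicate everything" constraint described above.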

I also wrote dweb.mirror, a crawler specialized for archive.org items (not the Wayback Machine yet) that mirrors everything to IPFS.

jimpick commented 5 years ago

I'll be going to csv,conf next week. It will be another chance to talk more with @ikreymer, who is giving a talk on WARC files: https://csvconf.com/speakers/#ilya-kreymer

RubenKelevra commented 4 years ago

> It might be interesting to talk to https://archive.fo/ and https://archive.org who might have already written something very similar.

How about asking archive.org whether we could help them by cooperating? I'm sure they have issues with crawling capacity.

Archive.org could provide data in IPFS when a given URL has been captured. If the capture is some days old, we could ask the user whether they would like to capture the URL again (asking first, since they might be logged in, or personal information might currently be entered in a form or similar). If they agree, we share the snapshot in IPFS (somehow; I have no idea how this would technically work to make it locatable by URL and timestamp). archive.org could then pin it or download it, for displaying it on their website.

ikreymer commented 4 years ago

Hi, I've just recently launched https://replayweb.page/ (https://github.com/webrecorder/replayweb.page) which is a full browser-based web archive replay system ('wayback machine'), using service workers. The system can load web archives from a variety of locations, and could be expanded to support IPFS.

In fact, it can trivially work using an IPFS gateway already: https://gateway.pinata.cloud/ipfs/QmS8pUs87xvn13yymY7JFLoKfUyL2sYvFL73Mz86XFi8XX/?source=https%3A%2F%2Fgateway.pinata.cloud%2Fipfs%2FQmS8pUs87xvn13yymY7JFLoKfUyL2sYvFL73Mz86XFi8XX%2Fexamples%2Fnetpreserve-twitter.warc#view=replay&url=https%3A%2F%2Ftwitter.com%2Fnetpreserve&ts=20190603053135
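The link above follows a simple pattern: the app is loaded from the gateway, the archive location goes in a percent-encoded `source` query parameter, and the page to replay plus its timestamp go in the URL fragment. A small sketch of building such a link, inferred only from this one example (the helper name `replay_url` is hypothetical, and other parameters may exist):

```python
from urllib.parse import quote


def replay_url(app_url: str, archive_url: str, page_url: str, timestamp: str) -> str:
    """Compose a replay link shaped like the example in this comment.

    app_url:     where the replay app itself is served from
    archive_url: location of the WARC file to load
    page_url:    the archived page to display
    timestamp:   capture time, e.g. "20190603053135"
    """
    source = quote(archive_url, safe="")  # encode ":" and "/" as well
    target = quote(page_url, safe="")
    return f"{app_url}?source={source}#view=replay&url={target}&ts={timestamp}"


if __name__ == "__main__":
    app = "https://gateway.pinata.cloud/ipfs/QmS8pUs87xvn13yymY7JFLoKfUyL2sYvFL73Mz86XFi8XX/"
    print(replay_url(
        app,
        app + "examples/netpreserve-twitter.warc",
        "https://twitter.com/netpreserve",
        "20190603053135",
    ))
```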

It should be possible to extend it to support ipfs:// URLs, or perhaps using the gateway could work as well (though Cloudflare specifically does not allow service workers).

ReplayWeb.page is the latest tool from Webrecorder, here's also a blog post announcing it: https://webrecorder.net/2020/06/11/webrecorder-conifer-and-replayweb-page.html#introducing-replaywebpage

lidel commented 3 years ago

Relevant demo/status update of @ikreymer's work: https://www.youtube.com/watch?v=evcSETnTBf0

RubenKelevra commented 3 years ago

This proposal touches this topic:

https://discuss.ipfs.io/t/ipfs-records-for-urn-uri-resolving-via-a-dht/10456/4