Mirroring Web to IPFS - Githubissues

lidel commented 5 years ago

This is a meta-issue tracking related work and discussions (moved from https://github.com/ipfs-shipyard/ipfs-companion/issues/96).

Feasible

[x] Image Rehosting via HTTP API (ipfs-companion/#599)
[ ] Creating simplified website snapshot:
- ipfs-companion/#91
- 2read extension is a great poc!

More Design Work Required

Saving reproducible snapshot of entire page load

This includes all JS/CSS/XHR and other assets that were loaded by the page, respecting Origin and other constraints that could impact page load
The only standard in web archiving at the moment is the ISO WARC file format:
- https://www.iso.org/standard/68004.html, https://en.wikipedia.org/wiki/Web_ARChive
- it specifies raw data captured from the web. However, the WARC files often lack any context or metadata about how this data was captured
- WABAC.js proof-of-concept web archive replay system implemented entirely via Service Workers
- https://github.com/webrecorder/wabac.js hosted at https://wab.ac/
- supports replay of WARC and HAR files
- could probably be extended to support signed exchanges (below)
We don't have the means to do it yet, but Bundles from WebPackage (https://github.com/ipfs/in-web-browsers/issues/121) could also unlock the archival use case:
- https://tools.ietf.org/html/draft-yasskin-webpackage-use-cases-01#section-2.2.10
:point_right: :point_right: Update 2021: High fidelity solution exists! And can be used with IPFS :partying_face:
- https://replayweb.page/ (https://github.com/webrecorder/replayweb.page) which is a full browser-based web archive replay system ('wayback machine'), using service workers
- https://replayweb.page/docs/
- Relevant demo/status update of @ikreymer's work: https://www.youtube.com/watch?v=evcSETnTBf0
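
To make the "raw data captured from the web" point about WARC concrete, here is a minimal sketch of what a single WARC/1.0 record looks like and how its header block can be read. The record contents (UUID, URL, payload) are made up for illustration, and the parser ignores details such as header line folding and compressed records, so treat it as a reading aid rather than a conforming implementation.

```python
from typing import Dict, Tuple


def build_sample_record() -> bytes:
    """Build one illustrative WARC/1.0 'response' record (all values are made up)."""
    payload = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: text/html\r\n"
        b"\r\n"
        b"<html>hello</html>"
    )
    header = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        "WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
        "WARC-Date: 2019-06-03T05:31:35Z\r\n"
        "WARC-Target-URI: https://example.com/\r\n"
        "Content-Type: application/http; msgtype=response\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # A record ends with two CRLFs after the content block.
    return header + payload + b"\r\n\r\n"


def parse_warc_record(raw: bytes) -> Tuple[str, Dict[str, str], bytes]:
    """Split one uncompressed record into (version line, named fields, content block)."""
    head, sep, rest = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line terminating the WARC header")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    if not version.startswith("WARC/"):
        raise ValueError("not a WARC record")
    fields: Dict[str, str] = {}
    for line in lines[1:]:
        # Split on the first colon only, so URLs in values stay intact.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    length = int(fields["Content-Length"])
    return version, fields, rest[:length]
```

Note that the record stores the captured HTTP response verbatim, while nothing in it says which browser, script state, or user session produced the capture. That is the missing context mentioned above.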

Automatic mirroring of standard websites to IPFS as you browse them (https://github.com/ipfs-shipyard/ipfs-companion/issues/535)

IMMUTABLE assets: very limited feasibility; so far only two types of immutable resources exist on the web:
- JS, CSS etc marked with SRI hash (Subresource Integrity) (mapping SRI→CID) (see discussion from 2016-03-26 below, and https://github.com/ipfs/in-web-browsers/issues/214 for future work)
- URLs for things explicitly marked as immutable via Cache-Control: public, (..) immutable (mapping URL→CID)
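
As a rough illustration of the SRI→CID idea, the sketch below wraps the SHA-256 digest from an SRI string into a CIDv1 with the `raw` codec, and checks a `Cache-Control` value for the `immutable` directive. The function names are hypothetical. An important assumption: the resulting CID only matches what IPFS produces when the asset is stored as a single raw block; larger files are chunked into a DAG whose root CID cannot be derived from the SRI digest alone.

```python
import base64
import hashlib


def sri_to_cid(sri: str) -> str:
    """Map 'sha256-<base64 digest>' to a CIDv1 (raw codec, base32). Hypothetical helper."""
    algo, _, b64_digest = sri.partition("-")
    if algo != "sha256":
        raise ValueError("only sha256 is handled in this sketch")
    digest = base64.b64decode(b64_digest)
    if len(digest) != 32:
        raise ValueError("unexpected digest length for sha256")
    # multihash = <hash function code 0x12 (sha2-256)> <length 0x20> <digest>
    multihash = bytes([0x12, 0x20]) + digest
    # CIDv1 = <version 0x01> <codec 0x55 (raw)> <multihash>
    cid_bytes = bytes([0x01, 0x55]) + multihash
    # multibase prefix 'b' means lowercase base32 without padding
    return "b" + base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")


def is_marked_immutable(cache_control: str) -> bool:
    """True if the Cache-Control header value carries the 'immutable' directive."""
    directives = {d.strip().lower() for d in cache_control.split(",")}
    return "immutable" in directives


# Usage: compute the SRI string for some bytes, then derive the CID from it.
body = b""
sri = "sha256-" + base64.b64encode(hashlib.sha256(body).digest()).decode("ascii")
cid = sri_to_cid(sri)
```

The second case (URL→CID for `immutable` responses) still needs a lookup table, because nothing in the header itself identifies the content by hash.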
MUTABLE assets: what if we add every page to IPFS and store a mapping between URL and CID, so that if the page disappears we can fall back to the IPFS version?
- a can of worms: a safe version would be like web.archive.org, but limited to a local machine. Sharing the cache with other people would require a centralized mapping service (single point of failure, vector for privacy leaks)
  - So what is needed to make it "right"?
    - keep it simple but robust: no http, no centralization, no single point of failure
    - Ideally, URL2IPFS lookups would not rely on centralized index.
    - rough idea (https://github.com/ipfs-shipyard/ipfs-companion/issues/535#issuecomment-407046442): what if we create pubsub-based room per URL? for example:
      - When you open a website, you subscribe to pubsub room unique for that URL
      - If the pubsub room has entries under the "keepalive" threshold, grab the latest one
      - If the room is empty or the keepalive timeout is hit, fall back to HTTP, but in the background add the HTTP page to IPFS and announce the updated hash on pubsub (with a new timestamp) for the next visitor
      - There are still pubsub performance and privacy problems to solve (eg. publishing banking pages), but at least we don't rely on HTTP server anymore.
        
        https://github.com/ipfs-shipyard/ipfs-companion/issues/535#issuecomment-407767407:
        
        I feel the safe way to do it is to just follow the semantics of Cache-Control and max-age (if present). This header is already respected by browsers and website owners and could be parsed as an indicator of whether a specific asset can be cached in IPFS. AFAIK all browsers (well, at least Chrome, Firefox) cache HTTPS content by default for some arbitrary time (if Cache-Control is missing), unless explicitly told not to cache via the Cache-Control header.
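
The pubsub-room flow and the Cache-Control suggestion above can be sketched together as follows. Everything here is a stand-in: `Rooms` is an in-memory dictionary rather than real IPFS pubsub, `fetch_http_and_add` is a hypothetical callback that would fetch over HTTP and add the result to IPFS, and using `max-age` as the keepalive threshold is one possible reading of the two ideas, not something the thread settled on. Trust and privacy problems (who is allowed to announce, what must never be announced) are not addressed.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Announcement:
    cid: str
    timestamp: float  # seconds, as claimed by the announcing peer


@dataclass
class Rooms:
    """In-memory stand-in for 'one pubsub room per URL' (hypothetical)."""
    entries: Dict[str, List[Announcement]] = field(default_factory=dict)

    def announce(self, url: str, ann: Announcement) -> None:
        self.entries.setdefault(url, []).append(ann)

    def latest(self, url: str) -> Optional[Announcement]:
        anns = self.entries.get(url, [])
        return max(anns, key=lambda a: a.timestamp) if anns else None


def max_age_seconds(cache_control: Optional[str]) -> Optional[int]:
    """Return max-age in seconds, 0 for no-store/no-cache, None if unspecified."""
    if not cache_control:
        return None
    directives = [d.strip().lower() for d in cache_control.split(",")]
    if "no-store" in directives or "no-cache" in directives:
        return 0
    for d in directives:
        m = re.fullmatch(r"max-age=(\d+)", d)
        if m:
            return int(m.group(1))
    return None


def resolve(url: str, rooms: Rooms, now: float, keepalive: float,
            fetch_http_and_add: Callable[[str], str]) -> Tuple[str, str]:
    """Return (cid, source), where source is 'ipfs' or 'http'."""
    latest = rooms.latest(url)
    if latest is not None and now - latest.timestamp < keepalive:
        return latest.cid, "ipfs"  # a fresh enough announcement exists
    # Room empty or stale: fall back to HTTP, then announce for the next visitor.
    cid = fetch_http_and_add(url)
    rooms.announce(url, Announcement(cid, now))
    return cid, "http"
```

A caller could pass `max_age_seconds(header)` as `keepalive` when it returns a positive number, and skip the pubsub path entirely when it returns 0.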
Other notes
- The "webpackage" standard proposal surfaced recently; among other things, it aims to address the website snapshotting use case in a safe and reproducible manner:
  - webpackage: Save and share a web page (Use Case)
    - Sounds super relevant to what we want as the endgame here
Prior art: existing browser extensions
- Arweave: https://chrome.google.com/webstore/detail/arweave/iplppiggblloelhoglpmkmbinggcaaoc?hl=en-GB
- Archiveror: https://chrome.google.com/webstore/detail/archiveror/cpjdnekhgjdecpmjglkcegchhiijadpb
- https://github.com/inkandswitch/xcrpt - PoC browser extension that produces a page snapshot with a note at the top of the page and publishes it to IPFS
  - Demo: https://www.bonappetit.com/recipe/kimchi-jjigae gets saved as https://ipfs.io/ipfs/QmS1pj7nUBvCSTaMjSrtH1EYfhWWpr4sZFyfdi7zfAm5Wc/

Related Discussions

2016-03-26

IRC log about mirroring SRI2IPFS

```
165958 geir_ │ lgierth: The web sites would have to link to ipfs content for this plugin to work. What i propose is a proxy that works like a transparent proxy and puts content into ipfs if it's not already there
170124 ed_t │ anyone know anything about ipfs-boards
170141 ed_t │ it keeps telling me I am in limited mode
170202 ed_t │ a full ipfs 0.40-rc3 node is running on localhost:5001
170217 ed_t │ but it does not seem to see it using the demo link
170228 +lgierth │ geir_: ah got what you wanna do -- i'm not sure you can easily just rewrite anything
170253 +lgierth │ for completely static pages, yes, but for slightly more dynamic stuff?
170303 +lgierth │ i'll be back in a bit, getting some coffee
170422 geir_ │ lgierth: I mean only for the static stuff like images, libs and so on. Should be pretty strait forward to implement. And a big bandwidth save for big networks
171542 lidel │ geir_, we are planning to add "host to ipfs" feature to the addon
171614 lidel │ when that is done, it should be easy to add option to automatically add every visited page
171634 lidel │ not sure how addon would do lookups tho
171734 lidel │ (meaning, how do i know the multihash of the page, how do we handle ipfs-cache expiration when page gets updated, etc)
171831 geir_ │ lidel: I see, thanks for the info. I still like the idea of a transparent proxy so every user/device on the network will use the "cdn" automatically
171852 lidel │ perhaps we could start with mirroring static assets that have SRI hash (https://www.srihash.org/)
171920 lidel │ and come up with a way for doing SRI2IPFS lookups
```

2015+

IPFS as a backend to a web archiving - https://github.com/ipfs/archives/issues/28

2018-01-14

https://discuss.ipfs.io/t/web-browser-with-integrated-ipfs-node-support-for-browser-cache/1799/5

2018-03-08

[Suggestion] : IPFS browser extension as lite-node? https://github.com/ipfs/ipfs/issues/310

2018-07-09

https://discuss.ipfs.io/t/mirroring-standard-websites-to-ipfs-as-you-browse-them/3355

2018-07-23

http->ipfs translator proposal https://github.com/ipfs-shipyard/ipfs-companion/issues/535
webpackage standard draft
- https://github.com/WICG/webpackage/blob/master/explainer.md#save-and-share-a-web-page
- https://wicg.github.io/webpackage/draft-yasskin-webpackage-use-cases.html#snapshot

LoveIsGrief commented 5 years ago

It might be interesting to talk to https://archive.fo/ and https://archive.org who might have already written something very similar.

mitra42 commented 5 years ago

Sure, I'd be happy to talk. dweb.archive.org doesn't do it for web pages (yet) but does mirror some of the content accessed through dweb-gateway to the IPFS HTTP API. (Not all of it, because of the combination of IPFS losing data and no error result/fallback when it can't find something.)

Note that we also use urlstore as our primary mirroring mechanism, because we have the opposite concern to you, i.e. we can't replicate 50 petabytes, so we just push the reference so that the most used items will get mirrored by IPFS. An upcoming version will also pull items via IPFS as an alternative to a direct fetch from the archive.

I also wrote dweb.mirror which is a crawler, specialized to crawl archive.org items (not wayback machine yet) and that mirrors everything to IPFS.

jimpick commented 5 years ago

I'll be going to csv,conf next week. It will be another chance to talk more with @ikreymer, who is giving a talk on WARC files: https://csvconf.com/speakers/#ilya-kreymer

RubenKelevra commented 4 years ago

> It might be interesting to talk to https://archive.fo/ and https://archive.org who might have already written something very similar.

How about asking archive.org if we could help them by cooperating, I'm sure they have issues with crawling capacity?

Archive.org could provide data in IPFS when a given URL has been captured. If the capture is some days old, we could ask the user whether they would like to capture the URL (since they might be logged in, or personal information might currently be entered in a form or similar). If they agree, we share the snapshot in IPFS (somehow; I have no idea how this would technically work to make it locatable by URL and timestamp). archive.org could pin it or download it for display on their website.

ikreymer commented 4 years ago

Hi, I've just recently launched https://replayweb.page/ (https://github.com/webrecorder/replayweb.page) which is a full browser-based web archive replay system ('wayback machine'), using service workers. The system can load web archives from a variety of locations, and could be expanded to support IPFS.

In fact, it can trivially work using an IPFS gateway already: https://gateway.pinata.cloud/ipfs/QmS8pUs87xvn13yymY7JFLoKfUyL2sYvFL73Mz86XFi8XX/?source=https%3A%2F%2Fgateway.pinata.cloud%2Fipfs%2FQmS8pUs87xvn13yymY7JFLoKfUyL2sYvFL73Mz86XFi8XX%2Fexamples%2Fnetpreserve-twitter.warc#view=replay&url=https%3A%2F%2Ftwitter.com%2Fnetpreserve&ts=20190603053135

It should be possible to extend to support ipfs:// urls, or perhaps using the gateway could work as well (though cloudflare specifically does not allow service workers).
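
For reference, the gateway link above follows a recognizable pattern: the viewer is loaded from a gateway path, the archive location goes into a `source` query parameter, and the page to replay goes into the URL fragment. The sketch below rebuilds that link from its parts. The parameter names are inferred from this one example link only, and `replay_url` is a hypothetical helper, so check the ReplayWeb.page docs before relying on it.

```python
from urllib.parse import quote


def replay_url(viewer_base: str, archive_url: str, page_url: str, ts: str) -> str:
    """Compose a replay link in the shape of the example above (inferred, not official)."""
    return (
        f"{viewer_base}?source={quote(archive_url, safe='')}"
        f"#view=replay&url={quote(page_url, safe='')}&ts={ts}"
    )


# Values taken from the example link in the comment above.
base = "https://gateway.pinata.cloud/ipfs/QmS8pUs87xvn13yymY7JFLoKfUyL2sYvFL73Mz86XFi8XX/"
link = replay_url(
    viewer_base=base,
    archive_url=base + "examples/netpreserve-twitter.warc",
    page_url="https://twitter.com/netpreserve",
    ts="20190603053135",
)
```

Because both the viewer and the WARC file are addressed through the same CID here, the whole replay (viewer plus archive) is fetched from IPFS via the gateway.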

ReplayWeb.page is the latest tool from Webrecorder, here's also a blog post announcing it: https://webrecorder.net/2020/06/11/webrecorder-conifer-and-replayweb-page.html#introducing-replaywebpage

lidel commented 3 years ago

Relevant demo/status update of @ikreymer's work: https://www.youtube.com/watch?v=evcSETnTBf0

RubenKelevra commented 3 years ago

This proposal touches this topic:

https://discuss.ipfs.io/t/ipfs-records-for-urn-uri-resolving-via-a-dht/10456/4

ipfs / in-web-browsers

Mirroring Web to IPFS #94
