skliarie opened this issue 6 years ago
You can use https://hash-archive.org/
No, I cannot use hash-archive.org, as it lacks the necessary API. Also, the implementation requires close cooperation with the site owner, and currently I don't see hash-archive doing this. I will drop them a note though.
The API is not documented: https://github.com/btrask/hash-archive/blob/9d97fb7b87674094ff36b4bf4ed1f46f05f734fa/src/server.c#L129
- `/api/enqueue/{url}`
- `/api/history/{url}`
- `/api/sources/{hash}`
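For context, a minimal TypeScript sketch of how those endpoints could be called; since the API is undocumented, the response shape is unknown and left as `unknown` here (an assumption, not a documented contract):

```typescript
// Sketch only: builds the endpoint URL and returns whatever JSON the server sends back.
async function hashArchive(endpoint: 'enqueue' | 'history', url: string): Promise<unknown> {
  const res = await fetch(`https://hash-archive.org/api/${endpoint}/${encodeURIComponent(url)}`)
  if (!res.ok) throw new Error(`hash-archive returned HTTP ${res.status}`)
  return res.json()
}

// e.g. await hashArchive('history', 'https://example.com/file.iso')
```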
> obviously static files, such as .iso, .mp3, images...
Sadly, the only "static" files are ones with an SRI hash or a `Cache-Control: public, (..) immutable` header.
Everything else on the HTTP-based web is implicitly mutable (by which I mean we have to assume it can change between two requests).
@skliarie Some notes on this can be found at https://github.com/ipfs-shipyard/ipfs-companion/issues/96#issue-143726690; re-pasting them here:
- Automatic mirroring of resources
  - IMMUTABLE assets: very limited feasibility, so far only two types of immutable resources exist on the web:
    - JS, CSS etc. marked with an SRI hash (Subresource Integrity) (mapping SRI→CID)
    - URLs explicitly marked as immutable via `Cache-Control: public, (..) immutable` (mapping URL→CID)
  - MUTABLE assets: what if we add every page to IPFS, storing a mapping between URL and CID, so that if a page disappears we could fall back to the IPFS version?
    - a can of worms: a safe version would be like web.archive.org but limited to a local machine; sharing the cache with other people would require a centralized mapping service (single point of failure, vector for privacy leaks)
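For illustration only, a minimal TypeScript sketch of the IMMUTABLE-assets case above, assuming a local node reachable via `ipfs-http-client`; the `mapping` store and the function names are hypothetical:

```typescript
import { create } from 'ipfs-http-client'

// Assumes a local IPFS node with its HTTP API on the default port.
const ipfs = create({ url: 'http://127.0.0.1:5001' })

// Hypothetical local store for SRI→CID and URL→CID mappings.
const mapping = new Map<string, string>()

// A subresource is considered immutable if it carries an SRI hash
// or its response is explicitly marked immutable by the origin.
function isImmutable(sri: string | null, cacheControl: string | null): boolean {
  if (sri && sri.trim().length > 0) return true
  return cacheControl !== null && /\bimmutable\b/i.test(cacheControl)
}

async function mirrorIfImmutable(
  url: string,
  body: Uint8Array,
  sri: string | null,
  cacheControl: string | null
): Promise<string | null> {
  if (!isImmutable(sri, cacheControl)) return null
  const { cid } = await ipfs.add(body)   // content-addressed copy on the local node
  const key = sri ?? url                 // key by SRI if present, otherwise by URL
  mapping.set(key, cid.toString())
  return cid.toString()
}
```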
To reiterate, mapping mutable content under some URL to an IPFS CID requires a centralized index, which introduces a huge vector for privacy leaks and a single point of failure. Not to mention the index would basically be DDoSed if our user base grows too fast, and if we base lookups on HTTP requests, that will degrade browsing performance even further. IMO it is not worth investing time given those known downsides, plus it sends a really mixed message: decentralizing by means of a centralized server.
However, I like the fact that this idea (mapping URL2IPFS) comes back every few months, which means there is some potential to it.
So what is needed to make it "right"?
There are still pubsub performance and privacy problems to solve (e.g. publishing banking pages), but at least we don't rely on an HTTP server anymore. :)
The whole http2ipfs ticket is essentially an exercise in trust: whom are you going to trust to accept a pre-calculated HTTP hash from? BTW, the same question arises when choosing the initial seed servers to connect to. Maybe they have a solution that we can reuse..
I think we should adopt a self-hosted / trust-building hybrid approach. Something along the lines of:
ipfs-companion changes:
I would like to add a couple of clarifications to the idea:
Regarding the undefined group: URLs with a content MIME type or URL extension that has "possibly static" properties (.iso, .png, etc.) are categorized into the "most likely static" group. In the same way, URLs with a short caching time are categorized into the "dynamic" group. I am not sure whether URLs of a dynamic nature (.php, .asp, etc.) should go into the "dynamic" group; the same goes for sites behind Basic Auth protection.
The ipfs-companion should have a toggle (button?) on whether to treat URLs in the "undefined" group as dynamic or "most likely static". This way, users with bad rendering (as a result of stale content) can refresh the page "properly". IMHO this should be done on a per-site basis, as I doubt there are many such "dumb" sites (that use a static format for dynamic results). When this toggle is activated, the URL (or all "maybe static" URLs) of the site is marked as dynamic and published in the corresponding pubsub room(s).
URLs in the undefined group are not "published" in pubsub rooms; only "dynamic" and "most likely static" ones are.
For security-sensitive sites, there should be a toggle to mark all URLs as "dynamic". The toggle might ask for confirmation whether to "disable IPFS caching for the whole https://bankofamerica.com site".
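A rough TypeScript sketch of the grouping described in these clarifications; the extension lists, the "short caching time" threshold, and the conservative handling of `.php`/Basic Auth (which the text above leaves open) are all illustrative assumptions:

```typescript
type Group = 'dynamic' | 'most-likely-static' | 'undefined'

// Assumed lists/thresholds, not a spec.
const STATIC_EXT = /\.(iso|png|jpe?g|gif|mp3|mp4|woff2?|zip)$/i
const DYNAMIC_EXT = /\.(php|aspx?|cgi)$/i
const SHORT_CACHE_SECONDS = 60

function categorize(url: string, cacheControl: string | null, hasBasicAuth: boolean): Group {
  // Open question in the text: sites behind Basic Auth; treated as dynamic here.
  if (hasBasicAuth) return 'dynamic'

  // Short caching time => dynamic.
  const maxAge = cacheControl?.match(/max-age=(\d+)/i)
  if (maxAge && Number(maxAge[1]) < SHORT_CACHE_SECONDS) return 'dynamic'

  const path = new URL(url).pathname
  // Open question in the text: URLs of a dynamic nature; treated as dynamic here.
  if (DYNAMIC_EXT.test(path)) return 'dynamic'
  if (STATIC_EXT.test(path)) return 'most-likely-static'

  return 'undefined' // not published to pubsub rooms, per the note above
}
```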
> definitely static: headers="Cache-Control: public" or SRI hash is present
I think you meant `Cache-Control: immutable` (`public` is not even "most likely static", as it does not guarantee the content will not change at some point in the future; it just says it is safe for the current version to be cached for the time period specified in the `max-age` attribute).
> I doubt there are many such "dumb" sites (that use a static format for dynamic results)
Statically-generated blogs are a thing, and there are basically "static" websites generated on the fly by PHP or ASP. And a lot of websites hide `.php` or `.aspx` from URLs and hide vendor-specific headers, so you don't even know what is happening behind the scenes. All you can do is follow hints from generic HTTP headers such as `Cache-Control` and `ETag`.
The risk of breaking websites by over-caching is extremely high, and writing custom heuristics is a maintenance hell prone to false positives.
I feel the safe way to do it is to just follow the semantics of `Cache-Control` and `max-age` (if present).
This header is already respected by browsers and website owners and can be parsed as an indicator of whether a specific asset can be cached in IPFS. AFAIK all browsers (well, at least Chrome and Firefox) cache HTTPS content by default for some arbitrary time (if `Cache-Control` is missing), unless explicitly told not to cache via the `Cache-Control` header.
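To make that concrete, a hedged sketch of what "follow `Cache-Control` semantics" could look like; the default TTL used when the header is missing is an arbitrary assumption, and `no-cache` is treated conservatively (HTTP only requires revalidation, but the sketch skips mirroring entirely):

```typescript
interface CacheDecision {
  cacheable: boolean
  ttlSeconds: number
}

function cacheDecision(cacheControl: string | null): CacheDecision {
  const DEFAULT_TTL = 300 // assumed fallback when Cache-Control is missing

  if (cacheControl === null) return { cacheable: true, ttlSeconds: DEFAULT_TTL }

  const directives = cacheControl.toLowerCase().split(',').map(d => d.trim())

  // Conservative: anything the origin does not want cached (or shared) is skipped.
  if (directives.includes('no-store') || directives.includes('no-cache') || directives.includes('private')) {
    return { cacheable: false, ttlSeconds: 0 }
  }

  if (directives.includes('immutable')) {
    return { cacheable: true, ttlSeconds: Number.MAX_SAFE_INTEGER }
  }

  const maxAge = directives.find(d => d.startsWith('max-age='))
  const ttl = maxAge ? parseInt(maxAge.split('=')[1], 10) : DEFAULT_TTL
  return { cacheable: ttl > 0, ttlSeconds: ttl }
}
```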
> The ipfs-companion should have a toggle (button?) on whether to [..] refresh the page "properly".
Agreed, there should always be a way to exclude a website from the mirroring experiment.
> For security-sensitive sites, there should be a toggle to mark all URLs as "dynamic". The toggle might ask for confirmation whether to "disable IPFS caching for the whole https://bankofamerica.com site".
I am afraid a manual opt-out is not a safe way to do it.
A sensitive page already leaks to the public on the initial load; it is too late to mark it as "do not leak this" after the fact.
Some open problems related to security:
`Cache-Control: no-cache, no-store, must-revalidate, max-age=0` can be used to indicate that a sensitive page/asset should not be cached. But is it enough? I suspect there are websites that allow caching of sensitive content because it improves website speed, and they assume the cache is the browser cache that never leaves the user's machine. Let's clarify the groups according to a "shareability" attribute:
After that point, our only concern is content that many people (more than 20 users) have access to but that is still sensitive (protected by a password) or restricted by network (see below).
For that, the "dynamic" toggle button I mentioned above is used. Once clicked, it should prompt the user with a selectable list of the top-level domains seen on the page. The user then selects the domains that must be marked as "dynamic". This opens pubsub rooms keyed by the hash of each domain and publishes the domains as "dynamic" and thus never cacheable.
We need to think about what to do with evil actors who want to disable the IPFS cache for a foreign (competitor?) site.. or with someone who does not have a clue and selects all of them...
With such a conservative caching approach, I don't see any possible harm being done, do you?
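As a sketch of the per-domain pubsub rooms described above (the topic naming, payload format, and use of SHA-256 are assumptions; the pubsub calls are the standard `ipfs-http-client` ones):

```typescript
import { create } from 'ipfs-http-client'

const ipfs = create({ url: 'http://127.0.0.1:5001' }) // assumed local node

// Derive the room name from a hash of the domain, so the topic itself
// does not broadcast the domain in plain text (WebCrypto SHA-256).
async function domainTopic(domain: string): Promise<string> {
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(domain))
  const hex = Array.from(new Uint8Array(digest)).map(b => b.toString(16).padStart(2, '0')).join('')
  return `http2ipfs/dynamic/${hex}` // topic prefix is an arbitrary assumption
}

// Announce that every URL under this domain should be treated as "dynamic"
// (never cacheable), as the toggle described above would do.
async function markDomainDynamic(domain: string): Promise<void> {
  const topic = await domainTopic(domain)
  const payload = new TextEncoder().encode(JSON.stringify({ domain, dynamic: true }))
  await ipfs.pubsub.publish(topic, payload)
}
```

Note that this sketch does nothing about the evil-actor problem raised above; anyone could publish to the same room.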
I am a bit skeptical about finding an easy fix that protects users from sharing secrets or being fed invalid content by bad actors that can spawn multiple nodes. Those are complex problems that should not be underestimated.
We should play it safe and start with a simple opt-in experiment that has a smaller number of moving pieces and is easier to reason about:
- mirror only immutable assets (`Cache-Control: immutable` and things with SRI that don't have caching disabled via `Cache-Control`)
- opt-in per site (or with a `*` wildcard, but the user should make a conscious decision to set that)
- follow the `Cache-Control` semantics present in web browsers, namely respect the `max-age` set by the website and use the same defaults when `Cache-Control` is missing.

That being said, some potential blockers:
Overall, I am :+1: for shipping this opt-in experiment with Companion, but it won't happen this quarter (due to the state of pubsub, js-ipfs and other priorities). Of course PRs welcome :)
The only time-critical moment is checking whether a URL can be retrieved using IPFS. This is simple: hash the URL, then look up in the "local node directory" whether it has metadata for that URL hash and has the content. This should be quick enough to be done on the fly, during page load.
Is there an API to see whether the tab is open (i.e. the user is waiting for the page to load)? If we can see that it is not, then we can hint to IPFS that it can take some more time to retrieve content for "static" URLs.
In the same vein, URLs referenced in the page could be proactively "ipfs-tested" in the background, maybe even pre-fetched.
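A small sketch of that time-critical lookup, assuming a hypothetical in-memory "local node directory" keyed by a hash of the URL:

```typescript
// Hypothetical local directory: hash(URL) → CID of the mirrored content.
const localDirectory = new Map<string, string>()

async function hashUrl(url: string): Promise<string> {
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(url))
  return Array.from(new Uint8Array(digest)).map(b => b.toString(16).padStart(2, '0')).join('')
}

// Fast path run during page load: one hash plus an O(1) map lookup.
async function lookupLocal(url: string): Promise<string | null> {
  return localDirectory.get(await hashUrl(url)) ?? null
}

// Background path: URLs referenced by the page can be "ipfs-tested"
// (and even pre-fetched) without blocking rendering.
async function prefetch(urls: string[]): Promise<void> {
  await Promise.all(urls.map(async (u) => {
    const cid = await lookupLocal(u)
    if (cid) {
      // e.g. warm the local node's cache in the background (hypothetical helper):
      // await warmIpfsCache(cid)
    }
  }))
}
```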
Regarding WebExtension API limitations, I see two solutions:
BTW, on an HTTPS site, loading images over HTTP is allowed and will not cause a "mixed content" warning.
I wrote a proposal which would make this possible. It would allow users to archive websites on their nodes and sign them with different methods.
Other users would be able to find them and select an entity that they trust - for example, the Internet Archive.
https://discuss.ipfs.io/t/ipfs-records-for-urn-uri-resolving-via-a-dht/10456/4
By design, IPFS provides an excellent caching and acceleration mechanism. It would be nice to use IPFS as an HTTP accelerator. Obviously, it could be used for static files only.
My proposal is as follows:
Initially we might use the acceleration for obviously static files, such as .iso, .mp3, images...
I volunteer to do part 1. Can you help with 2? What do you think?