ipfs / ipfs-companion

Browser extension that simplifies access to IPFS resources on the web
https://docs.ipfs.tech/install/ipfs-companion/

Automatic mirroring of HTTP websites to IPFS as you browse them #535

Open skliarie opened 6 years ago

skliarie commented 6 years ago

By design, IPFS provides an excellent caching and acceleration mechanism. It would be nice to use IPFS as an HTTP accelerator. Obviously, it could only be used for static files.

My proposal is as follows:

  1. Create a secure hash-storing website, "IPFS2HTTP". It would store the URL, IPFS hash, and expiration time.
     1.1. Provide an API that takes a URL and returns the IPFS hash and expiration time.
     1.2. Have "workers" that retrieve unknown or expired URLs, add them to IPFS, and keep the hash in an internal DB.
  2. Add an option to ipfs-companion to enable "http2ipfs" translation, in the following way (a rough sketch follows the list):
     2.1. For each never-seen-before URL, do this (with a 5 s timeout):
          2.1.1. Query the "IPFS2HTTP" site with the URL and see if it has a hash for it. If so, get the hash, fetch the content over IPFS, and return it to the user; fall back to HTTP if there is no such hash.
     2.2. If the URL has already been seen, check its time to live (as received from the IPFS2HTTP site) and use the IPFS hash if possible. Otherwise, retrieve a new hash as in 2.1.1.
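A minimal sketch of what step 2 could look like inside the extension. The IPFS2HTTP endpoint, its JSON response shape, and the local gateway URL are assumptions made up for illustration, not a settled design:

```typescript
// Hypothetical sketch of the http2ipfs lookup flow (step 2 above).
// The IPFS2HTTP endpoint, its response shape, and the local gateway
// URL are placeholders for illustration only.

interface Ipfs2HttpEntry {
  cid: string        // IPFS hash for the URL's content
  expiresAt: number  // unix timestamp (seconds) when the mapping goes stale
}

const cache = new Map<string, Ipfs2HttpEntry>()

async function fetchViaIpfsOrHttp (url: string): Promise<Response> {
  const cached = cache.get(url)
  if (cached && cached.expiresAt > Date.now() / 1000) {
    // 2.2: URL already seen and the mapping is still fresh
    return fetch(`http://127.0.0.1:8080/ipfs/${cached.cid}`)
  }
  try {
    // 2.1.1: ask the (hypothetical) IPFS2HTTP service, with a 5 s timeout
    const lookup = await fetch(
      `https://ipfs2http.example/api/lookup?url=${encodeURIComponent(url)}`,
      { signal: AbortSignal.timeout(5000) }
    )
    if (lookup.ok) {
      const entry: Ipfs2HttpEntry = await lookup.json()
      cache.set(url, entry)
      return fetch(`http://127.0.0.1:8080/ipfs/${entry.cid}`)
    }
  } catch {
    // lookup failed or timed out -> fall through to plain HTTP
  }
  return fetch(url) // fallback: regular HTTP
}
```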

Initially we might use the acceleration for obviously static files, such as .iso, .mp3, images...

I volunteer to do part 1. Can you help with part 2? What do you think?

ivan386 commented 6 years ago

You can use https://hash-archive.org/

skliarie commented 6 years ago

No, I cannot use hash-archive.org, as it lacks the necessary API. Also, the implementation requires close cooperation with the site owner, and currently I don't see hash-archive doing this. I will drop them a note, though.

ivan386 commented 6 years ago

The API is not documented: https://github.com/btrask/hash-archive/blob/9d97fb7b87674094ff36b4bf4ed1f46f05f734fa/src/server.c#L129

/api/enqueue/{url}
/api/history/{url}
/api/sources/{hash}
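For reference, a minimal sketch of querying one of these endpoints from JS. Whether the URL must be percent-encoded and what the JSON response looks like are not documented in this thread, so both are assumptions here:

```typescript
// Sketch: hit the undocumented hash-archive.org history endpoint listed above.
// URL encoding and the response shape are assumptions; the result is treated
// as opaque JSON.
async function hashArchiveHistory (url: string): Promise<unknown> {
  const res = await fetch(`https://hash-archive.org/api/history/${encodeURIComponent(url)}`)
  if (!res.ok) throw new Error(`hash-archive returned HTTP ${res.status}`)
  return res.json()
}

hashArchiveHistory('https://example.com/some/file.iso')
  .then(history => console.log(history))
  .catch(console.error)
```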
lidel commented 6 years ago

obviously static files, such as .iso, .mp3, images...

Sadly, the only "static" files are the ones with an SRI hash or a Cache-Control: public, (..) immutable header. Everything else on the HTTP-based web is implicitly mutable (by which I mean we have to assume it can change between two requests).

@skliarie Some notes on this can be found at https://github.com/ipfs-shipyard/ipfs-companion/issues/96#issue-143726690, re-pasting them here:

  • Automatic mirroring of resources
    • IMMUTABLE assets: very limited feasibility; so far only two types of immutable resources exist on the web:
      • JS, CSS, etc. marked with an SRI (Subresource Integrity) hash (mapping SRI→CID)
      • URLs for things explicitly marked as immutable via Cache-Control: public, (..) immutable (mapping URL→CID)
    • MUTABLE assets: what if we added every page to IPFS and stored a mapping between URL and CID, so that if a page disappears we could fall back to the IPFS version?
      • a can of worms: a safe version would work like web.archive.org but be limited to the local machine; sharing the cache with other people would require a centralized mapping service (a single point of failure and a vector for privacy leaks)
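To make the two "immutable" cases above concrete, here is a rough sketch of how an extension could detect them. This is not actual Companion code; it only checks the two signals mentioned above (SRI attributes and the immutable Cache-Control directive):

```typescript
// Sketch only (not actual Companion code): detect the two "immutable" signals
// described above.
declare const browser: any // WebExtension API (e.g. via webextension-polyfill)

// (1) Content-script side: collect SRI hashes that could be mapped SRI→CID.
function collectSriHashes (): string[] {
  const nodes = document.querySelectorAll('script[integrity], link[rel="stylesheet"][integrity]')
  return Array.from(nodes, el => el.getAttribute('integrity') ?? '')
}

// (2) Background side: flag responses marked Cache-Control: ... immutable (URL→CID).
browser.webRequest.onHeadersReceived.addListener(
  (details: any) => {
    const cacheControl = (details.responseHeaders ?? [])
      .find((h: any) => h.name.toLowerCase() === 'cache-control')?.value ?? ''
    if (/\bimmutable\b/i.test(cacheControl)) {
      // Candidate for mirroring; actually adding/pinning it to IPFS is out of scope here.
      console.log('immutable asset candidate:', details.url)
    }
  },
  { urls: ['<all_urls>'] },
  ['responseHeaders']
)
```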

To reiterate, mapping mutable content under some URL to an IPFS CID requires a centralized index, which introduces a huge vector for privacy leaks and a single point of failure. Not to mention the index would basically be DDoSed if our user base grows too fast, and if we base lookups on HTTP requests, that will degrade browsing performance even further. IMO it is not worth investing time given those known downsides; plus, it sends a really mixed message to decentralize by means of a centralized server.

However, I like the fact that this idea (mapping URL2IPFS) comes back every few months, which means there is some potential to it.

So what is needed to make it "right"?

There are still pubsub performance and privacy problems to solve (e.g. publishing banking pages), but at least we don't rely on an HTTP server anymore. :)
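Just to make the pubsub idea concrete, here is a minimal sketch of what URL→CID announcements could look like. The topic name, message shape, and the js-ipfs-style pubsub calls are assumptions for illustration, and this deliberately ignores the performance and privacy problems mentioned above:

```typescript
// Sketch only: URL -> CID announcements over pubsub, assuming a js-ipfs-style
// `ipfs.pubsub` API. Topic name and message shape are made up for illustration.
declare const ipfs: any // e.g. a js-ipfs instance available to the extension

const TOPIC = 'x-url2ipfs-experiment'

interface Announcement {
  url: string
  cid: string
  seenAt: number
}

async function announceMapping (url: string, cid: string): Promise<void> {
  const msg: Announcement = { url, cid, seenAt: Date.now() }
  await ipfs.pubsub.publish(TOPIC, new TextEncoder().encode(JSON.stringify(msg)))
}

async function listenForMappings (onMapping: (a: Announcement) => void): Promise<void> {
  await ipfs.pubsub.subscribe(TOPIC, (msg: { data: Uint8Array }) => {
    try {
      onMapping(JSON.parse(new TextDecoder().decode(msg.data)))
    } catch {
      // ignore malformed announcements from other peers
    }
  })
}
```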

skliarie commented 6 years ago

The whole http2ipfs ticket is essentially an exercise in trust: whom are you going to believe when accepting a pre-calculated hash for HTTP content? BTW, the same question arises when choosing the initial seed servers to connect to. Maybe they have a solution that we can reuse.

I think we should adopt a self-hosted/trust-building hybrid approach, something along these lines:

ipfs-companion changes:

skliarie commented 6 years ago

I would like to add a couple of clarifications to the idea:

The undefined group: URLs whose content MIME type or URL extension has "possibly static" properties (.iso, .png, etc.) are categorized into the "most likely static" group. In the same way, URLs with a short caching time are categorized into the "dynamic" group. I am not sure whether URLs of a dynamic nature (.php, .asp, etc.) should be categorized into the "dynamic" group; the same goes for sites with Basic Auth protection.
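A rough sketch of that categorization. The extension lists and the max-age cutoff below are placeholders for illustration, not a proposed final set:

```typescript
// Sketch of the grouping described above. Extension lists and the
// "short caching time" cutoff are placeholder values.
type UrlGroup = 'most likely static' | 'dynamic' | 'undefined'

const LIKELY_STATIC_EXT = ['.iso', '.png', '.jpg', '.mp3', '.zip']
const DYNAMIC_EXT = ['.php', '.asp', '.aspx'] // unsure whether these belong here, see above
const SHORT_CACHE_SECONDS = 60                // placeholder cutoff for "short caching time"

function classify (url: string, cacheControl: string): UrlGroup {
  const path = new URL(url).pathname.toLowerCase()
  const maxAge = /max-age=(\d+)/.exec(cacheControl)
  if (maxAge && Number(maxAge[1]) < SHORT_CACHE_SECONDS) return 'dynamic'
  if (DYNAMIC_EXT.some(ext => path.endsWith(ext))) return 'dynamic'
  if (LIKELY_STATIC_EXT.some(ext => path.endsWith(ext))) return 'most likely static'
  return 'undefined'
}
```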

ipfs-companion should have a toggle (button?) controlling whether to treat URLs in the "undefined" group as dynamic or "most likely static". This way, users who see broken rendering (as a result of stale content) can refresh the page "properly". IMHO this should be done on a per-site basis, as I doubt there are many such "dumb" sites (that use a static format for dynamic results). When this toggle is activated, the URL (or all "maybe static" URLs) of the site is marked as dynamic and published in the corresponding pubsub room(s).

URLs in the undefined group are not "published" in pubsub rooms; only "dynamic" and "most likely static" ones are.

For security-sensitive sites, there should be a toggle to mark all URLs as "dynamic". The toggle might ask for confirmation, e.g. "disable IPFS caching for the whole https://bankofamerica.com site?".

lidel commented 6 years ago

definitely static: headers="Cache-Control: public" or SRI hash is present

I think you meant Cache-Control: immutable (public is not even "most likely static", as it does not guarantee the content will not change at some point in the future; it just says it is safe for the current version to be cached for the time period specified in the max-age attribute)

I doubt there are many such "dumb" sites (that use static format for dynamic results)

Statically-generated blogs are a thing, and there are basically-"static" websites generated on the fly by PHP or ASP. A lot of websites also hide .php or .aspx from URLs and strip vendor-specific headers, so you don't even know what is happening behind the scenes. All you can do is follow hints from generic HTTP headers such as Cache-Control and ETag.

The risk of breaking websites by over-caching is extremely high, and writing custom heuristics is a maintenance hell prone to false positives.

I feel the safe way to do it is to just follow the semantics of Cache-Control and max-age (if present). This header is already respected by browsers and website owners and could be parsed as an indicator of whether a specific asset can be cached in IPFS. AFAIK all browsers (well, at least Chrome and Firefox) cache HTTPS content by default for some arbitrary time (if Cache-Control is missing), unless explicitly told not to cache via the Cache-Control header.
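That could be as simple as parsing the existing header and nothing else. A minimal sketch, where the fallback TTL for a missing header is an arbitrary placeholder:

```typescript
// Sketch: decide IPFS cacheability purely from Cache-Control semantics.
// The fallback TTL for a missing header is an arbitrary placeholder.
interface CacheDecision {
  cacheable: boolean
  ttlSeconds: number
}

function decideFromCacheControl (cacheControl: string | undefined): CacheDecision {
  const cc = (cacheControl ?? '').toLowerCase()
  // Treat anything the origin marked private/no-store/no-cache conservatively.
  if (/\b(no-store|no-cache|private)\b/.test(cc)) {
    return { cacheable: false, ttlSeconds: 0 }
  }
  const maxAge = /max-age=(\d+)/.exec(cc)
  if (maxAge) {
    return { cacheable: true, ttlSeconds: Number(maxAge[1]) }
  }
  // Header missing: browsers cache for "some arbitrary time"; pick a placeholder.
  return { cacheable: true, ttlSeconds: 300 }
}
```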

The ipfs-companion should have toggle (button?) on whether to [..] refresh the page "properly".

Agreed, there should always be a way to exclude a website from the mirroring experiment.

For security sensitive sites, there should be toggle to mark all URLs as "dynamic" ones. The toggle might ask for confirmation whether to "disable IPFS caching for the whole https://bankofamerica.com site".

I am afraid manual opt-out is not a safe way to do it.
A sensitive page would already leak into the public on the initial load; it is too late to mark it as "do not leak this" after the fact.

Some open problems related to security:

skliarie commented 6 years ago

Let's clarify the groups according to a "shareability" attribute:

After that point, our concern is only with content that many users (more than 20) have access to, but that is still sensitive (protected by a password) or restricted by network (see below).

For that, the "dynamic" toggle button I mentioned above is used. Once clicked, it should prompt the user with a selectable list of top-level domains seen on the page. The user then selects the domains that must be marked as "dynamic". This will open pubsub rooms keyed by the hash of each domain and publish those domains as "dynamic", and thus never cacheable.
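A small sketch of how a room name could be derived from a domain hash so the raw domain is not broadcast in the topic name itself. The topic prefix and payload are made-up placeholders:

```typescript
// Sketch: derive a pubsub room name from the SHA-256 of a domain, so the topic
// itself does not expose the raw domain. Prefix and payload are placeholders.
async function roomForDomain (domain: string): Promise<string> {
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(domain))
  const hex = Array.from(new Uint8Array(digest), b => b.toString(16).padStart(2, '0')).join('')
  return `x-url2ipfs-domain-${hex}`
}

declare const ipfs: any // js-ipfs-style instance, as in the earlier sketch

async function markDomainDynamic (domain: string): Promise<void> {
  const topic = await roomForDomain(domain)
  await ipfs.pubsub.publish(topic, new TextEncoder().encode(JSON.stringify({ dynamic: true })))
}
```

Note that hashing only hides the domain from observers who do not already know it; anyone who can guess the domain can derive the same topic.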

We need to think about what to do with evil actors who want to disable the IPFS cache for a foreign (competitor's?) site, or with someone who does not have a clue and selects all of the domains...

With such a conservative caching approach, I don't see any possible harm done, do you?

lidel commented 6 years ago

I am a bit skeptical about finding an easy fix that protects users from sharing secrets or being fed invalid content by bad actors that can spawn multiple nodes. Those are complex problems that should not be underestimated.

We should play it safe and start with a simple opt-in experiment that has a smaller number of moving pieces and is easier to reason about:

That being said, some potential blockers:

Overall, I am :+1: for shipping this opt-in experiment with Companion, but it won't happen this quarter (due to the state of pubsub, js-ipfs and other priorities). Of course, PRs welcome :)

skliarie commented 6 years ago

The only time-critical moment is checking whether a URL can be retrieved using IPFS. This is simple: hash the URL and look up in the "local node directory" whether it has metadata for that URL's content hash and has the content itself. This should be quick enough to be done on the fly, during page load.

Is there an API to see whether the tab is open (i.e. the user is waiting for the page to load)? If we can see that it is not, then we can hint to IPFS that it can take some more time to retrieve content for "static" URLs.

In the same vein, URLs referenced in the page could be proactively "ipfs-tested" in the background, maybe even pre-fetched.
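For what it's worth, the WebExtension tabs API can report whether a given tab is currently the focused one, so a longer IPFS timeout could be applied to background tabs. A rough sketch, where the two timeout values are arbitrary placeholders:

```typescript
// Sketch: pick a longer IPFS retrieval timeout when the requesting tab is not
// in the foreground. Timeout values are arbitrary placeholders.
declare const browser: any // WebExtension API (e.g. via webextension-polyfill)

async function ipfsTimeoutForTab (tabId: number): Promise<number> {
  if (tabId < 0) return 30_000 // request not tied to a tab (e.g. background fetch)
  const tab = await browser.tabs.get(tabId)
  // `active` is true when the tab is the selected one in its window,
  // i.e. the user is most likely waiting for it to load.
  return tab.active ? 5_000 : 30_000
}
```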

Regarding WebExtension API limitations, I see two solutions:

BTW, on an HTTPS site, loading images over plain HTTP counts as passive mixed content: browsers generally still load it (it is not hard-blocked the way scripts are), though it can downgrade the page's security indicator.

RubenKelevra commented 3 years ago

I wrote a proposal which would make this possible. It would allow users to archive websites on their nodes and sign them with different methods.

Other users would be able to find them and select an entity which they trust, for example the Internet Archive.

https://discuss.ipfs.io/t/ipfs-records-for-urn-uri-resolving-via-a-dht/10456/4