skliarie opened this issue 6 years ago
You can use https://hash-archive.org/
No, I cannot use hash-archive.org, as it lacks the necessary API. Also, the implementation requires close cooperation with the site owner, and currently I don't see hash-archive doing this. I will drop them a note though.
The API is not documented: https://github.com/btrask/hash-archive/blob/9d97fb7b87674094ff36b4bf4ed1f46f05f734fa/src/server.c#L129
- `/api/enqueue/{url}`
- `/api/history/{url}`
- `/api/sources/{hash}`
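For context, a minimal TypeScript sketch of how those endpoints could be called; since the API is undocumented, the response shape is unknown and left as `unknown` here (an assumption, not a documented contract):

```typescript
// Sketch only: builds the endpoint URL and returns whatever JSON the server sends back.
async function hashArchive(endpoint: 'enqueue' | 'history', url: string): Promise<unknown> {
  const res = await fetch(`https://hash-archive.org/api/${endpoint}/${encodeURIComponent(url)}`)
  if (!res.ok) throw new Error(`hash-archive returned HTTP ${res.status}`)
  return res.json()
}

// e.g. await hashArchive('history', 'https://example.com/file.iso')
```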
> obviously static files, such as .iso, .mp3, images...
Sadly, the only "static" files are ones with an SRI hash or a `Cache-Control: public, (..) immutable` header.
Everything else on the HTTP-based web is implicitly mutable (by which I mean we have to assume it can change between two requests).
@skliarie Some notes on this can be found at https://github.com/ipfs-shipyard/ipfs-companion/issues/96#issue-143726690; re-pasting them here:
- Automatic mirroring of resources
  - IMMUTABLE assets: very limited feasibility, so far only two types of immutable resources exist on the web:
    - JS, CSS etc. marked with an SRI hash (Subresource Integrity) (mapping SRI→CID)
    - URLs explicitly marked as immutable via `Cache-Control: public, (..) immutable` (mapping URL→CID)
  - MUTABLE assets: what if we add every page to IPFS, storing a mapping between URL and CID, so that if a page disappears we could fall back to the IPFS version?
    - a can of worms: a safe version would be like web.archive.org but limited to a local machine; sharing the cache with other people would require a centralized mapping service (single point of failure, vector for privacy leaks)
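For illustration only, a minimal TypeScript sketch of the IMMUTABLE-assets case above, assuming a local node reachable via `ipfs-http-client`; the `mapping` store and the function names are hypothetical:

```typescript
import { create } from 'ipfs-http-client'

// Assumes a local IPFS node with its HTTP API on the default port.
const ipfs = create({ url: 'http://127.0.0.1:5001' })

// Hypothetical local store for SRI→CID and URL→CID mappings.
const mapping = new Map<string, string>()

// A subresource is considered immutable if it carries an SRI hash
// or its response is explicitly marked immutable by the origin.
function isImmutable(sri: string | null, cacheControl: string | null): boolean {
  if (sri && sri.trim().length > 0) return true
  return cacheControl !== null && /\bimmutable\b/i.test(cacheControl)
}

async function mirrorIfImmutable(
  url: string,
  body: Uint8Array,
  sri: string | null,
  cacheControl: string | null
): Promise<string | null> {
  if (!isImmutable(sri, cacheControl)) return null
  const { cid } = await ipfs.add(body)   // content-addressed copy on the local node
  const key = sri ?? url                 // key by SRI if present, otherwise by URL
  mapping.set(key, cid.toString())
  return cid.toString()
}
```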
To reiterate, mapping mutable content under some URL to an IPFS CID requires a centralized index, which introduces a huge vector for privacy leaks and a single point of failure. Not to mention the index would basically be DDoSed if our user base grows too fast, and if we base lookups on HTTP requests, that will degrade browsing performance even further. IMO it is not worth investing time given those known downsides, plus it sends a really mixed message: decentralizing by means of a centralized server.
However, I like the fact that this idea (mapping URL2IPFS) comes back every few months, which means there is some potential to it.
So what is needed to make it "right"?
There are still pubsub performance and privacy problems to solve (e.g. publishing banking pages), but at least we don't rely on an HTTP server anymore. :)
The whole http2ipfs ticket is essentially an exercise in trust: whom are you going to trust to accept a pre-calculated HTTP hash from? BTW, the same question arises when choosing the initial seed servers to connect to. Maybe they have a solution that we can reuse..
I think we should adopt a self-hosted / trust-building hybrid approach. Something along the lines of:
ipfs-companion changes:
I would like to add a couple of clarifications to the idea:
Regarding the undefined group: URLs with a content MIME type or URL extension that has "possibly static" properties (.iso, .png, etc.) are categorized into the "most likely static" group. In the same way, URLs with a short caching time are categorized into the "dynamic" group. I am not sure whether URLs of a dynamic nature (.php, .asp, etc.) should go into the "dynamic" group; the same goes for sites behind Basic Auth protection.
The ipfs-companion should have a toggle (button?) on whether to treat URLs in the "undefined" group as dynamic or "most likely static". This way, users with bad rendering (as a result of stale content) can refresh the page "properly". IMHO this should be done on a per-site basis, as I doubt there are many such "dumb" sites (that use a static format for dynamic results). When this toggle is activated, the URL (or all "maybe static" URLs) of the site is marked as dynamic and published in the corresponding pubsub room(s).
URLs in the undefined group are not "published" in pubsub rooms; only "dynamic" and "most likely static" ones are.
For security-sensitive sites, there should be a toggle to mark all URLs as "dynamic". The toggle might ask for confirmation whether to "disable IPFS caching for the whole https://bankofamerica.com site".
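A rough TypeScript sketch of the grouping described in these clarifications; the extension lists, the "short caching time" threshold, and the conservative handling of `.php`/Basic Auth (which the text above leaves open) are all illustrative assumptions:

```typescript
type Group = 'dynamic' | 'most-likely-static' | 'undefined'

// Assumed lists/thresholds, not a spec.
const STATIC_EXT = /\.(iso|png|jpe?g|gif|mp3|mp4|woff2?|zip)$/i
const DYNAMIC_EXT = /\.(php|aspx?|cgi)$/i
const SHORT_CACHE_SECONDS = 60

function categorize(url: string, cacheControl: string | null, hasBasicAuth: boolean): Group {
  // Open question in the text: sites behind Basic Auth; treated as dynamic here.
  if (hasBasicAuth) return 'dynamic'

  // Short caching time => dynamic.
  const maxAge = cacheControl?.match(/max-age=(\d+)/i)
  if (maxAge && Number(maxAge[1]) < SHORT_CACHE_SECONDS) return 'dynamic'

  const path = new URL(url).pathname
  // Open question in the text: URLs of a dynamic nature; treated as dynamic here.
  if (DYNAMIC_EXT.test(path)) return 'dynamic'
  if (STATIC_EXT.test(path)) return 'most-likely-static'

  return 'undefined' // not published to pubsub rooms, per the note above
}
```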
> definitely static: headers="Cache-Control: public" or SRI hash is present
I think you meant `Cache-Control: immutable` (`public` is not even "most likely static", as it does not guarantee the content will not change at some point in the future; it just says it is safe for the current version to be cached for the time period specified in the `max-age` attribute).
> I doubt there are many such "dumb" sites (that use a static format for dynamic results)
Statically-generated blogs are a thing, and there are basically "static" websites generated on the fly by PHP or ASP. And a lot of websites hide `.php` or `.aspx` from URLs and hide vendor-specific headers, so you don't even know what is happening behind the scenes. All you can do is follow hints from generic HTTP headers such as `Cache-Control` and `ETag`.
The risk of breaking websites by over-caching is extremely high, and writing custom heuristics is a maintenance hell prone to false positives.
I feel the safe way to do it is to just follow the semantics of `Cache-Control` and `max-age` (if present).
This header is already respected by browsers and website owners and can be parsed as an indicator of whether a specific asset can be cached in IPFS. AFAIK all browsers (well, at least Chrome and Firefox) cache HTTPS content by default for some arbitrary time (if `Cache-Control` is missing), unless explicitly told not to cache via the `Cache-Control` header.
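To make that concrete, a hedged sketch of what "follow `Cache-Control` semantics" could look like; the default TTL used when the header is missing is an arbitrary assumption, and `no-cache` is treated conservatively (HTTP only requires revalidation, but the sketch skips mirroring entirely):

```typescript
interface CacheDecision {
  cacheable: boolean
  ttlSeconds: number
}

function cacheDecision(cacheControl: string | null): CacheDecision {
  const DEFAULT_TTL = 300 // assumed fallback when Cache-Control is missing

  if (cacheControl === null) return { cacheable: true, ttlSeconds: DEFAULT_TTL }

  const directives = cacheControl.toLowerCase().split(',').map(d => d.trim())

  // Conservative: anything the origin does not want cached (or shared) is skipped.
  if (directives.includes('no-store') || directives.includes('no-cache') || directives.includes('private')) {
    return { cacheable: false, ttlSeconds: 0 }
  }

  if (directives.includes('immutable')) {
    return { cacheable: true, ttlSeconds: Number.MAX_SAFE_INTEGER }
  }

  const maxAge = directives.find(d => d.startsWith('max-age='))
  const ttl = maxAge ? parseInt(maxAge.split('=')[1], 10) : DEFAULT_TTL
  return { cacheable: ttl > 0, ttlSeconds: ttl }
}
```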
> The ipfs-companion should have a toggle (button?) on whether to [..] refresh the page "properly".
Agreed, there should always be a way to exclude a website from the mirroring experiment.
> For security-sensitive sites, there should be a toggle to mark all URLs as "dynamic". The toggle might ask for confirmation whether to "disable IPFS caching for the whole https://bankofamerica.com site".
I am afraid a manual opt-out is not a safe way to do it.
A sensitive page already leaks to the public on the initial load; it is too late to mark it as "do not leak this" after the fact.
Some open problems related to security:
`Cache-Control: no-cache, no-store, must-revalidate, max-age=0` can be used to indicate that a sensitive page/asset should not be cached. But is it enough? I suspect there are websites that allow caching of sensitive content because it improves website speed, and they assume the cache is the browser cache that never leaves the user's machine. Let's clarify the groups according to a "shareability" attribute:
After that point, our only concern is content that many people (more than 20 users) have access to but that is still sensitive (protected by a password) or restricted by network (see below).
For that, the "dynamic" toggle button I mentioned above is used. Once clicked, it should prompt the user with a selectable list of the top-level domains seen on the page. The user then selects the domains that must be marked as "dynamic". This opens pubsub rooms keyed by the hash of each domain and publishes the domains as "dynamic" and thus never cacheable.
We need to think about what to do with evil actors who want to disable the IPFS cache for a foreign (competitor?) site.. or with someone who does not have a clue and selects all of them...
With such a conservative caching approach, I don't see any possible harm being done, do you?
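As a sketch of the per-domain pubsub rooms described above (the topic naming, payload format, and use of SHA-256 are assumptions; the pubsub calls are the standard `ipfs-http-client` ones):

```typescript
import { create } from 'ipfs-http-client'

const ipfs = create({ url: 'http://127.0.0.1:5001' }) // assumed local node

// Derive the room name from a hash of the domain, so the topic itself
// does not broadcast the domain in plain text (WebCrypto SHA-256).
async function domainTopic(domain: string): Promise<string> {
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(domain))
  const hex = Array.from(new Uint8Array(digest)).map(b => b.toString(16).padStart(2, '0')).join('')
  return `http2ipfs/dynamic/${hex}` // topic prefix is an arbitrary assumption
}

// Announce that every URL under this domain should be treated as "dynamic"
// (never cacheable), as the toggle described above would do.
async function markDomainDynamic(domain: string): Promise<void> {
  const topic = await domainTopic(domain)
  const payload = new TextEncoder().encode(JSON.stringify({ domain, dynamic: true }))
  await ipfs.pubsub.publish(topic, payload)
}
```

Note that this sketch does nothing about the evil-actor problem raised above; anyone could publish to the same room.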
I am a bit skeptical about finding an easy fix that protects users from sharing secrets or being fed invalid content by bad actors that can spawn multiple nodes. Those are complex problems that should not be underestimated.
We should play it safe and start with a simple opt-in experiment that has a smaller number of moving pieces and is easier to reason about:
- mirror only immutable assets (`Cache-Control: immutable` and things with SRI that don't have caching disabled via `Cache-Control`)
- opt-in per site (or with a `*` wildcard, but the user should make a conscious decision to set that)
- follow the `Cache-Control` semantics present in web browsers, namely respect the `max-age` set by the website and use the same defaults when `Cache-Control` is missing.

That being said, some potential blockers:
Overall, I am :+1: for shipping this opt-in experiment with Companion, but it won't happen this quarter (due to the state of pubsub, js-ipfs and other priorities). Of course PRs welcome :)
The only time-critical moment is checking whether a URL can be retrieved using IPFS. This is simple: hash the URL, then look up in the "local node directory" whether it has metadata for that URL hash and has the content. This should be quick enough to be done on the fly, during page load.
Is there an API to see whether the tab is open (i.e. the user is waiting for the page to load)? If we can see that it is not, then we can hint to IPFS that it can take some more time to retrieve content for "static" URLs.
In the same vein, URLs referenced in the page could be proactively "ipfs-tested" in the background, maybe even pre-fetched.
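A small sketch of that time-critical lookup, assuming a hypothetical in-memory "local node directory" keyed by a hash of the URL:

```typescript
// Hypothetical local directory: hash(URL) → CID of the mirrored content.
const localDirectory = new Map<string, string>()

async function hashUrl(url: string): Promise<string> {
  const digest = await crypto.subtle.digest('SHA-256', new TextEncoder().encode(url))
  return Array.from(new Uint8Array(digest)).map(b => b.toString(16).padStart(2, '0')).join('')
}

// Fast path run during page load: one hash plus an O(1) map lookup.
async function lookupLocal(url: string): Promise<string | null> {
  return localDirectory.get(await hashUrl(url)) ?? null
}

// Background path: URLs referenced by the page can be "ipfs-tested"
// (and even pre-fetched) without blocking rendering.
async function prefetch(urls: string[]): Promise<void> {
  await Promise.all(urls.map(async (u) => {
    const cid = await lookupLocal(u)
    if (cid) {
      // e.g. warm the local node's cache in the background (hypothetical helper):
      // await warmIpfsCache(cid)
    }
  }))
}
```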
Regarding WebExtension API limitations, I see two solutions:
BTW, on an HTTPS site, loading images over HTTP is allowed and will not cause a "mixed content" warning.
I wrote a proposal which would make this possible. It would allow users to archive websites on their nodes and sign them with different methods.
Other users would be able to find them and select an entity that they trust - for example, the Internet Archive.
https://discuss.ipfs.io/t/ipfs-records-for-urn-uri-resolving-via-a-dht/10456/4
By design, IPFS provides an excellent caching and acceleration mechanism. It would be nice to use IPFS as an HTTP accelerator. Obviously, it could be used for static files only.
My proposal is as follows:
Initially we might use the acceleration for obviously static files, such as .iso, .mp3, images...
I volunteer to do part 1. Can you help with 2? What do you think?