Closed jphastings closed 4 years ago
Cool feature. cc @lgierth @diasdavid
As for tests, you probably want to add something in core/corehttp/gateway_test.go, some of the tests in there construct directories, in particular TestIPNSHostnameRedirect creates a directory with an index.html to test that working properly. I imagine a test for this 404 feature would be quite similar.
Thanks for the advice @whyrusleeping, hopefully this covers the new cases. Shout if there's anything else I can do to help!
Thinking about this, what is the need for not setting a cache? Everything is content addressed, right? There should be no need not to cache.
@dignifiedquire @VictorBjelkholm I think you guys might be interested in this PR
Thinking about this, what is the need for not setting a cache? Everything is content addressed, right? There should be no need not to cache.
Choosing ETag for the cache is not simple in this case as the 404.html
is taken for reverse recursive walk but I think using CID of the resolved 404.html
should work but I am not 100% sure.
We would also have to resolve the 404.html
when client caches it to confirm the ETag.
@whyrusleeping I'm thinking of the case where I request a non-existant file from the gateway today; I don't want my browser's cache to be prevented from trying again, say, tomorrow when that DNSlink entry/peerID points to a new hash where that file now exists.
Right now this PR ensures the gateway sets the ETag of a requested but missing file to be the Cid of the pretty 404 page that was found (as @Kubuxu suggests). Smart browsers can then include it in the If-None-Match
header of their next request to the gateway for this file which will avoid sending unnecessary data. I reckon this is all we need to do here to allow smart clients to request transfer of the bare minimum to show the most recent data.
Does this address your concern?
Thinking about this, what is the need for not setting a cache? Everything is content addressed, right? There should be no need not to cache.
What if the gateway fails to find the file, timeouts and returns a 404 with cache-control: unlimited? An existing hash->file can be cached forever and the corollary is that a non-existing hash->file will someday become an existing hash->file.
If you load ipfs.io/ipfs/$HASH/$FILE
and $FILE
isn't found, a redirect to a 404 page under $HASH
would always be a 404.
As both Etag and Cache-Control uses the URL, it has to be a redirect and if it redirects to $HASH/404.html
, that redirect would always happen in the future.
Introducing IPNS would make things work differently though, and probably in that case it's best to dump the Cache-Control header. However, if IPFS isn't being used with IPNS, the 404 should be permanent.
What if the gateway fails to find the file, timeouts and returns a 404 with cache-control: unlimited?
In the case of ipfs.io/ipfs/$HASH/$FILE
, don't we already see all the links of $HASH
and if we cannot see the match for $FILE, we can safely show the 404 page?
As both Etag and Cache-Control uses the URL, it has to be a redirect and if it redirects to $HASH/404.html, that redirect would always happen in the future.
No redirect for 404 page please. I need to know original url where 404 happen. Then i can redirect to right location by Javascript.
I think about .htaccess for ipfs.
There's no redirecting happening at the moment. I believe @VictorBjelkholm was talking about internally, ie. the data is being delivered when a missing file is requested.
In the case of ipfs.io/ipfs/$HASH/$FILE, don't we already see all the links of $HASH and if we cannot see the match for $FILE, we can safely show the 404 page?
I agree — in this case we know that $FILE
definitely doesn't exist under the $HASH
and we can safely deliver the data from an appropriate 404.html
page from that URL provided we also set the cache headers to be non-permanent and the ETag to be of the data delivered (ie. the 404.html
page) so that if $FILE
becomes present (in the case where we're addressing via an IPNS link) the new data can be retrieved.
Further to this, in the case of ipfs.io/ipfs/$HASH/$FILE
(ie. not IPNS), we could also deliver the data with an immutable cache header, as the contents of the $HASH
will never change. If we like the idea of this I can change this conditional so that it allows setting the immutable header for /ipfs
paths with 404s. In fact, we could set the immutable cache header for any /ipfs
path, including for directories (which is not currently the case on HEAD). The only argument against this is that we are also declaring this business logic to be immutable (if we ever decided to move from 404.html
to 404_error.html
as the page too look for, what a customer sees would be dependent on which version of IPFS they first viewed that URL with).
Because of this I'm up for leaving the 404.html page without the immutable cache header, but leave the ETag in place, as it stands. It's extremely unlikely we'd ever need to change the 404.html
to something else, but better to not force immutability on the business logic at this stage I feel!
I think doing this at the Gateway is the wrong place, because if I've understood this correctly, its indicating to the caller (which might be an application, not a browser) that its retrieved the page. Shouldnt this be part of the logic in whatever is calling the Gateway (browser) to decide what to do when the page isn't found.
For example ... I can see situations where if the file can't be delivered by the Gateway, the app decides to go to somewhere else (such as an Archive) to try a different way to retrieve the same content.
@mitra42 The gateway will still return a 404 status so the application can decide what to do with it.
However, I believe this should be checking the client's Accept header. If the client doesn't want an HTML page, we shouldn't be returning an HTML 404 page.
That's an excellent idea @Stebalien; only if the request header matches text/html
should this functionality be present - I'll add that to this PR now!
@mitra42 while I agree with your point in pure terms, my understanding of the Gateway's purpose is that it should bring some of the benefits of IPFS to browsers which don't have native IPFS support, which leads me to two counter arguments:
I hope that makes sense!
Agreed,
It might be worth noting that if a person wishing to use this feature wants their pretty 404 page's assets to render correctly without specifying absolute paths (eg. if they want the 404s to work with accessing the file from an IPFS or IPNS URL like gateway.server/ipfs/HASH/missing_resource
as well as from a DNS link like example.com/missing_resource
) then some logic would need to be applied to give the browser enough information to know where to get assets from (as neither absolute nor relative paths will work on their own). I've found this snippet works well, though it will need to be tailored to how assets are set up:
<head>
<!-- things that don't refernce assets -->
<script>
const ipfsPath = /^\/ip[fn]s\/(.+?)\//.exec(window.location.pathname);
const basePath = (ipfsPath == null) ? '/' : ipfsPath[0];
document.write('<base href="' + basePath + '"/>');
</script>
<!-- things that reference assets, eg. CSS and JS -->
</head>
@whyrusleeping is there anything further you'd like me to alter on this PR?
This LGTM on a technical level. Now we need to get a signoff from the others on whether or not this is something we really want.
cc @diasdavid @dignifiedquire
Thanks @Kubuxu, I've made those changes.
Maybe in some file like .htaccess will be set right file name and content type for 404,index and other special pages? And we can use it for other site settings in future.
File: .htaccess
ErrorDocument 404 /.404.html
Will be one hardcoded file name with settings for site or directory.
@ivan386 .htaccess
files are extremely complex in what they support, using that format for describing this single file would introduce confusion as to why other .htaccess
features weren't supported.
@magik6k the use of 404.html
seems conventional (Jekyll, terraform, most static site generators) but it does seem like it might be annoying to have to rename pages on something as intra-linked as wikipedia.
I'd be up for .404.html
as that seems close to the conventions, but is even more unlikely to be used in the normal course of a website's design.
Problem I see with naming it just 404.html
is that someone might just have a page named like this. This would be the case with english dump of wikipedia and this: https://en.wikipedia.org/wiki/404 page.
@Kubuxu Yeah, this makes sense to me. I've changed the file to .404.html
(with the preceding dot) which should make it explicit enough to not clash in all but the most obscure of cases.
Ok. Let's use JSON. File: .site.conf
{
"ErrorDocument": {
"404": ".404.html",
"504": ".504.html"
}
"IndexDocument: [
"index.html",
"index.svg",
"index.rss",
"index.xml"
]
}
I'm not sure this PR benefits from the added complexity you're proposing @ivan386. Using a file (an .htaccess
file or a .site.conf
file or any other format) to specify which files to download and return when certain conditions are met means that the IPFS gateway needs to retrieve an extra file, as well as parsing it and coping with any errors in both those situations.
I'm a firm believer that the best code is the code you don't need to write; the code I've submitted here is as concise as I think I can make it while still returning a pretty 404 document, meaning fewer concepts for the next engineer to wrap their head around or augment in the future. For example, if we defined a .site.conf
file format, then later wanted to change how it worked, the gateway code would have to be backward compatible and would grow yet further in size and complexity.
The benefits are limited too, as index.html
is already returned when a directory is requested and in the case where a 504 would be returned there's a strong likelihood that the files within the specified multihash are less available, which means the likelihood of retrieving either your proposed .site.conf
file or the linked .504.html
would also be much lower.
@dignifiedquire or @diasdavid — have you had a chance to review this PR? Summary so far:
The PR's description remains accurate: this PR implements functionality within the gateway such that if a file is requested over HTTP that doesn't exist within an IPFS directory over, text/html
is Accept
able, and .404.html
exists in this or a parent directory, then it is returned with suitable caching and HTTP status code.
This accompanies the existing /
=> index.html
functionality to allow IPFs directories to present a friendlier interface to humans.
Other comments on this thread include verifying the approach and changing to .404.html
as the magic filename, as it's unlikely to clash with anything else. Please let me know if there's anything I can do to help.
About caching of 404 responses: I think we should cache these in the long term, but there are a few edge cases around directories and redirects that we should figure out first. Until then I'd like to be cautious with caching.
Thanks for the careful review @lgierth; my apologies for the stretch between your reply and my changes. I've made the alterations you suggested (ipfs-404.html
, making searchUpTreeFor404
a method on GatewayHandler
), but I've not figured out what you were suggesting on the third — I'll be happy to make changes if you're up for giving some more information.
As to your point about caching, though the first line of the diff seems like I'm altering caching functionality, that's only present as I've separated the concerns of caching (now done through the preventCache
instead of the dir
variable) and directory listing (still operating via dir
) — let me know if my understanding of what's going on is wrong!
Hi @lgierth; happy new year! I think I figured out what it was you were looking for — that a suitable 404 page should show instead of a directory listing. I've added that as a second commit here (for ease of review), and also refactored the first commit on top of master.
I've had some difficulty with the CI (I can't figure out why this build failed, it passes locally), but I think this PR is back to being in a useful state again. Would you mind having another look?
Don't worry about Travis, it's failing on master too - https://github.com/ipfs/go-ipfs/issues/5845
I just noticed that rendering a 404 will output http: multiple response.WriteHeader calls
to stderr. This is because (*gatewayHandler).serveFile
calls http.ServeContent
, which always calls w.WriteHeader(http.StatusOK)
(which is after the 404 is header is set).
This doesn't impact the status code delivered, but it is annoying. I don't want to change how (*gatewayHandler).serveFile
works in this PR (it's already been open a fair while!) I'll be happy to have a look at changing how that works in a second PR once this is merged.
This PR is finally green again; I've updated the tests to use the new helpers and tended to the comments which @lgierth left. Is there anything further you'd like me to do here?
Hey folks, I've fixed the conflicts with master — can I get some feedback on why this PR isn't being addressed?
Thank you for the reminder, this fell off our radar. I've added it to the go-ipfs 0.6.0 milestone so we don't forget it.
Thanks for all the great feedback @Stebalien! I've been writing Go at work since about a month after I learned it to write this PR, I'll get my head back in this PR this weekend and address your points.
No, thank you and sorry this dropped off our radar. Feel free to bug us (or just ping me directly) if this happens again.
And take your time. We're now on a fixed release schedule so if it doesn't make it into the next release, it'll make it into the following.
This should cover the changes you suggested @Stebalien — feedback is welcome! ☺️
:partying_face:
Thanks for your patience on this, this has been a long time coming.
If I am not mistaken, this change implies a lookup for ipfs-404.html
on every attempt to obtain a directory listing, and subsequently breaks directory listings if the file is found. Is that intentional?
In core/corehttp/gateway_test.go, I think that the following tests are wrong:
{"/deeper/", "text/html", http.StatusNotFound, "Deep custom 404"},
{"/deeper", "text/html", http.StatusNotFound, "Deep custom 404"},
These and /
should probably end up with a directory listing.
That was intentional by me — the approach that web servers like NGinx take is index.html
goes first, a 404 page goes second, then a directory listing if there's no 404 defined. I worked to echo that here.
That sounds rather unusual. By default directory indexes are disabled unless autoindex is changed. The error_page directive only affects the page shown on errors, it does not affect the routing logic (whether to display an index or not).
Did you have a custom config such as try_files $uri $uri/index.html =404;
in mind?
With the current approach, there is no way to obtain a directory index if ipfs-404.html
exists.
That's a good point; I guess it comes down to whether IPFS wants directory listings to always be possible to retrieve on the HTTP gateway.
Given this is my first PR to IPFS I don't feel like I'm qualified to make such a judgement call (beyond what I'd personally like) — so maybe we can lean on @Stebalien?
If directory listings are not desirable, one could add an explicit index.html file to hide the contents. There is probably no point in using this custom 404 page feature for hiding files since people can just query the directory directly, that is where IPFS differs from nginx.
I am also new here and do not know if special files such as index.html and ipfs-404.html are documented anywhere, but that would be nice to have :-)
Hm. Yeah, I didn't fully think through the ramifications of this. I actually have at least one case where I want to support custom 404 pages and directory listings (dist.ipfs.io).
I love that there is thinking around supporting 404 pages.
I've published my blog to IPFS/ENS and lacking 404 was something I wanted
e.g. http://blog.davidburela.eth/broken
But because it is a blog with pages in this folder format
/
/2019
/01
/02
/03
...
/2020
/01
/02
it means I'd need to put a ipfs-404.html
in every single subfolder. Maybe specifying it once at root of a merkle tree is enough?
<cid>/ipfs-404.html
Specifying it once at the root should be enough. The gateway will search all parent directories up the path for an ipfs-404.html file.
I noticed ipfs/ipfs#167 and thought this might be an elegant solution for allowing the gateway to show helpful information when requested files aren't found.
When a requested file isn't found
ipfs-404.html
is looked for in the same directory, rolling up through any of its parents, and displayed (without immutable cache headers) if present.Run this patch locally and query the following for a demonstration: