ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/
Other
15.83k stars 2.96k forks source link

Gateway renders pretty 404 pages if available #4233

Closed jphastings closed 4 years ago

jphastings commented 6 years ago

I noticed ipfs/ipfs#167 and thought this might be an elegant solution for allowing the gateway to show helpful information when requested files aren't found.

When a requested file isn't found ipfs-404.html is looked for in the same directory, rolling up through any of its parents, and displayed (without immutable cache headers) if present.

Run this patch locally and query the following for a demonstration:

whyrusleeping commented 6 years ago

Cool feature. cc @lgierth @diasdavid

whyrusleeping commented 6 years ago

As for tests, you probably want to add something in core/corehttp/gateway_test.go, some of the tests in there construct directories, in particular TestIPNSHostnameRedirect creates a directory with an index.html to test that working properly. I imagine a test for this 404 feature would be quite similar.

jphastings commented 6 years ago

Thanks for the advice @whyrusleeping, hopefully this covers the new cases. Shout if there's anything else I can do to help!

whyrusleeping commented 6 years ago

Thinking about this, what is the need for not setting a cache? Everything is content addressed, right? There should be no need not to cache.

@dignifiedquire @VictorBjelkholm I think you guys might be interested in this PR

Kubuxu commented 6 years ago

Thinking about this, what is the need for not setting a cache? Everything is content addressed, right? There should be no need not to cache.

Choosing ETag for the cache is not simple in this case as the 404.htmlis taken for reverse recursive walk but I think using CID of the resolved 404.html should work but I am not 100% sure.

We would also have to resolve the 404.html when client caches it to confirm the ETag.

jphastings commented 6 years ago

@whyrusleeping I'm thinking of the case where I request a non-existant file from the gateway today; I don't want my browser's cache to be prevented from trying again, say, tomorrow when that DNSlink entry/peerID points to a new hash where that file now exists.

Right now this PR ensures the gateway sets the ETag of a requested but missing file to be the Cid of the pretty 404 page that was found (as @Kubuxu suggests). Smart browsers can then include it in the If-None-Match header of their next request to the gateway for this file which will avoid sending unnecessary data. I reckon this is all we need to do here to allow smart clients to request transfer of the bare minimum to show the most recent data.

Does this address your concern?

daviddias commented 6 years ago

Thinking about this, what is the need for not setting a cache? Everything is content addressed, right? There should be no need not to cache.

What if the gateway fails to find the file, timeouts and returns a 404 with cache-control: unlimited? An existing hash->file can be cached forever and the corollary is that a non-existing hash->file will someday become an existing hash->file.

victorb commented 6 years ago

If you load ipfs.io/ipfs/$HASH/$FILE and $FILE isn't found, a redirect to a 404 page under $HASH would always be a 404.

As both Etag and Cache-Control uses the URL, it has to be a redirect and if it redirects to $HASH/404.html, that redirect would always happen in the future.

Introducing IPNS would make things work differently though, and probably in that case it's best to dump the Cache-Control header. However, if IPFS isn't being used with IPNS, the 404 should be permanent.

What if the gateway fails to find the file, timeouts and returns a 404 with cache-control: unlimited?

In the case of ipfs.io/ipfs/$HASH/$FILE, don't we already see all the links of $HASH and if we cannot see the match for $FILE, we can safely show the 404 page?

ivan386 commented 6 years ago

As both Etag and Cache-Control uses the URL, it has to be a redirect and if it redirects to $HASH/404.html, that redirect would always happen in the future.

No redirect for 404 page please. I need to know original url where 404 happen. Then i can redirect to right location by Javascript.

I think about .htaccess for ipfs.

jphastings commented 6 years ago

There's no redirecting happening at the moment. I believe @VictorBjelkholm was talking about internally, ie. the data is being delivered when a missing file is requested.

In the case of ipfs.io/ipfs/$HASH/$FILE, don't we already see all the links of $HASH and if we cannot see the match for $FILE, we can safely show the 404 page?

I agree — in this case we know that $FILE definitely doesn't exist under the $HASH and we can safely deliver the data from an appropriate 404.html page from that URL provided we also set the cache headers to be non-permanent and the ETag to be of the data delivered (ie. the 404.html page) so that if $FILE becomes present (in the case where we're addressing via an IPNS link) the new data can be retrieved.

Further to this, in the case of ipfs.io/ipfs/$HASH/$FILE (ie. not IPNS), we could also deliver the data with an immutable cache header, as the contents of the $HASH will never change. If we like the idea of this I can change this conditional so that it allows setting the immutable header for /ipfs paths with 404s. In fact, we could set the immutable cache header for any /ipfs path, including for directories (which is not currently the case on HEAD). The only argument against this is that we are also declaring this business logic to be immutable (if we ever decided to move from 404.html to 404_error.html as the page too look for, what a customer sees would be dependent on which version of IPFS they first viewed that URL with).

Because of this I'm up for leaving the 404.html page without the immutable cache header, but leave the ETag in place, as it stands. It's extremely unlikely we'd ever need to change the 404.html to something else, but better to not force immutability on the business logic at this stage I feel!

mitra42 commented 6 years ago

I think doing this at the Gateway is the wrong place, because if I've understood this correctly, its indicating to the caller (which might be an application, not a browser) that its retrieved the page. Shouldnt this be part of the logic in whatever is calling the Gateway (browser) to decide what to do when the page isn't found.

For example ... I can see situations where if the file can't be delivered by the Gateway, the app decides to go to somewhere else (such as an Archive) to try a different way to retrieve the same content.

Stebalien commented 6 years ago

@mitra42 The gateway will still return a 404 status so the application can decide what to do with it.

However, I believe this should be checking the client's Accept header. If the client doesn't want an HTML page, we shouldn't be returning an HTML 404 page.

jphastings commented 6 years ago

That's an excellent idea @Stebalien; only if the request header matches text/html should this functionality be present - I'll add that to this PR now!

@mitra42 while I agree with your point in pure terms, my understanding of the Gateway's purpose is that it should bring some of the benefits of IPFS to browsers which don't have native IPFS support, which leads me to two counter arguments:

I hope that makes sense!

mitra42 commented 6 years ago

Agreed,

jphastings commented 6 years ago

It might be worth noting that if a person wishing to use this feature wants their pretty 404 page's assets to render correctly without specifying absolute paths (eg. if they want the 404s to work with accessing the file from an IPFS or IPNS URL like gateway.server/ipfs/HASH/missing_resource as well as from a DNS link like example.com/missing_resource) then some logic would need to be applied to give the browser enough information to know where to get assets from (as neither absolute nor relative paths will work on their own). I've found this snippet works well, though it will need to be tailored to how assets are set up:

<head>
  <!-- things that don't refernce assets -->
  <script>
    const ipfsPath = /^\/ip[fn]s\/(.+?)\//.exec(window.location.pathname);
    const basePath = (ipfsPath == null) ? '/' : ipfsPath[0];
    document.write('<base href="' + basePath + '"/>');
  </script>
  <!-- things that reference assets, eg. CSS and JS -->
</head>
jphastings commented 6 years ago

@whyrusleeping is there anything further you'd like me to alter on this PR?

whyrusleeping commented 6 years ago

This LGTM on a technical level. Now we need to get a signoff from the others on whether or not this is something we really want.

cc @diasdavid @dignifiedquire

jphastings commented 6 years ago

Thanks @Kubuxu, I've made those changes.

ivan386 commented 6 years ago

Maybe in some file like .htaccess will be set right file name and content type for 404,index and other special pages? And we can use it for other site settings in future. File: .htaccess ErrorDocument 404 /.404.html

Will be one hardcoded file name with settings for site or directory.

jphastings commented 6 years ago

@ivan386 .htaccess files are extremely complex in what they support, using that format for describing this single file would introduce confusion as to why other .htaccess features weren't supported.

@magik6k the use of 404.html seems conventional (Jekyll, terraform, most static site generators) but it does seem like it might be annoying to have to rename pages on something as intra-linked as wikipedia.

I'd be up for .404.html as that seems close to the conventions, but is even more unlikely to be used in the normal course of a website's design.

Kubuxu commented 6 years ago

Problem I see with naming it just 404.html is that someone might just have a page named like this. This would be the case with english dump of wikipedia and this: https://en.wikipedia.org/wiki/404 page.

jphastings commented 6 years ago

@Kubuxu Yeah, this makes sense to me. I've changed the file to .404.html (with the preceding dot) which should make it explicit enough to not clash in all but the most obscure of cases.

ivan386 commented 6 years ago

Ok. Let's use JSON. File: .site.conf

{
  "ErrorDocument": {
    "404": ".404.html",
    "504": ".504.html"
  }
  "IndexDocument: [
    "index.html",
    "index.svg",
    "index.rss",
    "index.xml"
  ]
}
  1. Find .site.conf
  2. Read 404 path
  3. Return 404 page
jphastings commented 6 years ago

I'm not sure this PR benefits from the added complexity you're proposing @ivan386. Using a file (an .htaccess file or a .site.conf file or any other format) to specify which files to download and return when certain conditions are met means that the IPFS gateway needs to retrieve an extra file, as well as parsing it and coping with any errors in both those situations.

I'm a firm believer that the best code is the code you don't need to write; the code I've submitted here is as concise as I think I can make it while still returning a pretty 404 document, meaning fewer concepts for the next engineer to wrap their head around or augment in the future. For example, if we defined a .site.conf file format, then later wanted to change how it worked, the gateway code would have to be backward compatible and would grow yet further in size and complexity.

The benefits are limited too, as index.html is already returned when a directory is requested and in the case where a 504 would be returned there's a strong likelihood that the files within the specified multihash are less available, which means the likelihood of retrieving either your proposed .site.conf file or the linked .504.html would also be much lower.

jphastings commented 6 years ago

@dignifiedquire or @diasdavid — have you had a chance to review this PR? Summary so far:

The PR's description remains accurate: this PR implements functionality within the gateway such that if a file is requested over HTTP that doesn't exist within an IPFS directory over, text/html is Acceptable, and .404.html exists in this or a parent directory, then it is returned with suitable caching and HTTP status code.

This accompanies the existing / => index.html functionality to allow IPFs directories to present a friendlier interface to humans.

Other comments on this thread include verifying the approach and changing to .404.html as the magic filename, as it's unlikely to clash with anything else. Please let me know if there's anything I can do to help.

ghost commented 6 years ago

About caching of 404 responses: I think we should cache these in the long term, but there are a few edge cases around directories and redirects that we should figure out first. Until then I'd like to be cautious with caching.

jphastings commented 6 years ago

Thanks for the careful review @lgierth; my apologies for the stretch between your reply and my changes. I've made the alterations you suggested (ipfs-404.html, making searchUpTreeFor404 a method on GatewayHandler), but I've not figured out what you were suggesting on the third — I'll be happy to make changes if you're up for giving some more information.

As to your point about caching, though the first line of the diff seems like I'm altering caching functionality, that's only present as I've separated the concerns of caching (now done through the preventCache instead of the dir variable) and directory listing (still operating via dir) — let me know if my understanding of what's going on is wrong!

jphastings commented 5 years ago

Hi @lgierth; happy new year! I think I figured out what it was you were looking for — that a suitable 404 page should show instead of a directory listing. I've added that as a second commit here (for ease of review), and also refactored the first commit on top of master.

I've had some difficulty with the CI (I can't figure out why this build failed, it passes locally), but I think this PR is back to being in a useful state again. Would you mind having another look?

magik6k commented 5 years ago

Don't worry about Travis, it's failing on master too - https://github.com/ipfs/go-ipfs/issues/5845

jphastings commented 5 years ago

I just noticed that rendering a 404 will output http: multiple response.WriteHeader calls to stderr. This is because (*gatewayHandler).serveFile calls http.ServeContent, which always calls w.WriteHeader(http.StatusOK) (which is after the 404 is header is set).

This doesn't impact the status code delivered, but it is annoying. I don't want to change how (*gatewayHandler).serveFile works in this PR (it's already been open a fair while!) I'll be happy to have a look at changing how that works in a second PR once this is merged.

jphastings commented 5 years ago

This PR is finally green again; I've updated the tests to use the new helpers and tended to the comments which @lgierth left. Is there anything further you'd like me to do here?

jphastings commented 4 years ago

Hey folks, I've fixed the conflicts with master — can I get some feedback on why this PR isn't being addressed?

Stebalien commented 4 years ago

Thank you for the reminder, this fell off our radar. I've added it to the go-ipfs 0.6.0 milestone so we don't forget it.

jphastings commented 4 years ago

Thanks for all the great feedback @Stebalien! I've been writing Go at work since about a month after I learned it to write this PR, I'll get my head back in this PR this weekend and address your points.

Stebalien commented 4 years ago

No, thank you and sorry this dropped off our radar. Feel free to bug us (or just ping me directly) if this happens again.

And take your time. We're now on a fixed release schedule so if it doesn't make it into the next release, it'll make it into the following.

jphastings commented 4 years ago

This should cover the changes you suggested @Stebalien — feedback is welcome! ☺️

Stebalien commented 4 years ago

:partying_face:

Thanks for your patience on this, this has been a long time coming.

Lekensteyn commented 3 years ago

If I am not mistaken, this change implies a lookup for ipfs-404.html on every attempt to obtain a directory listing, and subsequently breaks directory listings if the file is found. Is that intentional?

In core/corehttp/gateway_test.go, I think that the following tests are wrong:

{"/deeper/", "text/html", http.StatusNotFound, "Deep custom 404"},
{"/deeper", "text/html", http.StatusNotFound, "Deep custom 404"},

These and / should probably end up with a directory listing.

jphastings commented 3 years ago

That was intentional by me — the approach that web servers like NGinx take is index.html goes first, a 404 page goes second, then a directory listing if there's no 404 defined. I worked to echo that here.

Lekensteyn commented 3 years ago

That sounds rather unusual. By default directory indexes are disabled unless autoindex is changed. The error_page directive only affects the page shown on errors, it does not affect the routing logic (whether to display an index or not).

Did you have a custom config such as try_files $uri $uri/index.html =404; in mind?

With the current approach, there is no way to obtain a directory index if ipfs-404.html exists.

jphastings commented 3 years ago

That's a good point; I guess it comes down to whether IPFS wants directory listings to always be possible to retrieve on the HTTP gateway.

Given this is my first PR to IPFS I don't feel like I'm qualified to make such a judgement call (beyond what I'd personally like) — so maybe we can lean on @Stebalien?

Lekensteyn commented 3 years ago

If directory listings are not desirable, one could add an explicit index.html file to hide the contents. There is probably no point in using this custom 404 page feature for hiding files since people can just query the directory directly, that is where IPFS differs from nginx.

I am also new here and do not know if special files such as index.html and ipfs-404.html are documented anywhere, but that would be nice to have :-)

Stebalien commented 3 years ago

Hm. Yeah, I didn't fully think through the ramifications of this. I actually have at least one case where I want to support custom 404 pages and directory listings (dist.ipfs.io).

DavidBurela commented 3 years ago

I love that there is thinking around supporting 404 pages. I've published my blog to IPFS/ENS and lacking 404 was something I wanted e.g. http://blog.davidburela.eth/broken

But because it is a blog with pages in this folder format

/
  /2019
    /01
    /02
    /03
    ...
  /2020
    /01
    /02

it means I'd need to put a ipfs-404.html in every single subfolder. Maybe specifying it once at root of a merkle tree is enough? <cid>/ipfs-404.html

Stebalien commented 3 years ago

Specifying it once at the root should be enough. The gateway will search all parent directories up the path for an ipfs-404.html file.