web-server/dispatch should facilitate handling trailing "/"s on URLs

LiberalArtist commented 5 years ago

I recently discovered that dispatch-rules and related forms make it hard to handle trailing /s on URLs sensibly.

The URL https://example.com/foo/ (with a trailing /) is treated differently than https://example.com/foo (without a trailing /). Strictly speaking, this is correct: those are two distinct URLs, and in principle one could serve completely unrelated content at each. In practice, though, users don't expect such URLs to be distinct. Serving different non-error content at one than the other is a terrible idea, and even returning a successful response from one variant but, e.g., a 404 error from the other (as my sight was doing) is likely to cause confusion. I expect most programs will want to either redirect the less-preferred variant to the canonical variant or simply treat both variants as equivalent.

Unfortunately, web-server/dispatch doesn't provide a great way to implement such behavior. Here's an example in code of the current state of affairs:

#lang racket

(require web-server/dispatch
         web-server/http
         web-server/servlet-env
         net/url
         rackunit)

(define ((handler str) req)
  (response/output
   (λ (out) (write-string str out))))

(define start
  (dispatch-case
   [() (handler "a")]
   [("") (handler "b")]
   [("foo") (handler "c")]
   [("foo" "") (handler "d")]
   [("foo" "" "") (handler "e")]
   [("foo" "" "bar") (handler "f")]))

(define (do-request str)
  (port->string (get-pure-port (string->url str))))

(check-equal?
 (let ([th (thread
            (λ ()
              (serve/servlet start
                             #:servlet-regexp #rx""
                             #:banner? #f
                             #:launch-browser? #f)))])
   (sleep 5)
   (begin0 (list (do-request "http://localhost:8000")
                 (do-request "http://localhost:8000/")
                 (do-request "http://localhost:8000/foo")
                 (do-request "http://localhost:8000/foo/")
                 (do-request "http://localhost:8000/foo//")
                 (do-request "http://localhost:8000/foo//bar"))
           (kill-thread th)))
 '("b" "b" "c" "d" "e" "f"))

This illustrates a few things:

Because of how HTTP requests work, the root URL is always equivalent with or without a trailing / and has a single, empty path element ("").
() is allowed as a pattern, but (due to the above) will never match anything. I think it might be better to make this a syntax error (or a warning that will become an error in a future release).
A trailing / adds an empty path element ("") to the end of the URL.
There is no cleanse-path-like case for multiple adjacent / separators.

I have changed my application to handle trailing / separators by using dispatch-rules+applies instead of dispatch-rules (which involved removing my else clause) and, if the original request does not satisfy the predicate from dispatch-rules+applies but a version with a normalized path would, responding with a redirect to the normalized path.

I think it would be better to extend web-server/dispatch to support this sort of thing, but I'm not sure yet what the best way would be to do so. A few things I've been thinking about so far:

The current language for dispatch-rules patterns doesn't have a notion of "splicing" patterns: each string literal or bidi-match-expander use applies to a single path element.
Simply giving the same response to all variants seems relatively straight-forward to add as a keyword option. However, that is (at least arguably) less optimal than redirecting to the canonical URL. Redirects open another can of worms, though, since 301 Moved Permanently has issues for methods other than GET and HEAD and 308 Permanent Redirect doesn't have a straightforward fallback for older browsers. I would want to do a 301 Moved Permanently for GET and HEAD requests and 307 Temporary Redirect for other methods, which RFC 7231 suggests is reasonable. The higher-level question is the trade-off between a simple API that "does the right thing" and configurability.
Building on that theme, there are some cases other than trailing / support where a similar ability to handle variants on canonical URLs would be nice: for example, case-insensitivity. I can imagine possible extensions to the dispatch-rules API to add a general concept of non-canonical URLs, and it is appealing from one perspective to avoid making trailing / support a baked-in special feature. On the other hand, though, there seems a risk of getting beyond the scope of this library.

jeapostrophe commented 5 years ago

I think the right thing to do is for you not use the result of dispatch-rules as your start, but instead use another function that does this URL rewriting on the request object and then calls the dispatch-rules function. Is there a reason you can't do that?

LiberalArtist commented 5 years ago

That is what I am doing now. I think for some reason I had originally thought that function on the right-hand side of a dispatch-rules clause was required to return a response?, but I see now that I can just return #f, so I don't need the additional complexity of dispatch-rules+applies.

As I've started diving into the weeds, I think I agree that dispatch-rules should not bake in this URL rewriting functionality, because the specifics of what rewriting you want to do (let alone how you want to respond to acceptable but non-canonical URLs) seem likely to be application-specific. There might be a few sensible choices common enough to be reusable, but working out what they are and a good API would take some experimentation before I'd want them in the standard library.

I'm still working on this for my own application: writing this issue revealed ways my URL handling still wasn't working as expected even after my first attempt at a work-around. My current plan is to:

Implement a function simplify-url, analogous to simplify-path. It would combine the unexported function remove-dot-segments from net/url-string (which implements RFC 3986 §5.2.4) with a pass to eliminate redundant path separators. (I think, by analogy with simplify-path, that should be "except for a single trailing separator," which an application might or might not want to throw away.)
Document the behavior of the dispatch-rules family with respect to these subtleties and the nature of request URLs supplied by the web server library, so that others don't get surprised. I think, on further consideration, I would rather note in the documentation that () is not (""), rather than making it a syntax error or warning, since it could match a manually-created request? object.

LiberalArtist commented 5 years ago

Related to my proposed simplify-url, I noticed the following match cases in web-server/dispatchers/filesystem-map, and I'm not sure that I understand why it treats "." differently than "..": https://github.com/racket/web-server/blob/894613d7c52a7b25ec1d05ac9414bb5bd94dd17a/web-server-lib/web-server/dispatchers/filesystem-map.rkt#L25-L28

As illustrated in the program below, "", ".", 'same, "..", and 'up are all distinct URL path elements, though apparently url->path treats "" as 'same (should it?). On the other hand, url->path raises an exception for URLs containing "." or ".." elements. (In that case, I think it is probably correct, but it shouldn't report the error in terms of bytes->path-element.)

I'm not sure if this code should treat "." and ".." as equivalent to 'same and 'up or not, but I would think it should do the same in both cases.

I guess I would have expected url->path to be consistent with url->string, other than the inherent differences (e.g. that filesystem paths can't percent-encode arbitrary path elements). If there is a good reason for them to behave differently, maybe my proposed simplify-url should have a mode to select which of the two behaviors to use.

LiberalArtist commented 5 years ago

(Apparently I neglected to paste the example above ...)

#lang racket

(require net/url-string
         rackunit)

(define (path-element->url elem)
   (url "https" #f "example.com" #f #t
        (list (path/param "a" null)
              (path/param elem null)
              (path/param "b" null))
        null #f))

(for ([pr '(["" . "https://example.com/a//b"]
            ["." . "https://example.com/a/%2e/b"]
            [same . "https://example.com/a/./b"]
            [".." . "https://example.com/a/%2e%2e/b"]
            [up . "https://example.com/a/../b"])])
  (match-define (cons elem rendered) pr)
  (check-equal? (url->string (path-element->url elem))
                rendered))

(check-equal? (url->path (path-element->url ""))
              (url->path (path-element->url 'same)))

(for ([special '("." "..")])
  (check-exn #rx"bytes->path-element.*names a special element"
             (λ () (url->path (path-element->url special)))))

madkins23 commented 1 year ago

I fixed this by adding an extra string-arg element to the dispatch rule and then treating it as a dummy argument in the handler function. Pretty hacky and you may need multiple dispatch rules (my case is always going to have a trailing slash) but it may be simpler to implement than other suggestions.

soegaard commented 10 months ago

A note in the documentation on dispatch-rules might be a good idea.

soegaard commented 6 months ago

Just a quick note to say, that in many situations it is simpler to redirect to the canonical url.

With these dispatch rules

   [("blog" "" (string-arg) ...)                  do-blog]           ; /blog/
   [("blog")                                      redirect-to-blog]  ; /blog   -> /blog/

(define (redirect-to-blog req)
  (redirect-to "/blog/"))

Any requests to /blog will be redirected to /blog/.

If the blog contains references to stylesheets using relative urls then the canonical url works best.

racket / web-server

web-server/dispatch should facilitate handling trailing "/"s on URLs #59