racket / web-server

Other
90 stars 47 forks source link

web-server/dispatch should facilitate handling trailing "/"s on URLs #59

Open LiberalArtist opened 5 years ago

LiberalArtist commented 5 years ago

I recently discovered that dispatch-rules and related forms make it hard to handle trailing /s on URLs sensibly.

The URL https://example.com/foo/ (with a trailing /) is treated differently than https://example.com/foo (without a trailing /). Strictly speaking, this is correct: those are two distinct URLs, and in principle one could serve completely unrelated content at each. In practice, though, users don't expect such URLs to be distinct. Serving different non-error content at one than the other is a terrible idea, and even returning a successful response from one variant but, e.g., a 404 error from the other (as my sight was doing) is likely to cause confusion. I expect most programs will want to either redirect the less-preferred variant to the canonical variant or simply treat both variants as equivalent.

Unfortunately, web-server/dispatch doesn't provide a great way to implement such behavior. Here's an example in code of the current state of affairs:

#lang racket

(require web-server/dispatch
         web-server/http
         web-server/servlet-env
         net/url
         rackunit)

(define ((handler str) req)
  (response/output
   (λ (out) (write-string str out))))

(define start
  (dispatch-case
   [() (handler "a")]
   [("") (handler "b")]
   [("foo") (handler "c")]
   [("foo" "") (handler "d")]
   [("foo" "" "") (handler "e")]
   [("foo" "" "bar") (handler "f")]))

(define (do-request str)
  (port->string (get-pure-port (string->url str))))

(check-equal?
 (let ([th (thread
            (λ ()
              (serve/servlet start
                             #:servlet-regexp #rx""
                             #:banner? #f
                             #:launch-browser? #f)))])
   (sleep 5)
   (begin0 (list (do-request "http://localhost:8000")
                 (do-request "http://localhost:8000/")
                 (do-request "http://localhost:8000/foo")
                 (do-request "http://localhost:8000/foo/")
                 (do-request "http://localhost:8000/foo//")
                 (do-request "http://localhost:8000/foo//bar"))
           (kill-thread th)))
 '("b" "b" "c" "d" "e" "f"))

This illustrates a few things:

I have changed my application to handle trailing / separators by using dispatch-rules+applies instead of dispatch-rules (which involved removing my else clause) and, if the original request does not satisfy the predicate from dispatch-rules+applies but a version with a normalized path would, responding with a redirect to the normalized path.

I think it would be better to extend web-server/dispatch to support this sort of thing, but I'm not sure yet what the best way would be to do so. A few things I've been thinking about so far:

jeapostrophe commented 5 years ago

I think the right thing to do is for you not use the result of dispatch-rules as your start, but instead use another function that does this URL rewriting on the request object and then calls the dispatch-rules function. Is there a reason you can't do that?

LiberalArtist commented 5 years ago

That is what I am doing now. I think for some reason I had originally thought that function on the right-hand side of a dispatch-rules clause was required to return a response?, but I see now that I can just return #f, so I don't need the additional complexity of dispatch-rules+applies.

As I've started diving into the weeds, I think I agree that dispatch-rules should not bake in this URL rewriting functionality, because the specifics of what rewriting you want to do (let alone how you want to respond to acceptable but non-canonical URLs) seem likely to be application-specific. There might be a few sensible choices common enough to be reusable, but working out what they are and a good API would take some experimentation before I'd want them in the standard library.

I'm still working on this for my own application: writing this issue revealed ways my URL handling still wasn't working as expected even after my first attempt at a work-around. My current plan is to:

  1. Implement a function simplify-url, analogous to simplify-path. It would combine the unexported function remove-dot-segments from net/url-string (which implements RFC 3986 §5.2.4) with a pass to eliminate redundant path separators. (I think, by analogy with simplify-path, that should be "except for a single trailing separator," which an application might or might not want to throw away.)
  2. Document the behavior of the dispatch-rules family with respect to these subtleties and the nature of request URLs supplied by the web server library, so that others don't get surprised. I think, on further consideration, I would rather note in the documentation that () is not (""), rather than making it a syntax error or warning, since it could match a manually-created request? object.
LiberalArtist commented 5 years ago

Related to my proposed simplify-url, I noticed the following match cases in web-server/dispatchers/filesystem-map, and I'm not sure that I understand why it treats "." differently than "..": https://github.com/racket/web-server/blob/894613d7c52a7b25ec1d05ac9414bb5bd94dd17a/web-server-lib/web-server/dispatchers/filesystem-map.rkt#L25-L28

As illustrated in the program below, "", ".", 'same, "..", and 'up are all distinct URL path elements, though apparently url->path treats "" as 'same (should it?). On the other hand, url->path raises an exception for URLs containing "." or ".." elements. (In that case, I think it is probably correct, but it shouldn't report the error in terms of bytes->path-element.)

I'm not sure if this code should treat "." and ".." as equivalent to 'same and 'up or not, but I would think it should do the same in both cases.

I guess I would have expected url->path to be consistent with url->string, other than the inherent differences (e.g. that filesystem paths can't percent-encode arbitrary path elements). If there is a good reason for them to behave differently, maybe my proposed simplify-url should have a mode to select which of the two behaviors to use.

LiberalArtist commented 5 years ago

(Apparently I neglected to paste the example above ...)

#lang racket

(require net/url-string
         rackunit)

(define (path-element->url elem)
   (url "https" #f "example.com" #f #t
        (list (path/param "a" null)
              (path/param elem null)
              (path/param "b" null))
        null #f))

(for ([pr '(["" . "https://example.com/a//b"]
            ["." . "https://example.com/a/%2e/b"]
            [same . "https://example.com/a/./b"]
            [".." . "https://example.com/a/%2e%2e/b"]
            [up . "https://example.com/a/../b"])])
  (match-define (cons elem rendered) pr)
  (check-equal? (url->string (path-element->url elem))
                rendered))

(check-equal? (url->path (path-element->url ""))
              (url->path (path-element->url 'same)))

(for ([special '("." "..")])
  (check-exn #rx"bytes->path-element.*names a special element"
             (λ () (url->path (path-element->url special)))))
madkins23 commented 1 year ago

I fixed this by adding an extra string-arg element to the dispatch rule and then treating it as a dummy argument in the handler function. Pretty hacky and you may need multiple dispatch rules (my case is always going to have a trailing slash) but it may be simpler to implement than other suggestions.

soegaard commented 10 months ago

A note in the documentation on dispatch-rules might be a good idea.

soegaard commented 6 months ago

Just a quick note to say, that in many situations it is simpler to redirect to the canonical url.

With these dispatch rules

   [("blog" "" (string-arg) ...)                  do-blog]           ; /blog/
   [("blog")                                      redirect-to-blog]  ; /blog   -> /blog/
(define (redirect-to-blog req)
  (redirect-to "/blog/"))

Any requests to /blog will be redirected to /blog/.

If the blog contains references to stylesheets using relative urls then the canonical url works best.