jetty / jetty.project

Eclipse Jetty® - Web Container & Clients - supports HTTP/2, HTTP/1.1, HTTP/1.0, websocket, servlets, and more

https://eclipse.dev/jetty

Other

3.86k stars 1.91k forks source link

Jetty-12 URI encoding #7713

Closed gregw closed 1 year ago

gregw commented 2 years ago

There has always been an issue with URI handling in the servlet URI and in jetty that we had a choice between working with:

encoded URIs - which have too many variations of the equivalent URI (eg encoding characters that do not need to be encoded)
decoded paths, that work fine for most cases, but lose information if the original URI contained %2F encodings or encoded . or .. segments

For Jetty-12, I think we should follow the servlet-6.0 example and work with canonically encoded paths - i.e. paths that are substantially decoded, but that leave %2F characters encoded. This would, for example, allow a resource service to correctly serve a file that has a / in it's name rather than as a segment separator.

The issue here, is that we need to trace the path all the way through to resource services and default servlets, so that no subsequent handling (after security constraints) decodes the %2F and treats it like a file separator... so that filenames like /foo/bar%2F..%2F..%2F../etc/passwd do not become a problem. Note that we will keep our URI compliance mechanisms so servers can reject all such ambiguous URIs and will only allow them into the application if so configured.

gregw commented 2 years ago

@joakime I'm not sure we have time to do the resource service this week, but I think you and I should work on this - perhaps I'll do the URI decoding side if you look at the resource service (and rewrite) assuming a canonically encoded path (which I think just means %2F will remain encoded... hmmm perhaps also %25 so we can deal with %252F)

gregw commented 2 years ago

Note the other thing we need to think about with regards to resource service, is should the mimetype stuff move from the servlet contexts to the core context - ie allowing each core context to have it's own mapping of extensions to mime type and from mimetype to encodings etc.

joakime commented 2 years ago

If we have core level MimeTypes, does that mean the Servlet level inherits the configuration from core, and then applies it's own MimeTypes on top of it (eg: from the web descriptor)?

joakime commented 2 years ago

We should decode all characters that do not result in a character in the URI reserved set (aka: delimiters)

From https://datatracker.ietf.org/doc/html/rfc3986#section-2.2

      reserved    = gen-delims / sub-delims

      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

Here's an example of using URI Template.

URI-Template: "/api/{branch}/{reference}"
Request: /api/fix%2Fbuffer-overflow/src%2Fmain%2Fjava%2Forg%2Fcompany%2FBuffers.java
Result:
  "branch"    : "fix%2Fbuffer-overflow"
  "reference" : "src%2Fmain%2Fjava%2Forg%2Fcompany%2FBuffers.java"

The encoded form is maintained.

But what if the request is encoded oddly?

Example: all a characters are encoded to %61

URI-Template: "/api/{branch}/{reference}"
Request: /%61pi/fix%2Fbuffer-overflow/src%2Fm%61in%2Fj%61v%61%2Forg%2Fcomp%61ny%2FBuffers.j%61v%61
Result:
  "branch"    : "fix%2Fbuffer-overflow"
  "reference" : "src%2Fmain%2Fjava%2Forg%2Fcompany%2FBuffers.java"

This means Jetty decodes the request (all pct-encodings that do not result in a reserved character) before handling, the uri-template matching behaves the same.

joakime commented 2 years ago

Some equivalent URI examples.

Are these equivalent?

Example: 1

URI A - /api/foo?y=z
URI B - /api/foo%3Fy=z

Example: 2

/api/foo?y=z?x=y
/api/foo?y=z%3Fx=y

joakime commented 2 years ago

Per https://datatracker.ietf.org/doc/html/rfc3986#section-7.3

7.3 - Back-End Transcoding

   Special care should be taken when the URI path interpretation process
   involves the use of a back-end file system or related system
   functions.  File systems typically assign an operational meaning to
   special characters, such as the "/", "\", ":", "[", and "]"
   characters, and to special device names like ".", "..", "...", "aux",
   "lpt", etc.  In some cases, merely testing for the existence of such
   a name will cause the operating system to pause or invoke unrelated
   system calls, leading to significant security concerns regarding
   denial of service and unintended data transfer.  It would be
   impossible for this specification to list all such significant
   characters and device names.  Implementers should research the
   reserved names and characters for the types of storage device that
   may be attached to their applications and restrict the use of data
   obtained from URI components accordingly.

and https://datatracker.ietf.org/doc/html/rfc3986#section-6.2.2.2

6.2.2.2 - Percent-Encoding Normalization

   The percent-encoding mechanism (Section 2.1) is a frequent source of
   variance among otherwise identical URIs.  In addition to the case
   normalization issue noted above, some URI producers percent-encode
   octets that do not require percent-encoding, resulting in URIs that
   are equivalent to their non-encoded counterparts.  These URIs should
   be normalized by decoding any percent-encoded octet that corresponds
   to an unreserved character, as described in Section 2.3.

We should leave the reserved characters alone when decoding.

joakime commented 2 years ago

I see the following as a sane approach.

We have tiers of information on the request URI

Raw URI - necessary for request signatures a. Decode Unreserved - safe decode, entire provided URI, all parts b. Split (scheme, userinfo, hostport, path, query, fragment) - URI / HttpURI object
- Path normalize - canonical path

Resulting in accessing the URI in the following APIs.

String Request.getRawURI() // the as-sent URI in the request (no decoding, changes, etc)
HttpURI Request.getDecodedURI() // Performs Unreserved Decode as part of parse (entire URI, all parts before parsing, scheme, host, userinfo, path, query, etc)
String HttpURI.getPath(); // decoded path - only unreserved decoded (no path normalization / path walking)
String HttpURI.getNormalizedPath(); // normalized path ("//", "/../", etc removed)
URI HttpURI.getURI(); // unreserved decoded, not-path-normalized, version

When comparing 2 HttpURI instances, we should support 2 modes. Unreserved decoded form against Unreserved decoded form. (use of "/a/b/../c" and "/a/c" should not be equal) Path normalized against Path normalized. (use of "/a/b/../c" and "/a/c" should be equal)

Normalization is optional behavior in most cases, used for context-path / PathMapping matching, but beyond that the non-normalized path should be seen.

I would expect the following to be true.

HttpURI uriA = new HttpURI("/a/b/../c");
URI uriB = URI.create("/z/../a/c");
assertTrue(uriA.getURI().normalize(), uriB.normalize());

gregw commented 2 years ago

@joakime I'm not exactly on the same page as you... maybe it is naming.

Firstly I don't think we need new Request methods, rather we need to better define the existing methods on HttpURI that is available from the request. We have gone to a lot of effort to ensure that the same information is not available from more than 1 method in request (where possible) to ensure that wrapping is as simple as possible. Hence we need to be very hesitant about adding new methods.

But more importantly, I don't think we can have a "decoded" URI, because if we fully decode we lose information and can have a non-valid API. We can have a "canonical" or "normalized" URI that decodes everything that is safe to decode and perhaps removes dot segments. But it is still a URI and thus is still can have encoded characters.

We can then have the same concept for the path and the query strings (modulo the question of if the canonical path should have all the reserved chars encoded or just the chars specific to paths).

So currently I'm thinking we need to:

add the method HttpURI.getCanonicalURI() that gives the whole URI in canonical form (not sure if this is a string or another HttpURI?)
add the method HttpURI.getCanonicalPath() that gives the canonical path. perhaps a pathQuery version of this as well.
deprecated the HttpURI.getDecodedPath() method, as it gives ambiguous results. If apps want the fully decoded path, they can decode the canonical string themselves.
Update the Request.getPathInContext() method to work on the canonical path rather than the decoded path.
Perhaps have the same thing for query.... although mostly we do not use the query string as a string, so it is less of a problem, but no harm having a HttpURI.getCanonicalQuery()

I think "canonical" is better than "normal" but I could probably be convinced the other way.

joakime commented 2 years ago

Per RFC naming ...

HttpURI.getCanonicalURI() would be defined as full-uri normalization (which is the combination of "6.2.2 : Syntax-Based Normalization", "6.2.2.1 : Case Normalization", and "6.2.2.2 : Percent-Encoding Normalization").
HttpURI.getCanonicalPath() is just "URI -> full-uri normalization -> R.path" - no need for a separate method IMO
HttpURI.getDecodedPath() is "URI -> R.path -> Percent-Encoding Normalization"
Request.getPathInContext() is outside of the scope of the RFC naming. :nerd_face:

gregw commented 2 years ago

Interestingly enough we already have String URIUtil.getCanonicalPath(String) and String URIUtil.getDecodedPath(String), with the former doing just dot segment removal and the later doing full % decoding. They are called moderately frequently, from numerous locations. Thus if we change them, we will need to review the callers in detail.... but then I think we need to do that also.

I would like to change to: getCanonicalPath will do dot segment removal, safe % decoding and case normalization (%2f to %2F) ; getDecodedPath will still do full % decoding. But I'll need to carefully consider all the callers first.

However, we need to consider what is safe % decoding. I think reserved chars might be too much:

      reserved    = gen-delims / sub-delims
      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

As this would mean that a URI like http://host/foo,bar would need to have the , actually encoded by getCanonicalURI(), else we would still have different strings representing the same URI.

The RFC says (emphasis is mine):

When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded, as otherwise the data may be mistaken for component delimiters.

So we only need to protect the characters specific to our scheme. I think a good test is that we should only leave encoded characters that we would not otherwise have to encode to ensure a single canonical version of any string. So for example a URI like http://host/foo,[]=+*:@bar can have a canonical path of //foo,[]=+*:@ without the need for any of those characters to be encoded. I think it is only /, ?, '#and%that we need to protect for the sake of URIs, but I think we have to add;` because of servlet path parameter processing. If we left just those characters encoded (with their encoded hex capitalized), then we should have a single canonical path for any URI that can be directly used to build a single canonical URI without any further encoding.

The key thing to remember, is that no matter what, this string is a URI and thus can not safely be passed to anything like File(String) or even Path(String). When looking for a static resource, we will always need to go via something like Paths.get(URI) with the URI being built from our resource base (as a URI) appended with the canonicalPath. Having a good resource cache is going to be important!

joakime commented 2 years ago

As this would mean that a URI like http://host/foo,bar would need to have the , actually encoded by getCanonicalURI(), else we would still have different strings representing the same URI.

That's not what it means.

It means that http://host/foo,bar is not equivalent to http://host/foo%2cbar. Those reference different resources, in all cases, client terms (think cache matching), proxy terms, and server terms. (covered in the RFC).

The canonicalURI method should not encode those reserved characters. In fact it I strongly believe that it should never encode characters. If you pass in nonsense, the HttpURI should reject it as an invalid URI. The canonicalURI method should simply not decode any encoded character that would resolve to a reserved character.

Here's a demo.

package uri;

import java.net.URI;
import java.net.URISyntaxException;
import java.util.stream.Stream;

import org.eclipse.jetty.http.HttpURI;
import org.eclipse.jetty.util.TypeUtil;
import org.eclipse.jetty.util.Utf8StringBuilder;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;
import org.junit.jupiter.params.provider.ValueSource;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertNotEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;

public class RFC3986UriNormalizationTest
{
    public static Stream<Arguments> rfcUrlNormalizedEquivalent()
    {
        return Stream.of(
            Arguments.of("HTTP://WEBTIDE.COM/foo", "http://webtide.com/foo"),
            Arguments.of("HTTP://WEBTIDE.COM/foo/b%61r", "http://webtide.com/foo/bar"),
            Arguments.of("HTTP://W%45BTIDE.COM/foo/b%61r", "http://webtide.com/foo/bar"),
            Arguments.of("HTTP://W%45BTIDE.COM/foo/b%61r/../zed", "http://webtide.com/foo/zed")
        );
    }

    @ParameterizedTest
    @MethodSource("rfcUrlNormalizedEquivalent")
    public void testRfcNormalizedEquivalent(String rawUriA, String rawUriB)
    {
        URI a = newRFC3986NormalizedUri(rawUriA);
        URI b = newRFC3986NormalizedUri(rawUriB);
        assertEquals(a, b);
    }

    public static Stream<Arguments> rfcUrlEquivalent()
    {
        return Stream.of(
            Arguments.of("HTTP://WEBTIDE.COM/foo", "http://webtide.com/foo"),
            Arguments.of("HTTP://WEBTIDE.COM/foo/b%61r", "http://webtide.com/foo/bar"),
            Arguments.of("HTTP://W%45BTIDE.COM/foo/b%61r", "http://webtide.com/foo/bar")
        );
    }

    @ParameterizedTest
    @MethodSource("rfcUrlEquivalent")
    public void testRfcDecodedEquivalent(String rawUriA, String rawUriB)
    {
        URI a = newRFC3986Uri(rawUriA);
        URI b = newRFC3986Uri(rawUriB);
        assertEquals(a, b, "Equivalent check");
    }

    public static Stream<Arguments> rfcUrlDifferent()
    {
        return Stream.of(
            Arguments.of("http://host/foo,bar", "http://host/foo%2cbar"), // comma is a reserved character
            Arguments.of("http://host/foo%2fbar", "http://host/foo/bar") // slash is a reserved character
        );
    }

    @ParameterizedTest
    @MethodSource("rfcUrlDifferent")
    public void testRfcDecodedDifferent(String rawUriA, String rawUriB)
    {
        URI a = newRFC3986NormalizedUri(rawUriA);
        URI b = newRFC3986NormalizedUri(rawUriB);
        assertNotEquals(a, b, "Different check");
    }

    public static Stream<String> badUriExamples()
    {
        return Stream.of(
            "/a b/",
            "http://foo/a b/",
            "ht%74p://foo/a/",
            "http://foo/a%A0%A1/", // invalid 2 octet UTF-8 sequence
            "a%01b://authority",
            "://authority",
            "http:"
        );
    }

    @ParameterizedTest
    @MethodSource("badUriExamples")
    public void testJavaUriNonsense(String rawUri)
    {
        assertThrows(URISyntaxException.class, () -> new URI(rawUri));
    }

    @ParameterizedTest
    @MethodSource("badUriExamples")
    public void testJettyUriNonsense(String rawUri)
    {
        assertThrows(Exception.class, () -> {
            // These URI examples should result in an error (somewhere. either constructor, or via a HttpURI.throwIfInvalid() style)
            // Currently doesn't do that.
            HttpURI httpUri = new HttpURI(rawUri);
        });
    }

    private static URI newRFC3986NormalizedUri(String rawuri)
    {
        return newRFC3986Uri(rawuri).normalize();
    }

    private static URI newRFC3986Uri(String rawuri)
    {
        return URI.create(decodeUnreserved(rawuri));
    }

    private static String decodeUnreserved(String str)
    {
        if ((str == null) || (str.length() == 0))
            return null;

        int idx = str.indexOf('%');
        if (idx == -1)
        {
            // no hits
            return str;
        }

        int len = str.length();
        Utf8StringBuilder ret = new Utf8StringBuilder(len);
        ret.append(str, 0, idx);

        for (int i = idx; i < len; i++)
        {
            char c = str.charAt(i);
            if (c == '%')
            {
                if ((i + 2) < len)
                {
                    char u = str.charAt(i + 1);
                    char l = str.charAt(i + 2);
                    char result = (char)(0xff & (TypeUtil.convertHexDigit(u) * 16 + TypeUtil.convertHexDigit(l)));
                    if (isReservedChar(result))
                    {
                        ret.append('%');
                    }
                    else
                    {
                        ret.append(result);
                        i += 2;
                    }
                }
                else
                {
                    throw new IllegalArgumentException("Bad URI % encoding");
                }
            }
            else
            {
                ret.append(c);
            }
        }
        return ret.toString();
    }

    private static boolean isReservedChar(char c)
    {
        /*
        reserved    = gen-delims / sub-delims
        gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
        sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                     / "*" / "+" / "," / ";" / "="
        */
        return (c == ':') || (c == '/') || (c == '?') || (c == '#') || (c == '[') || (c == ']') || (c == '@')
            || (c == '!') || (c == '$') || (c == '&') || (c == '\'') || (c == '(') || (c == ')')
            || (c == '*') || (c == '+') || (c == ',') || (c == ';') || (c == '=');
    }
}

gregw commented 2 years ago

@joakime I don't think you have interpreted the RFC correctly. Specifically file:/tmp/foo%2cbar and file:/tmp/foo,bar are indeed the same resource.
From RFC 3986 2.4 When to Encode or Decode:

When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded, as otherwise the data may be mistaken for component delimiters.

The character , is not significant to the scheme-specific dereferencing for neither http nor file, so by the RFC it can be safely decoded and will not be mistaken for a component delimiter. This is different to file:/tmp/foo?bar and file:/tmp/foo%3Fbar because the ? is significant as a component delimiter and when given as ? means the start of the query string, but if given as %3F then the ? is part of the path segment. Doing the same logic with , and %2c means that if the URI contains just , then it is part of the segment as it is not a delimiter, and when given as %2c it's decoded value is also part of the path segment.

This is illustrated by the following two URIs that hit the same resource on github:

This is further illustrated by the following code:

    @Test
    public void testURI() throws Exception
    {
        URI encoded = new URI("file:/tmp/foo%2cbar");
        System.err.printf("encoded=%s%n", encoded);
        System.err.printf("Paths.get(encoded)=%s%n", Paths.get(encoded));
        System.err.printf("new File(encoded)=%s%n", new File(encoded));

        URI decoded = new URI("file:/tmp/foo,bar");
        System.err.printf("decoded=%s%n", decoded);
        System.err.printf("Paths.get(decoded)=%s%n", Paths.get(decoded));
        System.err.printf("new File(decoded)=%s%n", new File(decoded));

        File file = new File("/tmp/foo,bar");
        if (file.exists())
            file.delete();
        OutputStream out = new FileOutputStream(file);
        out.write("Hello World\n".getBytes(StandardCharsets.UTF_8));
        out.close();

        BufferedReader in = new BufferedReader(new FileReader(new File(encoded)));
        System.err.println(in.readLine());
    }

That produces the output:

encoded=file:/tmp/foo%2cbar
Paths.get(encoded)=/tmp/foo,bar
new File(encoded)=/tmp/foo,bar
decoded=file:/tmp/foo,bar
Paths.get(decoded)=/tmp/foo,bar
new File(decoded)=/tmp/foo,bar
Hello World

Note the JVM has decoded the %2C to a , even before the file exists, so there is no canonical/real filename processing happening. We cannot leave %2C encoded in our canonical URI handling, else we will have two string paths referencing the same resource and thus our path will not be canonical.

I believe that in a canonical URI, that we need leave only %, /, ?, #, and ; of the separated characters encoded because there is no possibility of any of those characters being included in a URI as anything other than separators, without being encoded. This means that we will never need to encode any characters in order to obtain a canonical URI.

However, doing some more experiments with the URI class, I can see that they are not exactly consistent with the RFC with regards to what decoded characters they will accept in a URI path. I've generated the following list of characters that the JVM URI class will accept as part of the path:

00: file:/foo bar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foo bar
01: file:/foobar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foobar
...
1f: file:/foobar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foobar
20: file:/foo bar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foo bar
21: file:/foo!bar -> encode=false : /foo!bar
22: file:/foo"bar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foo"bar
23: file:/foo#bar -> encode=true : /foo
24: file:/foo$bar -> encode=false : /foo$bar
25: file:/foo%bar -> encode=false : /foo%bar
26: file:/foo&bar -> encode=false : /foo&bar
27: file:/foo'bar -> encode=false : /foo'bar
28: file:/foo(bar -> encode=false : /foo(bar
29: file:/foo)bar -> encode=false : /foo)bar
2a: file:/foo*bar -> encode=false : /foo*bar
2b: file:/foo+bar -> encode=false : /foo+bar
2c: file:/foo,bar -> encode=false : /foo,bar
2d: file:/foo-bar -> encode=false : /foo-bar
2e: file:/foo.bar -> encode=false : /foo.bar
2f: file:/foo/bar -> encode=false : /foo/bar
30: file:/foo0bar -> encode=false : /foo0bar
31: file:/foo1bar -> encode=false : /foo1bar
32: file:/foo2bar -> encode=false : /foo2bar
33: file:/foo3bar -> encode=false : /foo3bar
34: file:/foo4bar -> encode=false : /foo4bar
35: file:/foo5bar -> encode=false : /foo5bar
36: file:/foo6bar -> encode=false : /foo6bar
37: file:/foo7bar -> encode=false : /foo7bar
38: file:/foo8bar -> encode=false : /foo8bar
39: file:/foo9bar -> encode=false : /foo9bar
3a: file:/foo:bar -> encode=false : /foo:bar
3b: file:/foo;bar -> encode=false : /foo;bar
3c: file:/foo<bar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foo<bar
3d: file:/foo=bar -> encode=false : /foo=bar
3e: file:/foo>bar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foo>bar
3f: file:/foo?bar -> encode=true : /foo
40: file:/foo@bar -> encode=false : /foo@bar
41: file:/fooAbar -> encode=false : /fooAbar
42: file:/fooBbar -> encode=false : /fooBbar
43: file:/fooCbar -> encode=false : /fooCbar
44: file:/fooDbar -> encode=false : /fooDbar
45: file:/fooEbar -> encode=false : /fooEbar
46: file:/fooFbar -> encode=false : /fooFbar
47: file:/fooGbar -> encode=false : /fooGbar
48: file:/fooHbar -> encode=false : /fooHbar
49: file:/fooIbar -> encode=false : /fooIbar
4a: file:/fooJbar -> encode=false : /fooJbar
4b: file:/fooKbar -> encode=false : /fooKbar
4c: file:/fooLbar -> encode=false : /fooLbar
4d: file:/fooMbar -> encode=false : /fooMbar
4e: file:/fooNbar -> encode=false : /fooNbar
4f: file:/fooObar -> encode=false : /fooObar
50: file:/fooPbar -> encode=false : /fooPbar
51: file:/fooQbar -> encode=false : /fooQbar
52: file:/fooRbar -> encode=false : /fooRbar
53: file:/fooSbar -> encode=false : /fooSbar
54: file:/fooTbar -> encode=false : /fooTbar
55: file:/fooUbar -> encode=false : /fooUbar
56: file:/fooVbar -> encode=false : /fooVbar
57: file:/fooWbar -> encode=false : /fooWbar
58: file:/fooXbar -> encode=false : /fooXbar
59: file:/fooYbar -> encode=false : /fooYbar
5a: file:/fooZbar -> encode=false : /fooZbar
5b: file:/foo[bar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foo[bar
5c: file:/foo\bar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foo\bar
5d: file:/foo]bar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foo]bar
5e: file:/foo^bar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foo^bar
5f: file:/foo_bar -> encode=false : /foo_bar
60: file:/foo`bar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foo`bar
61: file:/fooabar -> encode=false : /fooabar
62: file:/foobbar -> encode=false : /foobbar
63: file:/foocbar -> encode=false : /foocbar
64: file:/foodbar -> encode=false : /foodbar
65: file:/fooebar -> encode=false : /fooebar
66: file:/foofbar -> encode=false : /foofbar
67: file:/foogbar -> encode=false : /foogbar
68: file:/foohbar -> encode=false : /foohbar
69: file:/fooibar -> encode=false : /fooibar
6a: file:/foojbar -> encode=false : /foojbar
6b: file:/fookbar -> encode=false : /fookbar
6c: file:/foolbar -> encode=false : /foolbar
6d: file:/foombar -> encode=false : /foombar
6e: file:/foonbar -> encode=false : /foonbar
6f: file:/fooobar -> encode=false : /fooobar
70: file:/foopbar -> encode=false : /foopbar
71: file:/fooqbar -> encode=false : /fooqbar
72: file:/foorbar -> encode=false : /foorbar
73: file:/foosbar -> encode=false : /foosbar
74: file:/footbar -> encode=false : /footbar
75: file:/fooubar -> encode=false : /fooubar
76: file:/foovbar -> encode=false : /foovbar
77: file:/foowbar -> encode=false : /foowbar
78: file:/fooxbar -> encode=false : /fooxbar
79: file:/fooybar -> encode=false : /fooybar
7a: file:/foozbar -> encode=false : /foozbar
7b: file:/foo{bar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foo{bar
7c: file:/foo|bar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foo|bar
7d: file:/foo}bar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foo}bar
7e: file:/foo~bar -> encode=false : /foo~bar
7f: file:/foobar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foobar

Which indicates that we need to leave encoded the following characters:

00: java.net.URISyntaxException: Illegal character in path at index 9: file:/foo bar
01: java.net.URISyntaxException: Illegal character in path at index 9: file:/foobar
...
1f: java.net.URISyntaxException: Illegal character in path at index 9: file:/foobar
20: java.net.URISyntaxException: Illegal character in path at index 9: file:/foo bar
22: java.net.URISyntaxException: Illegal character in path at index 9: file:/foo"bar
23: file:/foo#bar
3c: java.net.URISyntaxException: Illegal character in path at index 9: file:/foo<bar
3e: java.net.URISyntaxException: Illegal character in path at index 9: file:/foo>bar
3f: file:/foo?bar
5b: java.net.URISyntaxException: Illegal character in path at index 9: file:/foo[bar
5c: java.net.URISyntaxException: Illegal character in path at index 9: file:/foo\bar
5d: java.net.URISyntaxException: Illegal character in path at index 9: file:/foo]bar
5e: java.net.URISyntaxException: Illegal character in path at index 9: file:/foo^bar
60: java.net.URISyntaxException: Illegal character in path at index 9: file:/foo`bar
7b: java.net.URISyntaxException: Illegal character in path at index 9: file:/foo{bar
7c: java.net.URISyntaxException: Illegal character in path at index 9: file:/foo|bar
7d: java.net.URISyntaxException: Illegal character in path at index 9: file:/foo}bar
7f: java.net.URISyntaxException: Illegal character in path at index 9: file:/foobar

The other characters are accepted in URI paths, so either we need to force them to be always decoded or safely decode them knowing that the URI class will handle them OK.

joakime commented 2 years ago

From RFC 3986 2.4 When to Encode or Decode:

When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded, as otherwise the data may be mistaken for component delimiters.

The character , is not significant to the scheme-specific dereferencing for neither http nor file, so by the RFC it can be safely decoded and will not be mistaken for a component delimiter. This is different to file:/tmp/foo?bar and file:/tmp/foo%3Fbar because the ? is significant as a component delimiter and when given as ? means the start of the query string, but if given as %3F then the ? is part of the path segment. Doing the same logic with , and %2c means that if the URI contains just , then it is part of the segment as it is not a delimiter, and when given as %2c it's decoded value is also part of the path segment.

"scheme-specific" is important in that statement. The IANA registered schemes - https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml The schemes "http" and "https" are registered at https://www.rfc-editor.org/rfc/rfc8615.html

Lets look at https://www.rfc-editor.org/rfc/rfc8615.html#section-3

This specification updates the "http" [RFC7230] and "https" [RFC7230] schemes to support well-known URIs. Other existing schemes can use the appropriate process for updating their definitions; for example, [RFC8307] does so for the "ws" and "wss" schemes. The "Uniform Resource Identifier (URI) Schemes" registry tracks which schemes support well-known URIs; see Section 5.2.

So it defers to RFC7230, which has a section about the Scheme "http" and "https" registration at at https://www.rfc-editor.org/rfc/rfc7230#section-8.2

Which points to https://www.rfc-editor.org/rfc/rfc7230#section-2.7.1 - http URI Scheme

Which then covers how to deal with path, normalization, and equivalency at https://www.rfc-editor.org/rfc/rfc7230#section-2.7.3

The scheme and host are case-insensitive and normally provided in lowercase; all other components are compared in a case-sensitive manner. Characters other than those in the "reserved" set are equivalent to their percent-encoded octets: the normal form is to not encode them (see Sections 2.1 and 2.2 of [RFC3986]).

Now back to RFC3986 and encode/decode ...

Section 2.4 is a generic statement about when to encode/decode. Section 2.3 is a generic statement about what to encode/decode. Section 6 is a more specific section on how to decode.

Reserved characters are golden (per RFC3986 Section 2.3, and RFC3986 Section 6, and RFC7230: Section 2.7.3). Jetty (as the container) should never encode/decode them, for any reason. Only the application can make the call on decoding them (or not).

Take this for example ...

These are not equivalent, and don't point to the same resource.

w3c access normal slash

$ curl -v --http1.1 http://www.w3.org/Addressing/URL/uri-spec.html | head 25
head: cannot open '25' for reading: No such file or directory
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 128.30.52.100:80...
* TCP_NODELAY set
* Connected to www.w3.org (128.30.52.100) port 80 (#0)
> GET /Addressing/URL/uri-spec.html HTTP/1.1
> Host: www.w3.org
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< date: Fri, 11 Mar 2022 17:45:50 GMT
< last-modified: Wed, 01 Feb 1995 20:02:14 GMT
< etag: "e96e-2d004c9973d80"
< accept-ranges: bytes
< content-length: 59758
< cache-control: max-age=21600
< expires: Fri, 11 Mar 2022 23:45:50 GMT
< vary: Accept-Encoding,upgrade-insecure-requests
< content-type: text/html; charset=utf-8
< x-backend: www-mirrors
< x-request-id: 622b8ace8c755d8e
< 
{ [1058 bytes data]
* Failed writing body (142 != 1448)
  9 59758    9  5402    0     0  45779      0  0:00:01 --:--:--  0:00:01 45779
* Closing connection 0
curl: (23) Failed writing body (142 != 1448)

w3c access escaped reserved slash

$ curl -v --http1.1 http://www.w3.org/Addressing/URL%2Furi-spec.html | head 25
head: cannot open '25' for reading: No such file or directory
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 128.30.52.100:80...
* TCP_NODELAY set
* Connected to www.w3.org (128.30.52.100) port 80 (#0)
> GET /Addressing/URL%2Furi-spec.html HTTP/1.1
> Host: www.w3.org
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 404 Not Found
< date: Fri, 11 Mar 2022 17:46:45 GMT
< vary: upgrade-insecure-requests
< content-length: 6715
< content-type: text/html;charset=utf-8
< x-backend: www-mirrors
< x-request-id: 622b8b05bab835d7
< 
{ [1235 bytes data]
* Failed writing body (1413 != 1448)
 61  6715   61  4131    0     0  29091      0 --:--:-- --:--:-- --:--:-- 29091
* Closing connection 0
curl: (23) Failed writing body (1413 != 1448)

These are also not equivalent.

apache access normal slash

$ $ curl -v --http1.1 https://downloads.apache.org//httpd/mod_ftp/mod_ftp-0.9.6-beta.tar.gz.sha1
*   Trying 135.181.214.104:443...
* TCP_NODELAY set
* Connected to downloads.apache.org (135.181.214.104) port 443 (#0)
..(snip TLS debug)..
*  SSL certificate verify ok.
> GET //httpd/mod_ftp/mod_ftp-0.9.6-beta.tar.gz.sha1 HTTP/1.1
> Host: downloads.apache.org
> User-Agent: curl/7.68.0
> Accept: */*
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Date: Fri, 11 Mar 2022 17:39:09 GMT
< Server: Apache
< Last-Modified: Mon, 06 Jul 2020 14:16:58 GMT
< ETag: "44-5a9c68711ec55"
< Accept-Ranges: bytes
< Content-Length: 68
< Access-Control-Allow-Origin: *
< Content-Type: text/plain
< 
365fb2eaa72c3e9e3d6a35485cdf685e1b460428  mod_ftp-0.9.6-beta.tar.gz
* Connection #0 to host downloads.apache.org left intact

apache access escaped reserved slash

$ curl -v --http1.1 https://downloads.apache.org//httpd/mod_ftp%2Fmod_ftp-0.9.6-beta.tar.gz.sha1
*   Trying 135.181.214.104:443...
* TCP_NODELAY set
* Connected to downloads.apache.org (135.181.214.104) port 443 (#0)
..(snip TLS debug)..
*  SSL certificate verify ok.
> GET //httpd/mod_ftp%2Fmod_ftp-0.9.6-beta.tar.gz.sha1 HTTP/1.1
> Host: downloads.apache.org
> User-Agent: curl/7.68.0
> Accept: */*
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Mark bundle as not supporting multiuse
< HTTP/1.1 404 Not Found
< Date: Fri, 11 Mar 2022 17:47:58 GMT
< Server: Apache
< Content-Length: 196
< Content-Type: text/html; charset=iso-8859-1
< 
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL was not found on this server.</p>
</body></html>
* Connection #0 to host downloads.apache.org left intact

gregw commented 2 years ago

These are not equivalent, and don't point to the same resource.

https://www.w3.org/Addressing/URL/uri-spec.html

https://www.w3.org/Addressing/URL%2Furi-spec.html

Exactly! %2F is meaningful in both http: and file: schemes. But %2C is meaningful in neither.

So we are definitely going to leave %2F encoded in the canonical uri/path But I believe that we should decode %2C in the canonical uri/path, else we create an alias with two URIs referencing the same resource.

The problem comes with how do we systematically determine which characters we should leave encoded. I think the "system" is that we should leave encoded:

/, ?, # because they are separators that we need to delimit the path in a URI
% because we are encoded and thus need to escape the encoding character
; because the servlet specification forces us to deal with path parameters, even though they are no longer formally defined.
Any character not accepted in decoded form by the URI class (eg <= 0x20, [, ] and a few others) because we need to be able to pass our canonical path to the URI class in order to safely access a File via the Path API.

It is the last of these that lacks rigour, as I can't see any RFC reason why the URI class accepts :, ;, , as decoded characters in a path, but not ^, | etc. It may be that we need to determine that character set at runtime by testing the URI class (including for non ascii characters since we receive the URI as a string). This set might even differ between JVMs, which would suck.

So alternately, we can come up with our own set of characters that is a super set of what the URI class does not accept, but that is more closely aligned to the RFC, but then we will need to actually encode some characters when making a canonical URI. So if for example we include , in our superset, then we will have to encode that to %2C in our canonical path, so that we never have 2 URIs for the same resource.

sbordet commented 2 years ago

Don't forget \, i.e. %5C, as in http://host/windows%5Cpath.

gregw commented 2 years ago

@sbordet that doesn't need to be encoded for the sake of interpreting the URI. It is not ambiguous in a URI path as \ is not significant. It is significant on a windows files style, which is why we should not do new File(base + pathInContext), but must instead go via the URI class.

joakime commented 2 years ago

When handling a raw URI, it should be handled in layers.

Raw URI to uri-subcomponents (gen-delims)
uri-subcomponents split (sub-delims)
scheme specific behaviors in uri-subcomponents

URI reserved character basics

This is the defined list all the way up at the URI spec (no scheme specifics)

      reserved    = gen-delims / sub-delims
      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

gen-delims

These reserved characters are used to separate the uri-subcomponents.

: uri-subcomponent authority (IP-literal separator, end of host, start of port), end of scheme
/ uri-subcomponent indicating start of authority uri-subcomponent, and start of path uri-subcomponent, and separator of path segments.
? uri-subcomponent start of query section
# uri-subcomponent start of fragment
[ uri-subcomponent authority IP-literal start (start of host)
] uri-subcomponent authority IP-literal end (end of host)
@ uri-subcomponent authority end of userinfo, start of host

These specific separators are concerned about the first level breakdown of the uri into uri-subcomponents. There spec has language on how to find these general delims (pay attention to grouping and what sub-component you are in currently). No decoding of pct-encoded characters should occur yet at this level.

Example scheme://user%3Ainfo@host/path, if the userinfo needs to have a :, then it needs to be encoded.

eg:

The : literal can exist in the host portion of the authority as part of an IP-Literal, and in the path, query, and fragment portions as a literal colon. This ':' is especially tricky (and called out in the various specs) when dealing with relative URI references (eg: this:that vs ./this:that)
The / literal can exist in the path portion, and is a sub-delim for path-segment. when seen in other uri-subcomponents it is a literal slash (not a delim).
Each scheme-specific behavior calls out extra rules (eg: http has designated that within a query component, the characters ";", "/", "?", ":", "@", "&", "=", "+", ",", and "$" are reserved) - which means that when we build a query using Fields within HttpClient, we should be encoding all of those characters (if they appear in the key or value portions)

sub-delims

These are characters that have meaning within a specific uri-subcomponent. The rules here, as defined by RFC

Char	sub-component(s)	meaning	example
`!`	path	scheme specific mark character	n/a
`$`	query	authority/userinfo use is literal, query	n/a
`&`	query	parameter separator	`foo://auth/zed?a=b&c=d`
`'`	path	scheme specific mark character	n/a
`(`	path	scheme specific mark character	n/a
`)`	path	scheme specific mark character	n/a
`*`	path	scheme specific mark character	n/a
`+`	scheme & query	in scheme we don't care it's not http or file, in query is a space, everywhere is a plus sign	`a+b://auth/?a=b+2`
`,`	path	parameter start and separator	`scheme://authority/name,1.1`
`;`	path	parameter start	`scheme://authority/name;v=1.1`
`=`	path & query	parameter assignment	`/name;v=1.1` and `/name?a=b`

The sub-delims designated as "scheme specific mark character" (!, ', (, ), and *) are reserved all the way up at the URI spec level, and have meaning for various scheme specific specs.

Scope of work in Jetty

Supporting web schemes

Since we are talking about Jetty and more specifically the HttpURI class we are maintaining, lets focus on the following scheme specific behavior.

http and https are defined by RFC8615
ws and wss are defined by RFC8307 - which delegates to RFC7230 http/https for scheme specific behavior / definition.

The schemes I would expect HttpURI to support are http, https, ws, and wss. Example: For the purposes of our HttpURI class, the sub-delims designated as "scheme specific mark character" (!, ', (, ), and *) have no special meaning with http, https, ws, and wss schemes, and encoded form is equivalent to decoded form, as they have no delimitor behavior in these schemes.

As far as path parameters are concerned, they are poorly defined at best, and removed entirely in reality. I don't think Jetty should attempt to process them. It might complicate path matching if we don't support at least removing the path params before matching. This means ; (semi-colon), and , (comma) in the path uri-subcomponent will need to be handled before matching.

We are HTTP, and I think we should not attempt to support path params. Apache Httpd doesn't support them as path params (at least for serving static file content). Example: https://joakim.erdfelt.com/jetty/reserved/raw;path=params/hello.txt You can follow this path (navigating directory listings) from https://joakim.erdfelt.com/jetty/reserved/

The HTTP scheme specific path delegates to the URI spec when it comes to path handling.

The current URI spec (Jan 2005) RFC3986 has no ABNF about path parameters. But RFC3986: Section 3.3. Path has text explaining the history of path parameters, declaring ; and , as opaque at the URI generic level and only meaningful for scheme specific behaviors.
The prior URI spec (Aug 1998) RFC2396 still mentioned support for ; as the separator between the name of the segment and the params, but delegates to the scheme specific behavior for how to handle them.
The older URI spec (June 1995) RFC1808 included path segments supporting ;, but not ,
The even older URL spec (Dec 1994) RFC1738 even mentions http specific behavior that specifically does not have path parameters.

So there was a period of time, from June 1995 through to Aug 1998 where path params in http could have been a thing, but it was ambiguous, as URI said to defer to scheme specific, and the "http" scheme specific delegated back to the URI spec. Neither spec wanting to define it properly. By Jan 2005, the path params at URI level was totally dropped.

But RFC7230: Section 2.7.3. http and https URI Normalization and Comparison says that "http" and "https" conform to the URI generic syntax of RFC3986. Which means what now?
(Generic URI syntax points to scheme specific behavior, and scheme specific points back to generic URI syntax) (ugh)

Supporting non-web schemes (eg: `file`)

Our HttpURI class shouldn't attempt to handle non-web schemes (some examples of non-web are git://, ldap://, ssh://, ftp://, file://, etc). Those scheme specific rules are outside of the scope of our efforts in HttpURI, and URIUtil.

The URIUtil class should have scheme specific behaviors for encode/decode, as the rules for http and file are different. This get a bit easier when we move from Resource to Path, as I would expect to lean heavily on the Path classes to get the encoding/decoding right for FileSystem specific behavior correct (we no longer encode/decode before using Path, like we currently do for Resource).

I strong believe that the HttpURI class shouldn't attempt to handle file:// scheme URIs.

file scheme is defined by RFC8089 - and refers to local filesystem behavior on how to handle the path uri-subcomponent

In Java the validation, encoding, and decoding of a file scheme URI is determined by the FileSystem object. Building a file URI via String manipulation is actually the wrong way to do things, as you are not taking all FileSystem specific behaviors into account. The file scheme spec has some interesting support, eg: it supports authority, and also allows for path parameters, for file-systems that support it (eg: ones that have metadata components on directories, or change access behavior based on provided metadata) For Java, the file scheme supports authority (for referencing filesystem shares), but not path parameters in the ; or , form.
Java instead limits path/file metadata/attributes access to use of the java.nio.file.attribute techniques, which is a subset of all possible attributes a path/file can have.

In your examples of URI and File in comment https://github.com/eclipse/jetty.project/issues/7713#issuecomment-1065308336 are interesting. But doesn't take into account the FileSystem behavior.

The encoding/decoding is FileSystem specific.

Linux java.nio.file.Path delegates to sun.nio.fs.UnixUriUtils
Windows java.nio.file.Path delegates to sun.nio.fs.WindowsUriSupport
OSX java.nio.file.Path delegates to MaxOSXFile* imples, which delegates to BsdFile* impl, which delegates to UnixFile* impls. (depending on specific FileSystem configuration on OSX)

Each of these have rules based on FileSytem behavior (utf normalization, special characters, authority sections, specially named files con/aux/etc.)

joakime commented 2 years ago

I found a corner case with this issue.

request.getPathInContext() and it's use in ResourceService and Resource.resolve(String).

Lets say we have a context path of /context, and we receive a GET request on ...

GET /context/swedish-%C3%A5.txt HTTP/1.1

The current implementation will return ...

HttpURI.getPath          : "/context/swedish-%C3%A5.txt"
HttpURI.getCanonicalPath : "/context/swedish-%C3%A5.txt"
HttpURI.getDecodedPath   : "/context/swedish-å.txt"
Request.getContextInPath : "swedish-å.txt"

But our ResourceService and Resource.resolve(String) expects encoded Strings (aka, what was given to us in the request). So can we have a new Request.getRawContextInPath()?

HttpURI.getPath             : "/context/swedish-%C3%A5.txt"
HttpURI.getCanonicalPath    : "/context/swedish-%C3%A5.txt"
HttpURI.getDecodedPath      : "/context/swedish-å.txt"
Request.getContextInPath    : "swedish-å.txt"
Request.getRawContextInPath : "swedish-%C3%A5.txt"

This Request.getRawContextInPath would need to be smarter than just returning the raw, as-was-submitted, path from the Request line.

It's almost as if we need a URIUtil.decodeUriSafeCharacters(String) which decodes all encoded characters EXCEPT:

the URI spec "reserved" characters.
control characters (bytes 0x00 thru 0x1F, and 0x7F) - these should probably cause an exception
any character that's 8-bits or higher (like unicode)

Things like path parameters should still be processed.

joakime commented 2 years ago

Ah ha! I found the 2 bugs in URIUtil that was causing this oddity.

Stay tuned for PR.

gregw commented 2 years ago

The pathInContext is encoded... but only canonically encoded, i.e. it encodes only characters like %2F.

I expect the problem is that URI is even more picky about what characters and perhaps we should either leave them encoded or encode again before calling URI. %20 is one of these characters that is badly handled by URI.

Let's chat before you do anything too radical in a PR.

On Thu, 11 Aug 2022 at 05:37, Joakim Erdfelt @.***> wrote:

Ah ha! I found the 2 bugs in URIUtil that was causing this oddity.

Stay tuned for PR.

— Reply to this email directly, view it on GitHub https://github.com/eclipse/jetty.project/issues/7713#issuecomment-1211181952, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAARJLKYBOY7YCKAQ3EZS43VYQAF5ANCNFSM5QIXLANQ . You are receiving this because you were assigned.Message ID: @.***>

-- Greg Wilkins @.***> CTO http://webtide.com

joakime commented 2 years ago

Let's chat before you do anything too radical in a PR.

Yes, lets chat. I fixed those 2 bugs already. And the %u### to utf-8 conversion.

joakime commented 1 year ago

I think we've answered most of these initial questions. Closing this one.

We can open new issues for more specific cases as they come up.

jetty / jetty.project

Jetty-12 URI encoding #7713

URI reserved character basics

gen-delims

sub-delims

Scope of work in Jetty

Supporting web schemes

Supporting non-web schemes (eg: file)

Supporting non-web schemes (eg: `file`)