Closed gregw closed 1 year ago
@joakime I'm not sure we have time to do the resource service this week, but I think you and I should work on this - perhaps I'll do the URI decoding side if you look at the resource service (and rewrite) assuming a canonically encoded path (which I think just means %2F
will remain encoded... hmmm perhaps also %25
so we can deal with %252F
)
Note the other thing we need to think about with regards to resource service, is should the mimetype stuff move from the servlet contexts to the core context - ie allowing each core context to have it's own mapping of extensions to mime type and from mimetype to encodings etc.
If we have core level MimeTypes, does that mean the Servlet level inherits the configuration from core, and then applies it's own MimeTypes on top of it (eg: from the web descriptor)?
We should decode all characters that do not result in a character in the URI reserved
set (aka: delimiters)
From https://datatracker.ietf.org/doc/html/rfc3986#section-2.2
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
Here's an example of using URI Template.
URI-Template: "/api/{branch}/{reference}"
Request: /api/fix%2Fbuffer-overflow/src%2Fmain%2Fjava%2Forg%2Fcompany%2FBuffers.java
Result:
"branch" : "fix%2Fbuffer-overflow"
"reference" : "src%2Fmain%2Fjava%2Forg%2Fcompany%2FBuffers.java"
The encoded form is maintained.
But what if the request is encoded oddly?
Example: all a
characters are encoded to %61
URI-Template: "/api/{branch}/{reference}"
Request: /%61pi/fix%2Fbuffer-overflow/src%2Fm%61in%2Fj%61v%61%2Forg%2Fcomp%61ny%2FBuffers.j%61v%61
Result:
"branch" : "fix%2Fbuffer-overflow"
"reference" : "src%2Fmain%2Fjava%2Forg%2Fcompany%2FBuffers.java"
This means Jetty decodes the request (all pct-encodings that do not result in a reserved character) before handling, the uri-template matching behaves the same.
Some equivalent URI examples.
Are these equivalent?
Example: 1
URI A - /api/foo?y=z
URI B - /api/foo%3Fy=z
Example: 2
/api/foo?y=z?x=y
/api/foo?y=z%3Fx=y
Per https://datatracker.ietf.org/doc/html/rfc3986#section-7.3
7.3 - Back-End Transcoding
Special care should be taken when the URI path interpretation process
involves the use of a back-end file system or related system
functions. File systems typically assign an operational meaning to
special characters, such as the "/", "\", ":", "[", and "]"
characters, and to special device names like ".", "..", "...", "aux",
"lpt", etc. In some cases, merely testing for the existence of such
a name will cause the operating system to pause or invoke unrelated
system calls, leading to significant security concerns regarding
denial of service and unintended data transfer. It would be
impossible for this specification to list all such significant
characters and device names. Implementers should research the
reserved names and characters for the types of storage device that
may be attached to their applications and restrict the use of data
obtained from URI components accordingly.
and https://datatracker.ietf.org/doc/html/rfc3986#section-6.2.2.2
6.2.2.2 - Percent-Encoding Normalization
The percent-encoding mechanism (Section 2.1) is a frequent source of
variance among otherwise identical URIs. In addition to the case
normalization issue noted above, some URI producers percent-encode
octets that do not require percent-encoding, resulting in URIs that
are equivalent to their non-encoded counterparts. These URIs should
be normalized by decoding any percent-encoded octet that corresponds
to an unreserved character, as described in Section 2.3.
We should leave the reserved characters alone when decoding.
I see the following as a sane approach.
We have tiers of information on the request URI
Resulting in accessing the URI in the following APIs.
String Request.getRawURI() // the as-sent URI in the request (no decoding, changes, etc)
HttpURI Request.getDecodedURI() // Performs Unreserved Decode as part of parse (entire URI, all parts before parsing, scheme, host, userinfo, path, query, etc)
String HttpURI.getPath(); // decoded path - only unreserved decoded (no path normalization / path walking)
String HttpURI.getNormalizedPath(); // normalized path ("//", "/../", etc removed)
URI HttpURI.getURI(); // unreserved decoded, not-path-normalized, version
When comparing 2 HttpURI instances, we should support 2 modes. Unreserved decoded form against Unreserved decoded form. (use of "/a/b/../c" and "/a/c" should not be equal) Path normalized against Path normalized. (use of "/a/b/../c" and "/a/c" should be equal)
Normalization is optional behavior in most cases, used for context-path / PathMapping matching, but beyond that the non-normalized path should be seen.
I would expect the following to be true.
HttpURI uriA = new HttpURI("/a/b/../c");
URI uriB = URI.create("/z/../a/c");
assertTrue(uriA.getURI().normalize(), uriB.normalize());
@joakime I'm not exactly on the same page as you... maybe it is naming.
Firstly I don't think we need new Request methods, rather we need to better define the existing methods on HttpURI that is available from the request. We have gone to a lot of effort to ensure that the same information is not available from more than 1 method in request (where possible) to ensure that wrapping is as simple as possible. Hence we need to be very hesitant about adding new methods.
But more importantly, I don't think we can have a "decoded" URI, because if we fully decode we lose information and can have a non-valid API. We can have a "canonical" or "normalized" URI that decodes everything that is safe to decode and perhaps removes dot segments. But it is still a URI and thus is still can have encoded characters.
We can then have the same concept for the path and the query strings (modulo the question of if the canonical path should have all the reserved chars encoded or just the chars specific to paths).
So currently I'm thinking we need to:
HttpURI.getCanonicalURI()
that gives the whole URI in canonical form (not sure if this is a string or another HttpURI?)HttpURI.getCanonicalPath()
that gives the canonical path. perhaps a pathQuery version of this as well.HttpURI.getDecodedPath()
method, as it gives ambiguous results. If apps want the fully decoded path, they can decode the canonical string themselves.Request.getPathInContext()
method to work on the canonical path rather than the decoded path.HttpURI.getCanonicalQuery()
I think "canonical" is better than "normal" but I could probably be convinced the other way.
Per RFC naming ...
HttpURI.getCanonicalURI()
would be defined as full-uri normalization (which is the combination of "6.2.2 : Syntax-Based Normalization", "6.2.2.1 : Case Normalization", and "6.2.2.2 : Percent-Encoding Normalization").HttpURI.getCanonicalPath()
is just "URI -> full-uri normalization -> R.path" - no need for a separate method IMOHttpURI.getDecodedPath()
is "URI -> R.path -> Percent-Encoding Normalization"Request.getPathInContext()
is outside of the scope of the RFC naming. :nerd_face: Interestingly enough we already have String URIUtil.getCanonicalPath(String)
and String URIUtil.getDecodedPath(String)
, with the former doing just dot segment removal and the later doing full % decoding.
They are called moderately frequently, from numerous locations. Thus if we change them, we will need to review the callers in detail.... but then I think we need to do that also.
I would like to change to: getCanonicalPath
will do dot segment removal, safe % decoding and case normalization (%2f to %2F) ; getDecodedPath
will still do full % decoding. But I'll need to carefully consider all the callers first.
However, we need to consider what is safe % decoding. I think reserved chars might be too much:
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
As this would mean that a URI like http://host/foo,bar
would need to have the ,
actually encoded by getCanonicalURI()
, else we would still have different strings representing the same URI.
The RFC says (emphasis is mine):
When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded, as otherwise the data may be mistaken for component delimiters.
So we only need to protect the characters specific to our scheme. I think a good test is that we should only leave encoded characters that we would not otherwise have to encode to ensure a single canonical version of any string. So for example a URI like http://host/foo,[]=+*:@bar
can have a canonical path of //foo,[]=+*:@
without the need for any of those characters to be encoded. I think it is only /
, ?
, '#and
%that we need to protect for the sake of URIs, but I think we have to add
;` because of servlet path parameter processing. If we left just those characters encoded (with their encoded hex capitalized), then we should have a single canonical path for any URI that can be directly used to build a single canonical URI without any further encoding.
The key thing to remember, is that no matter what, this string is a URI and thus can not safely be passed to anything like File(String)
or even Path(String)
. When looking for a static resource, we will always need to go via something like Paths.get(URI)
with the URI being built from our resource base (as a URI) appended with the canonicalPath. Having a good resource cache is going to be important!
As this would mean that a URI like
http://host/foo,bar
would need to have the,
actually encoded bygetCanonicalURI()
, else we would still have different strings representing the same URI.
That's not what it means.
It means that http://host/foo,bar
is not equivalent to http://host/foo%2cbar
.
Those reference different resources, in all cases, client terms (think cache matching), proxy terms, and server terms. (covered in the RFC).
The canonicalURI method should not encode those reserved characters. In fact it I strongly believe that it should never encode characters. If you pass in nonsense, the HttpURI should reject it as an invalid URI. The canonicalURI method should simply not decode any encoded character that would resolve to a reserved character.
Here's a demo.
package uri;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.stream.Stream;
import org.eclipse.jetty.http.HttpURI;
import org.eclipse.jetty.util.TypeUtil;
import org.eclipse.jetty.util.Utf8StringBuilder;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;
import org.junit.jupiter.params.provider.ValueSource;
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertNotEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;
public class RFC3986UriNormalizationTest
{
public static Stream<Arguments> rfcUrlNormalizedEquivalent()
{
return Stream.of(
Arguments.of("HTTP://WEBTIDE.COM/foo", "http://webtide.com/foo"),
Arguments.of("HTTP://WEBTIDE.COM/foo/b%61r", "http://webtide.com/foo/bar"),
Arguments.of("HTTP://W%45BTIDE.COM/foo/b%61r", "http://webtide.com/foo/bar"),
Arguments.of("HTTP://W%45BTIDE.COM/foo/b%61r/../zed", "http://webtide.com/foo/zed")
);
}
@ParameterizedTest
@MethodSource("rfcUrlNormalizedEquivalent")
public void testRfcNormalizedEquivalent(String rawUriA, String rawUriB)
{
URI a = newRFC3986NormalizedUri(rawUriA);
URI b = newRFC3986NormalizedUri(rawUriB);
assertEquals(a, b);
}
public static Stream<Arguments> rfcUrlEquivalent()
{
return Stream.of(
Arguments.of("HTTP://WEBTIDE.COM/foo", "http://webtide.com/foo"),
Arguments.of("HTTP://WEBTIDE.COM/foo/b%61r", "http://webtide.com/foo/bar"),
Arguments.of("HTTP://W%45BTIDE.COM/foo/b%61r", "http://webtide.com/foo/bar")
);
}
@ParameterizedTest
@MethodSource("rfcUrlEquivalent")
public void testRfcDecodedEquivalent(String rawUriA, String rawUriB)
{
URI a = newRFC3986Uri(rawUriA);
URI b = newRFC3986Uri(rawUriB);
assertEquals(a, b, "Equivalent check");
}
public static Stream<Arguments> rfcUrlDifferent()
{
return Stream.of(
Arguments.of("http://host/foo,bar", "http://host/foo%2cbar"), // comma is a reserved character
Arguments.of("http://host/foo%2fbar", "http://host/foo/bar") // slash is a reserved character
);
}
@ParameterizedTest
@MethodSource("rfcUrlDifferent")
public void testRfcDecodedDifferent(String rawUriA, String rawUriB)
{
URI a = newRFC3986NormalizedUri(rawUriA);
URI b = newRFC3986NormalizedUri(rawUriB);
assertNotEquals(a, b, "Different check");
}
public static Stream<String> badUriExamples()
{
return Stream.of(
"/a b/",
"http://foo/a b/",
"ht%74p://foo/a/",
"http://foo/a%A0%A1/", // invalid 2 octet UTF-8 sequence
"a%01b://authority",
"://authority",
"http:"
);
}
@ParameterizedTest
@MethodSource("badUriExamples")
public void testJavaUriNonsense(String rawUri)
{
assertThrows(URISyntaxException.class, () -> new URI(rawUri));
}
@ParameterizedTest
@MethodSource("badUriExamples")
public void testJettyUriNonsense(String rawUri)
{
assertThrows(Exception.class, () -> {
// These URI examples should result in an error (somewhere. either constructor, or via a HttpURI.throwIfInvalid() style)
// Currently doesn't do that.
HttpURI httpUri = new HttpURI(rawUri);
});
}
private static URI newRFC3986NormalizedUri(String rawuri)
{
return newRFC3986Uri(rawuri).normalize();
}
private static URI newRFC3986Uri(String rawuri)
{
return URI.create(decodeUnreserved(rawuri));
}
private static String decodeUnreserved(String str)
{
if ((str == null) || (str.length() == 0))
return null;
int idx = str.indexOf('%');
if (idx == -1)
{
// no hits
return str;
}
int len = str.length();
Utf8StringBuilder ret = new Utf8StringBuilder(len);
ret.append(str, 0, idx);
for (int i = idx; i < len; i++)
{
char c = str.charAt(i);
if (c == '%')
{
if ((i + 2) < len)
{
char u = str.charAt(i + 1);
char l = str.charAt(i + 2);
char result = (char)(0xff & (TypeUtil.convertHexDigit(u) * 16 + TypeUtil.convertHexDigit(l)));
if (isReservedChar(result))
{
ret.append('%');
}
else
{
ret.append(result);
i += 2;
}
}
else
{
throw new IllegalArgumentException("Bad URI % encoding");
}
}
else
{
ret.append(c);
}
}
return ret.toString();
}
private static boolean isReservedChar(char c)
{
/*
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
*/
return (c == ':') || (c == '/') || (c == '?') || (c == '#') || (c == '[') || (c == ']') || (c == '@')
|| (c == '!') || (c == '$') || (c == '&') || (c == '\'') || (c == '(') || (c == ')')
|| (c == '*') || (c == '+') || (c == ',') || (c == ';') || (c == '=');
}
}
@joakime I don't think you have interpreted the RFC correctly. Specifically file:/tmp/foo%2cbar
and file:/tmp/foo,bar
are indeed the same resource.
From RFC 3986 2.4 When to Encode or Decode:
When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded, as otherwise the data may be mistaken for component delimiters.
The character ,
is not significant to the scheme-specific dereferencing for neither http
nor file
, so by the RFC it can be safely decoded and will not be mistaken for a component delimiter.
This is different to file:/tmp/foo?bar
and file:/tmp/foo%3Fbar
because the ?
is significant as a component delimiter and when given as ?
means the start of the query string, but if given as %3F
then the ?
is part of the path segment.
Doing the same logic with ,
and %2c
means that if the URI contains just ,
then it is part of the segment as it is not a delimiter, and when given as %2c
it's decoded value is also part of the path segment.
This is illustrated by the following two URIs that hit the same resource on github:
This is further illustrated by the following code:
@Test
public void testURI() throws Exception
{
URI encoded = new URI("file:/tmp/foo%2cbar");
System.err.printf("encoded=%s%n", encoded);
System.err.printf("Paths.get(encoded)=%s%n", Paths.get(encoded));
System.err.printf("new File(encoded)=%s%n", new File(encoded));
URI decoded = new URI("file:/tmp/foo,bar");
System.err.printf("decoded=%s%n", decoded);
System.err.printf("Paths.get(decoded)=%s%n", Paths.get(decoded));
System.err.printf("new File(decoded)=%s%n", new File(decoded));
File file = new File("/tmp/foo,bar");
if (file.exists())
file.delete();
OutputStream out = new FileOutputStream(file);
out.write("Hello World\n".getBytes(StandardCharsets.UTF_8));
out.close();
BufferedReader in = new BufferedReader(new FileReader(new File(encoded)));
System.err.println(in.readLine());
}
That produces the output:
encoded=file:/tmp/foo%2cbar
Paths.get(encoded)=/tmp/foo,bar
new File(encoded)=/tmp/foo,bar
decoded=file:/tmp/foo,bar
Paths.get(decoded)=/tmp/foo,bar
new File(decoded)=/tmp/foo,bar
Hello World
Note the JVM has decoded the %2C
to a ,
even before the file exists, so there is no canonical/real filename processing happening.
We cannot leave %2C
encoded in our canonical URI handling, else we will have two string paths referencing the same resource and thus our path will not be canonical.
I believe that in a canonical URI, that we need leave only %
, /
, ?
, #
, and ;
of the separated characters encoded because there is no possibility of any of those characters being included in a URI as anything other than separators, without being encoded. This means that we will never need to encode any characters in order to obtain a canonical URI.
However, doing some more experiments with the URI class, I can see that they are not exactly consistent with the RFC with regards to what decoded characters they will accept in a URI path. I've generated the following list of characters that the JVM URI class will accept as part of the path:
00: file:/foo bar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foo bar
01: file:/foobar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foobar
...
1f: file:/foobar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foobar
20: file:/foo bar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foo bar
21: file:/foo!bar -> encode=false : /foo!bar
22: file:/foo"bar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foo"bar
23: file:/foo#bar -> encode=true : /foo
24: file:/foo$bar -> encode=false : /foo$bar
25: file:/foo%bar -> encode=false : /foo%bar
26: file:/foo&bar -> encode=false : /foo&bar
27: file:/foo'bar -> encode=false : /foo'bar
28: file:/foo(bar -> encode=false : /foo(bar
29: file:/foo)bar -> encode=false : /foo)bar
2a: file:/foo*bar -> encode=false : /foo*bar
2b: file:/foo+bar -> encode=false : /foo+bar
2c: file:/foo,bar -> encode=false : /foo,bar
2d: file:/foo-bar -> encode=false : /foo-bar
2e: file:/foo.bar -> encode=false : /foo.bar
2f: file:/foo/bar -> encode=false : /foo/bar
30: file:/foo0bar -> encode=false : /foo0bar
31: file:/foo1bar -> encode=false : /foo1bar
32: file:/foo2bar -> encode=false : /foo2bar
33: file:/foo3bar -> encode=false : /foo3bar
34: file:/foo4bar -> encode=false : /foo4bar
35: file:/foo5bar -> encode=false : /foo5bar
36: file:/foo6bar -> encode=false : /foo6bar
37: file:/foo7bar -> encode=false : /foo7bar
38: file:/foo8bar -> encode=false : /foo8bar
39: file:/foo9bar -> encode=false : /foo9bar
3a: file:/foo:bar -> encode=false : /foo:bar
3b: file:/foo;bar -> encode=false : /foo;bar
3c: file:/foo<bar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foo<bar
3d: file:/foo=bar -> encode=false : /foo=bar
3e: file:/foo>bar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foo>bar
3f: file:/foo?bar -> encode=true : /foo
40: file:/foo@bar -> encode=false : /foo@bar
41: file:/fooAbar -> encode=false : /fooAbar
42: file:/fooBbar -> encode=false : /fooBbar
43: file:/fooCbar -> encode=false : /fooCbar
44: file:/fooDbar -> encode=false : /fooDbar
45: file:/fooEbar -> encode=false : /fooEbar
46: file:/fooFbar -> encode=false : /fooFbar
47: file:/fooGbar -> encode=false : /fooGbar
48: file:/fooHbar -> encode=false : /fooHbar
49: file:/fooIbar -> encode=false : /fooIbar
4a: file:/fooJbar -> encode=false : /fooJbar
4b: file:/fooKbar -> encode=false : /fooKbar
4c: file:/fooLbar -> encode=false : /fooLbar
4d: file:/fooMbar -> encode=false : /fooMbar
4e: file:/fooNbar -> encode=false : /fooNbar
4f: file:/fooObar -> encode=false : /fooObar
50: file:/fooPbar -> encode=false : /fooPbar
51: file:/fooQbar -> encode=false : /fooQbar
52: file:/fooRbar -> encode=false : /fooRbar
53: file:/fooSbar -> encode=false : /fooSbar
54: file:/fooTbar -> encode=false : /fooTbar
55: file:/fooUbar -> encode=false : /fooUbar
56: file:/fooVbar -> encode=false : /fooVbar
57: file:/fooWbar -> encode=false : /fooWbar
58: file:/fooXbar -> encode=false : /fooXbar
59: file:/fooYbar -> encode=false : /fooYbar
5a: file:/fooZbar -> encode=false : /fooZbar
5b: file:/foo[bar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foo[bar
5c: file:/foo\bar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foo\bar
5d: file:/foo]bar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foo]bar
5e: file:/foo^bar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foo^bar
5f: file:/foo_bar -> encode=false : /foo_bar
60: file:/foo`bar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foo`bar
61: file:/fooabar -> encode=false : /fooabar
62: file:/foobbar -> encode=false : /foobbar
63: file:/foocbar -> encode=false : /foocbar
64: file:/foodbar -> encode=false : /foodbar
65: file:/fooebar -> encode=false : /fooebar
66: file:/foofbar -> encode=false : /foofbar
67: file:/foogbar -> encode=false : /foogbar
68: file:/foohbar -> encode=false : /foohbar
69: file:/fooibar -> encode=false : /fooibar
6a: file:/foojbar -> encode=false : /foojbar
6b: file:/fookbar -> encode=false : /fookbar
6c: file:/foolbar -> encode=false : /foolbar
6d: file:/foombar -> encode=false : /foombar
6e: file:/foonbar -> encode=false : /foonbar
6f: file:/fooobar -> encode=false : /fooobar
70: file:/foopbar -> encode=false : /foopbar
71: file:/fooqbar -> encode=false : /fooqbar
72: file:/foorbar -> encode=false : /foorbar
73: file:/foosbar -> encode=false : /foosbar
74: file:/footbar -> encode=false : /footbar
75: file:/fooubar -> encode=false : /fooubar
76: file:/foovbar -> encode=false : /foovbar
77: file:/foowbar -> encode=false : /foowbar
78: file:/fooxbar -> encode=false : /fooxbar
79: file:/fooybar -> encode=false : /fooybar
7a: file:/foozbar -> encode=false : /foozbar
7b: file:/foo{bar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foo{bar
7c: file:/foo|bar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foo|bar
7d: file:/foo}bar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foo}bar
7e: file:/foo~bar -> encode=false : /foo~bar
7f: file:/foobar -> encode=true : java.net.URISyntaxException: Illegal character in path at index 9: file:/foobar
Which indicates that we need to leave encoded the following characters:
00: java.net.URISyntaxException: Illegal character in path at index 9: file:/foo bar
01: java.net.URISyntaxException: Illegal character in path at index 9: file:/foobar
...
1f: java.net.URISyntaxException: Illegal character in path at index 9: file:/foobar
20: java.net.URISyntaxException: Illegal character in path at index 9: file:/foo bar
22: java.net.URISyntaxException: Illegal character in path at index 9: file:/foo"bar
23: file:/foo#bar
3c: java.net.URISyntaxException: Illegal character in path at index 9: file:/foo<bar
3e: java.net.URISyntaxException: Illegal character in path at index 9: file:/foo>bar
3f: file:/foo?bar
5b: java.net.URISyntaxException: Illegal character in path at index 9: file:/foo[bar
5c: java.net.URISyntaxException: Illegal character in path at index 9: file:/foo\bar
5d: java.net.URISyntaxException: Illegal character in path at index 9: file:/foo]bar
5e: java.net.URISyntaxException: Illegal character in path at index 9: file:/foo^bar
60: java.net.URISyntaxException: Illegal character in path at index 9: file:/foo`bar
7b: java.net.URISyntaxException: Illegal character in path at index 9: file:/foo{bar
7c: java.net.URISyntaxException: Illegal character in path at index 9: file:/foo|bar
7d: java.net.URISyntaxException: Illegal character in path at index 9: file:/foo}bar
7f: java.net.URISyntaxException: Illegal character in path at index 9: file:/foobar
The other characters are accepted in URI paths, so either we need to force them to be always decoded or safely decode them knowing that the URI class will handle them OK.
From RFC 3986 2.4 When to Encode or Decode:
When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded, as otherwise the data may be mistaken for component delimiters.
The character
,
is not significant to the scheme-specific dereferencing for neitherhttp
norfile
, so by the RFC it can be safely decoded and will not be mistaken for a component delimiter. This is different tofile:/tmp/foo?bar
andfile:/tmp/foo%3Fbar
because the?
is significant as a component delimiter and when given as?
means the start of the query string, but if given as%3F
then the?
is part of the path segment. Doing the same logic with,
and%2c
means that if the URI contains just,
then it is part of the segment as it is not a delimiter, and when given as%2c
it's decoded value is also part of the path segment.
"scheme-specific" is important in that statement. The IANA registered schemes - https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml The schemes "http" and "https" are registered at https://www.rfc-editor.org/rfc/rfc8615.html
Lets look at https://www.rfc-editor.org/rfc/rfc8615.html#section-3
This specification updates the "http" [RFC7230] and "https" [RFC7230] schemes to support well-known URIs. Other existing schemes can use the appropriate process for updating their definitions; for example, [RFC8307] does so for the "ws" and "wss" schemes. The "Uniform Resource Identifier (URI) Schemes" registry tracks which schemes support well-known URIs; see Section 5.2.
So it defers to RFC7230, which has a section about the Scheme "http" and "https" registration at at https://www.rfc-editor.org/rfc/rfc7230#section-8.2
Which points to https://www.rfc-editor.org/rfc/rfc7230#section-2.7.1 - http URI Scheme
Which then covers how to deal with path, normalization, and equivalency at https://www.rfc-editor.org/rfc/rfc7230#section-2.7.3
The scheme and host are case-insensitive and normally provided in lowercase; all other components are compared in a case-sensitive manner. Characters other than those in the "reserved" set are equivalent to their percent-encoded octets: the normal form is to not encode them (see Sections 2.1 and 2.2 of [RFC3986]).
Now back to RFC3986 and encode/decode ...
Section 2.4 is a generic statement about when to encode/decode. Section 2.3 is a generic statement about what to encode/decode. Section 6 is a more specific section on how to decode.
Reserved characters are golden (per RFC3986 Section 2.3, and RFC3986 Section 6, and RFC7230: Section 2.7.3). Jetty (as the container) should never encode/decode them, for any reason. Only the application can make the call on decoding them (or not).
Take this for example ...
These are not equivalent, and don't point to the same resource.
w3c access normal slash
$ curl -v --http1.1 http://www.w3.org/Addressing/URL/uri-spec.html | head 25
head: cannot open '25' for reading: No such file or directory
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying 128.30.52.100:80...
* TCP_NODELAY set
* Connected to www.w3.org (128.30.52.100) port 80 (#0)
> GET /Addressing/URL/uri-spec.html HTTP/1.1
> Host: www.w3.org
> User-Agent: curl/7.68.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< date: Fri, 11 Mar 2022 17:45:50 GMT
< last-modified: Wed, 01 Feb 1995 20:02:14 GMT
< etag: "e96e-2d004c9973d80"
< accept-ranges: bytes
< content-length: 59758
< cache-control: max-age=21600
< expires: Fri, 11 Mar 2022 23:45:50 GMT
< vary: Accept-Encoding,upgrade-insecure-requests
< content-type: text/html; charset=utf-8
< x-backend: www-mirrors
< x-request-id: 622b8ace8c755d8e
<
{ [1058 bytes data]
* Failed writing body (142 != 1448)
9 59758 9 5402 0 0 45779 0 0:00:01 --:--:-- 0:00:01 45779
* Closing connection 0
curl: (23) Failed writing body (142 != 1448)
w3c access escaped reserved slash
$ curl -v --http1.1 http://www.w3.org/Addressing/URL%2Furi-spec.html | head 25
head: cannot open '25' for reading: No such file or directory
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying 128.30.52.100:80...
* TCP_NODELAY set
* Connected to www.w3.org (128.30.52.100) port 80 (#0)
> GET /Addressing/URL%2Furi-spec.html HTTP/1.1
> Host: www.w3.org
> User-Agent: curl/7.68.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 404 Not Found
< date: Fri, 11 Mar 2022 17:46:45 GMT
< vary: upgrade-insecure-requests
< content-length: 6715
< content-type: text/html;charset=utf-8
< x-backend: www-mirrors
< x-request-id: 622b8b05bab835d7
<
{ [1235 bytes data]
* Failed writing body (1413 != 1448)
61 6715 61 4131 0 0 29091 0 --:--:-- --:--:-- --:--:-- 29091
* Closing connection 0
curl: (23) Failed writing body (1413 != 1448)
These are also not equivalent.
apache access normal slash
$ $ curl -v --http1.1 https://downloads.apache.org//httpd/mod_ftp/mod_ftp-0.9.6-beta.tar.gz.sha1
* Trying 135.181.214.104:443...
* TCP_NODELAY set
* Connected to downloads.apache.org (135.181.214.104) port 443 (#0)
..(snip TLS debug)..
* SSL certificate verify ok.
> GET //httpd/mod_ftp/mod_ftp-0.9.6-beta.tar.gz.sha1 HTTP/1.1
> Host: downloads.apache.org
> User-Agent: curl/7.68.0
> Accept: */*
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Date: Fri, 11 Mar 2022 17:39:09 GMT
< Server: Apache
< Last-Modified: Mon, 06 Jul 2020 14:16:58 GMT
< ETag: "44-5a9c68711ec55"
< Accept-Ranges: bytes
< Content-Length: 68
< Access-Control-Allow-Origin: *
< Content-Type: text/plain
<
365fb2eaa72c3e9e3d6a35485cdf685e1b460428 mod_ftp-0.9.6-beta.tar.gz
* Connection #0 to host downloads.apache.org left intact
apache access escaped reserved slash
$ curl -v --http1.1 https://downloads.apache.org//httpd/mod_ftp%2Fmod_ftp-0.9.6-beta.tar.gz.sha1
* Trying 135.181.214.104:443...
* TCP_NODELAY set
* Connected to downloads.apache.org (135.181.214.104) port 443 (#0)
..(snip TLS debug)..
* SSL certificate verify ok.
> GET //httpd/mod_ftp%2Fmod_ftp-0.9.6-beta.tar.gz.sha1 HTTP/1.1
> Host: downloads.apache.org
> User-Agent: curl/7.68.0
> Accept: */*
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Mark bundle as not supporting multiuse
< HTTP/1.1 404 Not Found
< Date: Fri, 11 Mar 2022 17:47:58 GMT
< Server: Apache
< Content-Length: 196
< Content-Type: text/html; charset=iso-8859-1
<
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL was not found on this server.</p>
</body></html>
* Connection #0 to host downloads.apache.org left intact
These are not equivalent, and don't point to the same resource.
Exactly! %2F
is meaningful in both http:
and file:
schemes. But %2C
is meaningful in neither.
So we are definitely going to leave %2F
encoded in the canonical uri/path
But I believe that we should decode %2C
in the canonical uri/path, else we create an alias with two URIs referencing the same resource.
The problem comes with how do we systematically determine which characters we should leave encoded. I think the "system" is that we should leave encoded:
/
, ?
, #
because they are separators that we need to delimit the path in a URI%
because we are encoded and thus need to escape the encoding character;
because the servlet specification forces us to deal with path parameters, even though they are no longer formally defined.<= 0x20
, [
, ]
and a few others) because we need to be able to pass our canonical path to the URI class in order to safely access a File via the Path API.It is the last of these that lacks rigour, as I can't see any RFC reason why the URI class accepts :
, ;
, ,
as decoded characters in a path, but not ^
, |
etc. It may be that we need to determine that character set at runtime by testing the URI class (including for non ascii characters since we receive the URI as a string). This set might even differ between JVMs, which would suck.
So alternately, we can come up with our own set of characters that is a super set of what the URI class does not accept, but that is more closely aligned to the RFC, but then we will need to actually encode some characters when making a canonical URI. So if for example we include ,
in our superset, then we will have to encode that to %2C
in our canonical path, so that we never have 2 URIs for the same resource.
Don't forget \
, i.e. %5C
, as in http://host/windows%5Cpath
.
@sbordet that doesn't need to be encoded for the sake of interpreting the URI. It is not ambiguous in a URI path as \
is not significant. It is significant on a windows files style, which is why we should not do new File(base + pathInContext)
, but must instead go via the URI class.
When handling a raw URI, it should be handled in layers.
This is the defined list all the way up at the URI spec (no scheme specifics)
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
These reserved characters are used to separate the uri-subcomponents.
:
uri-subcomponent authority (IP-literal separator, end of host, start of port), end of scheme/
uri-subcomponent indicating start of authority uri-subcomponent, and start of path uri-subcomponent, and separator of path segments.?
uri-subcomponent start of query section#
uri-subcomponent start of fragment[
uri-subcomponent authority IP-literal start (start of host)]
uri-subcomponent authority IP-literal end (end of host)@
uri-subcomponent authority end of userinfo, start of hostThese specific separators are concerned about the first level breakdown of the uri into uri-subcomponents. There spec has language on how to find these general delims (pay attention to grouping and what sub-component you are in currently). No decoding of pct-encoded characters should occur yet at this level.
Example scheme://user%3Ainfo@host/path
, if the userinfo needs to have a :
, then it needs to be encoded.
eg:
:
literal can exist in the host
portion of the authority
as part of an IP-Literal, and in the path, query, and fragment portions as a literal colon.
This ':' is especially tricky (and called out in the various specs) when dealing with relative URI references (eg: this:that
vs ./this:that
)/
literal can exist in the path
portion, and is a sub-delim for path-segment. when seen in other uri-subcomponents it is a literal slash (not a delim).These are characters that have meaning within a specific uri-subcomponent. The rules here, as defined by RFC
Char | sub-component(s) | meaning | example |
---|---|---|---|
! |
path | scheme specific mark character | n/a |
$ |
query | authority/userinfo use is literal, query | n/a |
& |
query | parameter separator | foo://auth/zed?a=b&c=d |
' |
path | scheme specific mark character | n/a |
( |
path | scheme specific mark character | n/a |
) |
path | scheme specific mark character | n/a |
* |
path | scheme specific mark character | n/a |
+ |
scheme & query | in scheme we don't care it's not http or file, in query is a space, everywhere is a plus sign | a+b://auth/?a=b+2 |
, |
path | parameter start and separator | scheme://authority/name,1.1 |
; |
path | parameter start | scheme://authority/name;v=1.1 |
= |
path & query | parameter assignment | /name;v=1.1 and /name?a=b |
The sub-delims designated as "scheme specific mark character" (!
, '
, (
, )
, and *
) are reserved all the way
up at the URI spec level, and have meaning for various scheme specific specs.
Since we are talking about Jetty and more specifically the HttpURI
class we are maintaining, lets focus on the following scheme specific behavior.
http
and https
are defined by RFC8615ws
and wss
are defined by RFC8307 - which delegates to RFC7230 http/https for scheme specific behavior / definition.The schemes I would expect HttpURI
to support are http
, https
, ws
, and wss
.
Example: For the purposes of our HttpURI
class, the sub-delims designated as "scheme specific mark character" (!
, '
, (
, )
, and *
)
have no special meaning with http
, https
, ws
, and wss
schemes, and encoded form is equivalent to decoded form, as they
have no delimitor behavior in these schemes.
As far as path parameters are concerned, they are poorly defined at best, and removed entirely in reality.
I don't think Jetty should attempt to process them.
It might complicate path matching if we don't support at least removing the path params before matching.
This means ;
(semi-colon), and ,
(comma) in the path uri-subcomponent will need to be handled before matching.
We are HTTP, and I think we should not attempt to support path params. Apache Httpd doesn't support them as path params (at least for serving static file content). Example: https://joakim.erdfelt.com/jetty/reserved/raw;path=params/hello.txt You can follow this path (navigating directory listings) from https://joakim.erdfelt.com/jetty/reserved/
The HTTP scheme specific path delegates to the URI spec when it comes to path handling.
;
and ,
as opaque at the URI generic level and only meaningful for scheme specific behaviors.;
as the separator between the name of the segment and the params, but delegates to the scheme specific behavior for how
to handle them.;
, but not ,
So there was a period of time, from June 1995 through to Aug 1998 where path params in http could have been a thing, but it was ambiguous, as URI said to defer to scheme specific, and the "http" scheme specific delegated back to the URI spec. Neither spec wanting to define it properly. By Jan 2005, the path params at URI level was totally dropped.
But RFC7230: Section 2.7.3. http and https URI Normalization and Comparison
says that "http" and "https" conform to the URI generic syntax of RFC3986. Which means what now?
(Generic URI syntax points to scheme specific behavior, and scheme specific points back to generic URI syntax) (ugh)
file
)Our HttpURI
class shouldn't attempt to handle non-web schemes (some examples of non-web are git://
, ldap://
, ssh://
, ftp://
, file://
, etc).
Those scheme specific rules are outside of the scope of our efforts in HttpURI
, and URIUtil
.
The URIUtil
class should have scheme specific behaviors for encode/decode, as the rules for http
and file
are different.
This get a bit easier when we move from Resource
to Path
, as I would expect to lean heavily on the Path
classes to get the encoding/decoding
right for FileSystem specific behavior correct (we no longer encode/decode before using Path
, like we currently do for Resource
).
I strong believe that the HttpURI
class shouldn't attempt to handle file://
scheme URIs.
file
scheme is defined by RFC8089 - and refers to local filesystem behavior on how to handle the path
uri-subcomponentIn Java the validation, encoding, and decoding of a file
scheme URI is determined by the FileSystem object.
Building a file
URI via String manipulation is actually the wrong way to do things, as you are not taking all FileSystem specific behaviors into account.
The file
scheme spec has some interesting support, eg: it supports authority, and also allows for path parameters, for file-systems that support it
(eg: ones that have metadata components on directories, or change access behavior based on provided metadata)
For Java, the file
scheme supports authority (for referencing filesystem shares), but not path parameters in the ;
or ,
form.
Java instead limits path/file metadata/attributes access to use of the java.nio.file.attribute
techniques, which is a subset of all possible attributes a path/file can have.
In your examples of URI
and File in comment https://github.com/eclipse/jetty.project/issues/7713#issuecomment-1065308336 are interesting.
But doesn't take into account the FileSystem behavior.
The encoding/decoding is FileSystem specific.
java.nio.file.Path
delegates to sun.nio.fs.UnixUriUtils
java.nio.file.Path
delegates to sun.nio.fs.WindowsUriSupport
java.nio.file.Path
delegates to MaxOSXFile*
imples, which delegates to BsdFile*
impl, which delegates to UnixFile*
impls. (depending on specific FileSystem configuration on OSX)Each of these have rules based on FileSytem behavior (utf normalization, special characters, authority sections, specially named files con
/aux
/etc.)
I found a corner case with this issue.
request.getPathInContext()
and it's use in ResourceService
and Resource.resolve(String)
.
Lets say we have a context path of /context
, and we receive a GET request on ...
GET /context/swedish-%C3%A5.txt HTTP/1.1
The current implementation will return ...
HttpURI.getPath : "/context/swedish-%C3%A5.txt"
HttpURI.getCanonicalPath : "/context/swedish-%C3%A5.txt"
HttpURI.getDecodedPath : "/context/swedish-å.txt"
Request.getContextInPath : "swedish-å.txt"
But our ResourceService
and Resource.resolve(String)
expects encoded Strings (aka, what was given to us in the request).
So can we have a new Request.getRawContextInPath()
?
HttpURI.getPath : "/context/swedish-%C3%A5.txt"
HttpURI.getCanonicalPath : "/context/swedish-%C3%A5.txt"
HttpURI.getDecodedPath : "/context/swedish-å.txt"
Request.getContextInPath : "swedish-å.txt"
Request.getRawContextInPath : "swedish-%C3%A5.txt"
This Request.getRawContextInPath
would need to be smarter than just returning the raw, as-was-submitted, path from the Request line.
It's almost as if we need a URIUtil.decodeUriSafeCharacters(String)
which decodes all encoded characters EXCEPT:
Things like path parameters should still be processed.
Ah ha! I found the 2 bugs in URIUtil that was causing this oddity.
Stay tuned for PR.
The pathInContext is encoded... but only canonically encoded, i.e. it encodes only characters like %2F.
I expect the problem is that URI is even more picky about what characters and perhaps we should either leave them encoded or encode again before calling URI. %20 is one of these characters that is badly handled by URI.
Let's chat before you do anything too radical in a PR.
On Thu, 11 Aug 2022 at 05:37, Joakim Erdfelt @.***> wrote:
Ah ha! I found the 2 bugs in URIUtil that was causing this oddity.
Stay tuned for PR.
— Reply to this email directly, view it on GitHub https://github.com/eclipse/jetty.project/issues/7713#issuecomment-1211181952, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAARJLKYBOY7YCKAQ3EZS43VYQAF5ANCNFSM5QIXLANQ . You are receiving this because you were assigned.Message ID: @.***>
-- Greg Wilkins @.***> CTO http://webtide.com
Let's chat before you do anything too radical in a PR.
Yes, lets chat.
I fixed those 2 bugs already.
And the %u###
to utf-8 conversion.
I think we've answered most of these initial questions. Closing this one.
We can open new issues for more specific cases as they come up.
There has always been an issue with URI handling in the servlet URI and in jetty that we had a choice between working with:
%2F
encodings or encoded.
or..
segmentsFor Jetty-12, I think we should follow the servlet-6.0 example and work with canonically encoded paths - i.e. paths that are substantially decoded, but that leave %2F characters encoded. This would, for example, allow a resource service to correctly serve a file that has a
/
in it's name rather than as a segment separator.The issue here, is that we need to trace the path all the way through to resource services and default servlets, so that no subsequent handling (after security constraints) decodes the %2F and treats it like a file separator... so that filenames like
/foo/bar%2F..%2F..%2F../etc/passwd
do not become a problem. Note that we will keep our URI compliance mechanisms so servers can reject all such ambiguous URIs and will only allow them into the application if so configured.