Clarify behaviour of getRequestURI(), getContextPath(), getServletPath() and getPathInfo()

jakartaee / servlet

Jakarta Servlet

https://eclipse.org/ee4j/servlet

Other

270 stars 87 forks source link

Clarify behaviour of getRequestURI(), getContextPath(), getServletPath() and getPathInfo() #18

Open glassfishrobot opened 13 years ago

glassfishrobot commented 13 years ago

The specification is unclear for some or all of the above methods how the following should be handled.

path parameters - should they be included in the returned values or not?
normalisation - should the returned values be pre- or -post any normalisation?
url decoding - should the returned values be decoded or include the escapes?
character encoding - if a value is decoded, what character encoding should be used?
welcome files - if a request maps to a welcome file, should it be included in the return values?

glassfishrobot commented 6 years ago

Issue Imported From: https://github.com/javaee/servlet-spec/issues/18
Original Issue Raised By:@glassfishrobot
Original Issue Assigned To: @glassfishrobot

glassfishrobot commented 13 years ago

@glassfishrobot Commented Reported by markt_asf

glassfishrobot commented 13 years ago

@glassfishrobot Commented markt_asf said: Add getRequestURL() to the methods to be reviewed

glassfishrobot commented 7 years ago

@glassfishrobot Commented This issue was imported from java.net JIRA SERVLET_SPEC-18

markt-asf commented 6 years ago

An additional point requiring clarification is that RFC 3986 (section 2.2) states that %nn encoded reserved characters are not equivalent to their decoded forms. i.e. "/foo/bar/" and "/foo%2Fbar" are not equivalent. This raises an additional question. When a %nn reserved character appears in any part of the URI should it be decoded or not in the return value of each of the above methods? I'm leaning towards a solution that leaves them in their %nn form and adds a utility method to the API that performs %nn decoding.

gregw commented 3 years ago

the servlet spec in section 3.1 says:

Path parameters that are part of a GET request (as defined by HTTP/1.1) are not exposed by these APIs. They must be parsed from the String values returned by the getRequestURI method or the getPathInfo method.

I think when those words were written URIs were defined by RFC2396 which well defined path parameters in section 3.3:

      path          = [ abs_path | opaque_part ]
      path_segments = segment *( "/" segment )
      segment       = *pchar *( ";" param )
      param         = *pchar
      pchar         = unreserved | escaped |
                      ":" | "@" | "&" | "=" | "+" | "$" | ","

However, that RFC has since been obsoleted by RFC3986 which now no longer has a normative definition for path parameters and instead just says:

Aside from dot-segments in hierarchical paths, a path segment is considered opaque by the generic syntax. URI producing applications often use the reserved characters allowed in a segment to delimit scheme-specific or dereference-handler-specific subcomponents. For example, the semicolon (";") and equals ("=") reserved characters are often used to delimit parameters and parameter values applicable to that segment. The comma (",") reserved character is often used for similar purposes. For example, one URI producer might use a segment such as "name;v=1.1" to indicate a reference to version 1.1 of "name", whereas another might use a segment such as "name,1.1" to indicate the same. Parameter types may be defined by scheme-specific semantics, but in most cases the syntax of a parameter is specific to the implementation of the URI's dereferencing algorithm.

So this gives us two problems. Firstly I now cannot find any current specification of HTTP URL path parameters and how they should be parsed. So I think the servlet spec could be relying on the grandfathered definition from RFC2396. IS this true? Does anybody know of a current specification for them?

Secondly, there are URIs that if RFC3986 is applied to (ie normalisation, then decoding and param handling) then the result is very ambiguous when received as a string. Examples include:

URI	RFC3986 normalised	common error
/foo/%2e%2e/bar	/foo/../bar	/bar
/foo/aaa%2fbbb/../bar	/foo/bar	/foo/aaa/bar
/foo/..;/bar	/foo/../bar	/bar

markt-asf commented 3 years ago

I'd love to see us make some progress on this in the next iteration of the Servlet spec.

There are a lot of edge cases and interdependencies between edge cases as well as places where the meaning RFCs may not be immediately clear and/or has been misinterpreted in the past. We have a wiki so my suggestion is that we draft some text so we have a common, agreed understanding in the wiki, with examples, and then discuss the changes we wish to make to the specification document and/or the Javadoc so that the specification is aligned with that common understanding. I'll get started on that Wiki page and I'll post again here when I have something ready to start discussing.

markt-asf commented 3 years ago

Wiki page has been created. I think the first thing to tackle is path parameters.

https://github.com/eclipse-ee4j/servlet-api/wiki/HTTP-URIs-and-Servlet-API-methods

gregw commented 3 years ago

I think we also need to consider security issues that are introduced by ill defined path parameters. Jetty recently "improved" our URI handling to be more compliant with RFC3986 and that just resulted in a bunch of CVEs as it created many issues.

For example in the URI /foo/..;/bar the middle segment is not .. so by strict RFC3986 it should not be normalized. But once we strip out the path parameter the resulting path has a pure .. segment, so it is normalized by the file system. So for example a request to /foo/..;/WEB-INF/web.xml was not blocked as it's path started with /foo not /WEB-INF, but when the path was given to the file system, the web.xml was served. There are similar problems with encoded dots and / if you strictly respect RFC3986 (eg /foo/%2e%2e/WEB-INF/web.xml & /foo%2f../WEB-INF/web.xml) so we have reverted to not following RFC3986, specifically in that we decode and then normalize. This is better for security, but it is not good that we do not follow the RFC for URI handling.

Fundamentally getServletPath() and getPathInfo() are ambiguous when they return strings that may include decoded reserved characters or removed parameters.

markt-asf commented 3 years ago

Agreed wholeheartedly re security issues. It is also important we clarify these ambiguities and get consistent behaviour between containers because inconsistent behaviour between containers can also be a source of vulnerabilities (if an app coded for one container works securely on that container it may not be secure on a different container). With respecting to %nn encoding, I was planning on looking at that after path parameters. While the two aren't independent, I think we can look at path parameters in isolation first and then, once we have a plan for path parameters, see what impact %nn encoding has on that plan. For path parameters, I've done some testing along these lines:

URI uriC = new URI("file:///home/aaa/..;/mark");
uriB = uriC.normalize();
File fileC = new File(uriC);
System.out.println(uriC.toString());
System.out.println(fileC.isDirectory());

The result is:

file:///home/aaa/..;/mark
false

so my conclusion is that if we retain path parameters, we have to retain them all the way through the processing chain and if we do that the request will fail because the context path doesn't match, the servlet path doesn't match or the file doesn't exist. That is the second option in the wiki page option. The other option is to remove them early on in the processing. I've added a comparison of the options to the wiki.

Despite arguing consistently for option 1 on the Tomcat lists for a number of years (and implementing it) I'm beginning to think that option 2 is the better option.

gregw commented 3 years ago

@markt-asf I'm not entirely sure that we can separate discussion of ; from issues associated with other difficult characters like %, #, ?, ., / and \. But I'm happy to try primarily focusing on the issues of ; and using other characters as sanity checks.

I think an exercise that would be good to help clarify how these characters should be handled in received URIs is to better define what path is in the ServletContext.getResource(String path) API. It is entirely unclear to me if it is an encoded URI path, a decoded URI path or in the file system space. Defining exactly what space this string is in will help us define getServletPath() and getPathInfo() because I think they need to return a string in the same space.

If the path parameter is a decoded URI path then:

getResource("/foo;oo/bar.txt") will return a URL of something like file://path/to/webroot/foo%3boo/bar.txt, ie a file called "bar.txt" in a directory called "foo;oo"
If there is a file called "foo/bar.txt" where the / is a character in the name and not a separator (possible on some, but not all file systems) then that resource cannot be accessed by getResource as getResource("/foo/bar.txt") will look for "bar.txt" in "/foo" and getResource("/foo%2fbar.txt") will look for a file with a % in its name (equiv to file://path/to/webroot/foo%252fbar.txt with the % itself % encoded).
On windows implementations will need to be careful that getResource("foo\bar.txt") does not find "bar.txt" in directory "foo", but instead looks for a file with the \ in it's name.

Fundamentally, the problem with the decoded URI path string space is that once decoded, you can't tell the difference between a URIs /foo/bar and /foo%2fbar. Thus if we leave in parameters, we have the same problem that it is impossible to distinguish between URIs /foo;oo/bar and /foo%3boo/bar. Thus the strings returned from the path methods and thosed passed to the getResource method are not sufficient to identify all possible resources as they have discarded some information.

If we want to be able to access all resources, the getResource(String) needs to take an encoded URI path. But then it is different to the path methods, which are documented as returning decoded strings, so we would make applications vulnerable to double decode attacks like /foo/%25./WEB-INF/secret.jsp. It only makes sense to do this if the path methods also return strings in the same space so that ServletContext.getResource(request.getPathInfo()) is a safe operation. But that is a big change that would break many/most applications that do not expect they have to do their own % encoding.

One way forward I see is:

embrace the partial space created by decoded URI paths and well define which resources are not accessible (e.g. files with ;, %, / and \ in their names cannot be accessed by normal requests)
Any request that contains a potentially ambiguous URI (eg encoded ; / or \) is by default rejected with a 400.
An application can declare that it understands partially decoded URI paths. For such applications, the path methods will return mostly decoded strings, but specific reserved characters will remain encoded if needed to disambiguate the path. The getResource method will likewise accept just those reserved characters as encoded if they are to be part of the file name or decoded if they are to be interpreted as part of the URI

markt-asf commented 3 years ago

I agree that path in ServletContext.getResource(String path) should be consistent with the values returned for getServletPath() and getPathInfo().

As I have got my head around the evolution of path parameters in the URI RFCs from the very specific in RFC 1808 to the very generic in RFC 3986, I am now in agreement with you that we need to approach path parameters and %nn decoding in combination.

I have a way forward in mind that I think is very similar to the way forward you expressed.

I am assuming that we are starting with the path component of the URI and, apart from the process of extraction, it is unchanged. The steps are as follows:

The original path component is stored for use as the return value for getRequestURI(). (This value is neither normalized nor decoded.)
Any jsessionid path paraemeter present is parsed, stored and removed from the path. This makes the use of URL re-writing for session tracking mostly transparent to the application.
Unreserved characters are %nn decoded.
Exact "/../" and "/./" sequences are normalized. As per the RFC 3986 examples, "/aaa/bbb;c=d/../eee" will be normalized to "/aaa/eee". "/..;/" would be left as-is.
The normalized path (still containing path parameters and %nn encoded reserved characters) is mapped to provide contextPath, servletPath and pathInfo. These are normalized but encoded.
Three new methods getContextPathEncoded(), getServletPathEncoded() and getPathInfoEncoded() provide access to these normalized, encoded values.
Another three new methods getContextPathDecoded(), getServletPathDecoded() and getPathInfoDecoded() take the respective normalized and encoded values. If an unencoded reserved character is present (i.e. path parameters are present) or if an encoded '/' is present, null is returned. Otherwise, %nn decoding of all %nn sequences is performed and the resulting value returned from the method.
The default servlet is expected to use the getXXXDecoded() methods to map the URI to the file system
path in ServletContext.getResource(String path) is equivalent to getXXXDecoded()
The current getContextPath(), getServletPath() and getPathInfo() methods are deprecated with a note that their behaviour was ambiguous regarding normalization and decoding and that users should refer to the documentation for their container to determine the actual behaviour of these methods for that container.

This means files that use any reserved character other than '/' (which isn't a valid character for a file name on either Linux or Windows) in their names will be accessible via a normal request.

Applications and frameworks that want to use path parameters can do so via the getXXXEncoded() methods without risk of double decoding (because when we %nn decode we only decode unreserved characters).

What I'd really like to do is to avoid deprecating getContextPath(), getServletPath() and getPathInfo() as they are so fundamental to the API. However, if we use them instead of getXXXDecoded(), the nulls will likely break stuff. It we use them instead of getXXXEncoded() the unexpected encoding will likely break stuff.

One possibility would be for the application to declare which reserved characters it wished to use as delimiters. '/' would always be a delimiter. The steps above then change to:

...
Unreserved characters and reserved characters not declared to be used by the application characters are %nn decoded.
Exact "/../" and "/./" sequences are normalized. As per the RFC 2986 examples, "/aaa/bbb;c=d/../eee" will be normalized to "/aaa/eee".
The normalized path (still containing path parameters and %nn encoded reserved characters declared to be used by the application) is mapped to provide contextPath, servletPath and pathInfo. These are normalized but encoded.
Three new methods getContextPathEncoded(), getServletPathEncoded() and getPathInfoEncoded() provide access to these normalized, encoded values.
getContextPath(), getServletPath() and getPathInfo() take the respective normalized and encoded values. If an unencoded reserved character declared to be used by the application is present (i.e. path parameters are present) or if an encoded '/' is present, null is returned. Otherwise, %nn decoding of all %nn sequences is performed and the resulting value returned from the method.
The default servlet is expected to use getServletPath() and getPathInfo() to map the URI to the file system
path in ServletContext.getResource(String path) is equivalent to getPathInfo()

%25 is an edge case in the above that would need some more thinking about in this appraoch to avoid double decoding issues but I think it is solveable.

gregw commented 3 years ago

I agree with a lot you say... but a few points inline...

I am assuming that we are starting with the path component of the URI and, apart from the process of extraction, it is unchanged. The steps are as follows:

The original path component is stored for use as the return value for getRequestURI(). (This value is neither normalized nor decoded.)

Any jsessionid path paraemeter present is parsed, stored and removed from the path. This makes the use of URL re-writing for session tracking mostly transparent to the application.

Unreserved characters are %nn decoded.

We should also specifically prohibit %uXXXX decoding. This was a rejected proposal, but there are still some usages of it, so we still offer it as a legacy mode. Would be good to kill that completely.

Exact "/../" and "/./" sequences are normalized. As per the RFC 3986 examples, "/aaa/bbb;c=d/../eee" will be normalized to "/aaa/eee". "/..;/" would be left as-is.

The normalized path (still containing path parameters and %nn encoded reserved characters) is mapped to provide contextPath, servletPath and pathInfo. These are normalized but encoded.

Three new methods getContextPathEncoded(), getServletPathEncoded() and getPathInfoEncoded() provide access to these normalized, encoded values.

The existing getContextPath() actually returns the encoded context path! (see javadoc that says: "The container does not decode this string"). Which is good for the number of new methods needed, but is going to make naming of new methods a nightmare!

The other thing to consider is do we really need the decoded path broken down into servletPath and pathInfo? I see so much code that is adding those back together. How about we just provide a new getEncodedPathInContext() method that returns servletPath+pathInfo ? Actually since this is a new method and we already have a mix of encoded and decoded return types, we could just call it getPathInContext() for simplicity and document that it returns the path with parameters any reserved characters encoded.

Another three new methods getContextPathDecoded(), getServletPathDecoded() and getPathInfoDecoded() take the respective normalized and encoded values. If an unencoded reserved character is present (i.e. path parameters are present) or if an encoded '/' is present, null is returned. Otherwise, %nn decoding of all %nn sequences is performed and the resulting value returned from the method.

I don't like even more new methods, as the API is already very fat and many of these methods might need to be replicated in request attributes, which are already too many! So instead I think we just accept that the existing getServletPath() and getPathInfo() are decoded. Instead I think we should offer some modal behaviour, where the application can declare how it wants them to work. I see the following modes (names are WIP):

LAX they just return the decoded path as they do today with params stripped - thus this mode doesn't break anybody (unless they are already broken when hit with ambiguous URIs)
STRICT they return a decoded path IFF there are no encoded reserved characters nor parameters (other than JSESSIONID which we already remove), otherwise null is returned (or perhaps an exception thrown?)
SAFE the container rejects with a 400 any request with a URI that encodes reserved characters or parameters

The default mode could be LAX if we don't want to break anybody, but I'm happy to allow the container to pick the default mode if a particular webapp does not indicate a preference.

The default servlet is expected to use the getXXXDecoded() methods to map the URI to the file system

Firstly, the default servlet cannot map URIs directly to the file system and must go via a getResource like API so it can access META-INF/resources in fragment jars etc. This also gives it the same behaviour as JspServlets etc. Secondly using the decoded path is not sufficient to resolve the ambiguous URIs like /foo%2fbar. Instead I think we need to provide a new method URI ServletContext.getUri(String encodedPath) and the default servlet needs to serve resources equivalent to URI ServletContext.getUri(request.getPathInContext()). Note this switches to URI rather than the horrid URL class, but I guess it could also be done with Path, or maybe we provide both getPath and getUri methods?

path in ServletContext.getResource(String path) is equivalent to getXXXDecoded() See above... is equivalent to the string returned by get[Servlet]Path[Info]()

The current getContextPath(), getServletPath() and getPathInfo() methods are deprecated with a note that their behaviour was ambiguous regarding normalization and decoding and that users should refer to the documentation for their container to determine the actual behaviour of these methods for that container.

See above. I'd prefer not to deprecate these methods as the vast majority of applications use them and have no need to ever see an path that contains a decoded reserved character. Thus the vast majority of applications will continue to work perfectly well with no code changes even if their container switches to STRICT or SAFE mode. The few applications that do need handle encoded reserved characters are probably working with getRequestURI anyway, but they can be modified to use the new methods if they like.

This means files that use any reserved character other than '/' (which isn't a valid character for a file name on either Linux or Windows) in their names will be accessible via a normal request.

Remember that resources can be served from META-INF/resources in a fragment jar from WEB-INF/lib. These jars can be produced on one OS and run on another and thus could contain all sorts of nasty characters that are not legal in the local filesystem. Also OS's can mount foreign filesystems, so I just don't think we can assume that a resource name will never contain /. I admit that such resources should be vanishingly rare and I can't really see a use for them (other than as some kind of exploit bomb droped into somebody elses WEB-INF/lib. Thus I'm good with an approach that basically just rejects such URIs unless the application declares that it really wants to use them.

Applications and frameworks that want to use path parameters can do so via the getXXXEncoded() methods without risk of double decoding (because when we %nn decode we only decode unreserved characters).

But we do need to provide something like my proposed URI ServletContext.getUri(String encodedPath) and/or a ServletContext.getResourceEncoded(String) method, that handle exactly only those encoded reserved characters and throws IAE if it starts seeing other % encoded characters. i.e. it accepts exactly the strings returned from the new getXxxEncoded() method(s).

What I'd really like to do is to avoid deprecating getContextPath(), getServletPath() and getPathInfo() as they are so fundamental to the API. However, if we use them instead of getXXXDecoded(), the nulls will likely break stuff. It we use them instead of getXXXEncoded() the unexpected encoding will likely break stuff.

One possibility would be for the application to declare which reserved characters it wished to use as delimiters. '/' would always be a delimiter. The steps above then change to:

URI handling should be portable and we really don't need any doubt about what are reserved characters or not, specially as the getResource[En|De]coded style APIs are expected to be fed from the results of the getPath[En|De]coded style APIs.

So in summary, I like the decoding approach you've outlined, but think we can achieve the outcome we want with:

only 2(or 3) new methods: String HttpServletRequest.getPathInContext() and URI ServletContext.getUri(String encodedPath) (and/or Path ServletContext.getPath(String encodedPath)).
Introduce modal behavior for existing get[Servlet]Path[Info]() methods so that they either: act as they do today; return null for ambiguous URIs; are not even possible to call for ambiguous URIs. The other good thing about this approach is that the modes, once defined can be applied by a container to existing applications written against the current servlet-api versions and 99.9% of webapps will work perfectly fine in STRICT or SAFE mode with no code changes.

markt-asf commented 3 years ago

Yes, I agree with pretty much all of that. It is great to feel that some progress is being made in this area.

I wasn't aware %uXXXX was even a possibility. I'd be happy to exclude that and any other weirdness. Something like:

The only permitted encoding scheme for URIs is the pct-encoded scheme defined in RFC 3986.

I'd forgotten that HttpServletRequest.getContextPath() returned the encoded path. I'm surprised given the complete pain that was to implement. Maybe I blocked it out ;). I agree that complicates naming.

Regarding splitting servletPath and pathInfo, the feedback I have from the Spring framework is that they do need them to be separate. I'm not against new methods like getPathInContext() but neither am I convinced it is necessary at this point.

I hear your objection regarding new methods and implications for request attributes. That is a good reason to try and minimise the number of new methods.

I agree the default servlet needs to use something like getResource(). My point was more about what gets passed to that method. I'm aiming for a solution that supports the widest possible set of filenames with the minimum number of special cases that require additional handling. My hope is that we can define behaviour in such a way that as many of the ambiguities as possible are designed out. For example so that using a path parameter (other than jsessionid) with a static resource would result in a 404.

My thinking around allowing applications to effectively remove some characters from the reserved set was to assist with migration with what was going to be a stricter regime. If the consensus is that that would create more ambiguity/problems than it solves, I'm happy to leave that as a potential container specific feature.

I like the idea of allowing the application to operate in different modes. That provides a lot more flexibility without having to add a new set of methods for each mode. Regarding each mode:

LAX: Maybe call this LEGACY? I agree this should be the "same as Servlet 5" mode. My only comment is that getContextPath() won't be decoded in this mode. The behaviour of this mode may be slightly different for different containers depending on how they implemented the methods in prior versions.
DECODED: new mode. Should be very similar to LAX (depending on container behaviour in prior versions) but like getServletPath() and getPathInfo(), getContextPath() also has parameters stripped, is decoded and normalized.
STRICT: I'd prefer returning null rather than throwing an exception if %nn and/or parameters are present as I think this comes under "Use exceptions only for exceptional conditions"
SAFE: I think this is very similar to STRICT (since I'm assuming the application will treat a 'null' being returned in STRICT as an error condition) but moves the rejection of the request earlier in the processing. I'd suggest a 404 rather than a 400 since that suggests the request is not a valid request rather than something the server is chossing not to respond to. I've been trying to think of a use case where STRICT would be chosen over SAFE and I can't. Did you have a use case in mind?
ENCODED: new mode. Non-reserved characters %nn decoded. Exact "/../" and "/./" sequences are normalized. Encoded reserved characters and path parameters remain.

After some further experiments with new File(URI), I agree there is a need for something like URI ServletContext.getUri(String encodedPath) to bridge the gap between the possibly encoded and/or containing parameters path returned by getServletPath() and getPathInfo() and methods like ServletContext.getResource(String) and friends. However, I think it needs to return String so that can then be passed into the getResource methods. I'm not sure about the name but something like String ServletContext.getResourcePath(path) where:

if path contains any reserved characters apart from / the method returns null (this effectivly blocks parameters)
if path contains %2f the method returns null. We could add a boolean parameter to allow this but I can't think of a use case for this but I can think of plenty of security exploits if it were allowed.
decode all remaining %nn sequences
return the remaining string

If the returned string was then used in ServletContext.getResource(String) and friends, it should mean that files that happen to use reserved characters in their paths would be accessible.

If we included the behaviour of ServletContext.getResource(String) and friends in the definition of the operating modes discussed above, we could potentially remove the need for the new conversion method by having the container do this conversion inside ServletContext.getResource(String) and friends. It would only be needed for the new ENCODED mode.

gregw commented 3 years ago

I think the use-case for a mode like STRICT vs SAFE is a webapp that has multiple servlets, some of which that are prepared to deal with encoded paths, but others that have not been updated to do so. In such a mixed mode deployment, the updated servlets could use new methods to access the encoded path, but if any requests with encoded paths arrived at non-updated servlets, then they would either throw or get a null if they used the existing path methods.

Ultimately, this scenario shows the limits of the modal approach, in that the mode really doesn't apply to the whole webapp, but should apply servlet by servlet and filter by filter. However I don't think that fine grained approach would be workable, as what do you do if a request is mapped to a servlet that is ENCODED mode, but also matches a filter that declares it is in DECODED mode. Thus I think the modes need to be container wide. But I also think that having a new API to get the resesrved-only-encoded form would allow filters/servlets to be written that will work correctly regardless of mode.

So I like your modes (and their names), but they only apply to the existing path methods and to the interpretation of the existing getResource(String) methods. We then need a new methods like String HttpServletRequest.getUriPathInContext() and URI ServletContext.getURI(String uriPathInContext) that work on unambiguous reserved-only-encoded forms that will work in exactly the same way regardless of the mode. We then can have:

SAFE is used for 99.99% of applications that don't care about strangely encoded paths
STRICT can be used by applications that have some filters/servlets updated to use new APIs but some filters/servlets that are not updated but need to be protected from strange encodings.
ENCODED can be used by applications that have updated all filters/servlets to be able to handle reserved-only-encoded forms.
LEGACY is for the 0.001% of applications that really need the current bad behaviour
DECODED is the one I can't really see a use-case for, but it's existance is kind of symmetric with ENCODED

So we could get away with just SAFE, STRICT and LEGACY if we wanted to minimize the number of modes.

gregw commented 3 years ago

I've been thinking about the modes and I'd really like to keep them to a minimum.

I think we can get by with just a boolean legacy mode in which the getServletPath and getPathInfo methods return fully decoded paths and the getResource and getRealPath methods expect fully decoded paths as well.

When not in legacy mode then getContextPath, getServletPath and getPathInfo return strings in which only URI reserved characters and ; are % encoded. The getResource and getRealPath methods expect similarly reserved-only encoded strings (any encoded non-reserved character is an IAE).

The behaviours of other modes described above can be obtained by filters or perhaps new security constraint types. To assist with that, it would be good to have a new method boolean HttpServletRequest.isFullyDecoded() that would return true if there are no encoded reserved characters in the URI (name is WIP).

The I still think we need the new String HttpServletRequest.getPathInContext() and URI ServletContext.getURI(String pathInContext) that always work in encoded-non-reserved mode regardless of the legacy mode.

We could even avoid having an explicit legacy mode switch and instead just use the version of the web.xml file (with no web.xml defaulting to 6.0 semantics).

In summary, in a 6.0 webapp, all the following methods would work in only-reserved-characters-encoded mode: getContextPath(), getServletPath(), getPathInfo(), getPathInContext(), getResource(String), getRealPath(String), getURI(String). In legacy mode, the pre-existing methods work as they always did, but the new methods work in only-reserved-characters-encoded mode.

markt-asf commented 3 years ago

I've been thinking about the impact this discussion has on mapping requests. But first I'd like to respond to your points.

I agree reducing modes is good. I've no objection to dropping STRICT and SAFE. The more I thought about it, the less comfortable I was with the idea of rejecting valid URIs. LEGACY and DECODED should be very similar. My thinking in adding DECODED was to cover the case where a container's current behaviour was closer to ENCODED. However, I haven't seen any evidence that any container behaves like that so DECODED could be dropped on the basis the use case for it is purely theoretical. That leaves LEGACY and ENCODED as you suggested. I'm not sure about using a boolean to control the mode. I'd like to leave open the possibility of containers adding their own custom mode(s) and/or us adding an additional mode as this discussion evolves. An Enum was my first thought but that isn't extensible. Maybe a string where values that start X- are reserved for container specific extensions and all other names are reserved for the specification.

For LEGACY mode, I think we are in agreement if by "fully decoded paths" you mean:

extract jsessionid if any, removing it from the URI
remove all remaining path parameters
decode everything apart from %2f
normalize
map to Servlet
determine paths based on the Servlet the request was mapped to

For ENCODED mode, the above becomes:

extract jsessionid if any, removing it from the URI
decode all %nn sequences excluding RFC 3986 reserved characters and '%25for%`
normalize
map to Servlet
determine paths based on the Servlet the request was mapped to

Which brings us nicely to the topic of mapping. In LEGACY mode mapping is easy. In ENCODED mode, any of the path components (context path, servlet path and path info) may have path parameters. We need to figure out how to handle this. Possible options are:

ignore path parameters when mapping
introduce some form of URI pattern matching that accounts for path parameters in <url-pattern> elements
only allow path parameters in path info, rejecting requests with a 404 if they appear elsewhere
soemthing else

Spring supports path parameters in the servlet path and path info. If we want to support that (and I think we should) that rules out option 3. Option 2 would mean defining how path parameters work. My concern is that there will be multiple different approaches to this currently in use. Therefore, I like option 1 which treats the path parameters as opqaue data that the application processes as it sees fit. One note on security. The URI must not be normalised after removal of path parameters else segments like /..;/ could be used for a path traversal attack.

We also need to decide if the vales in <url-pattern> elements should have reserved characters encoded or not. I think either will work (I haven't thought of a case that does yet) but for consistency my current preference is for encoding reserved characters here.

I had been thinking that the mode would be per web application. Your comments made think about the practicalities of per servlet or per container. I think a per container approach will create issues as different applications will inevitably update to the newer approach at different rates. With a mapping approach that ignores path parameters, I think we can safely consider mode at the web application level. I think we could implement mode at the Servlet level. I agree with your assessment that this would complicate things for Filters that are used with Servlets in different modes. I think those compications are surrmountable - essentially the servlet needs to expose the mode it is configured to use so the Filter can adjust accordingly. There are also a bunch of other ways applications could address (or just avoid) this issue. From an implementation perspective, I'm not sure per Servlet is that more (any?) more complicated than per web application.

An advantage of setting the mode per Servlet is that I think that removes the need for the additional methods you proposed. I am rather wary of adding extra methods and a solution that addresses the current ambiguities without the need for extra methods is very tempting. I also like flexibility this provides for larger applications that may need to migrate to the new way of doing things in stages.

I think the final part of this puzzle is HttpServletMapping. If we choose option 1 for mapping requests then that would mean the current behaviour of HttpServletMapping is unchanged.

What was your thinking in requring an IAE if getResource() and friends were called with a path that included a %nn encoded non-reserved character?

I think we have at least mentioned all the areas affected by these changes. Can you think of any we've missed? Once we have the majority of this agreed I plan to start implementing this in Tomcat to confirm that what we come up with is practical to implement. I suspect that will result in a little fine-tuning.

gregw commented 3 years ago

I'll digest your main responses tomorrow.... but I'll answer the easy one below.

What was your thinking in requring an IAE if getResource() and friends were called with a path that included a %nn encoded non-reserved character?

The idea is that those methods should take precisely the strings returned from the path methods. Currently they accept decoded paths, so if there is a % character in the arg, it is considered part of the file name. If we start allowing arbitrary characters to be encoded, then we are more likely to break the few apps that do actually pass % characters. If we only accept % encoding for precisely the reserved characters that the path methods leave encoded then that a) forces us to very well define those characters; b) makes minimal changes to how those methods are used today (for the 99.999% case); c) any modes introduced are exactly the sane for what the path methods return and what the resource methods take.

gregw commented 3 years ago

I've been thinking about the impact this discussion has on mapping requests. But first I'd like to respond to your points.

That leaves LEGACY and ENCODED as you suggested. I'm not sure about using a boolean to control the mode. I'd like to leave open the possibility of containers adding their own custom mode(s) and/or us adding an additional mode as this discussion evolves. An Enum was my first thought but that isn't extensible. Maybe a string where values that start X- are reserved for container specific extensions and all other names are reserved for the specification.

How this is handled all depends on where we see the future. Do we want an arbitrary Servlet-9.0 application to be able to continue to choose between the modes for the path methods - ie LEGACY is an entirely valid mode into the future; or is that the future of the path methods is to return the reserve-encoded form that we define in 6.0 and we only offer a legacy mode to ease the transition. I favour the later as carrying legacy modes over multiple major releases will result in the need to version the legacy modes.

So if we say the path methods from 6.0 onwards are intended to return the reserve-encoded form, then I think it is entirely valid to use the existence of a < 6.0 web.xml to trigger the legacy mode. We can then add some verbage to say that a container is free to all the mode to be specifically configured and to override the web.xml. If we go this way, then we don't even need to define named modes a LEGACY and ENCODED. Instead we just define the bahaviour of these methods in 6.0 and note that they behaive differently if there is a <6.0 web.xml present.

For LEGACY mode, I think we are in agreement if by "fully decoded paths" you mean:

extract jsessionid if any, removing it from the URI

remove all remaining path parameters

decode everything apart from %2f

Nope - I think everything is decoded, including %2f. Do you currently leave %2f encoded?

normalize

map to Servlet

determine paths based on the Servlet the request was mapped to

For ENCODED mode, the above becomes:

extract jsessionid if any, removing it from the URI

decode all %nn sequences excluding RFC 3986 reserved characters and '%25for%`

normalize

map to Servlet

determine paths based on the Servlet the request was mapped to

Yep, modulo treatment of ; below.

Which brings us nicely to the topic of mapping. In LEGACY mode mapping is easy. In ENCODED mode, any of the path components (context path, servlet path and path info) may have path parameters. We need to figure out how to handle this. Possible options are:

ignore path parameters when mapping

introduce some form of URI pattern matching that accounts for path parameters in <url-pattern> elements

only allow path parameters in path info, rejecting requests with a 404 if they appear elsewhere

soemthing else

Spring supports path parameters in the servlet path and path info. If we want to support that (and I think we should) that rules out option 3. Option 2 would mean defining how path parameters work. My concern is that there will be multiple different approaches to this currently in use. Therefore, I like option 1 which treats the path parameters as opqaue data that the application processes as it sees fit. One note on security. The URI must not be normalised after removal of path parameters else segments like /..;/ could be used for a path traversal attack.

What about option 5. Treat ; characters other than as part of a JSESSIONID as just a normal character in a segment? According to the RFCs, path parameters are no longer a thing generally. So we should take out the one for session ID and leave all others in the path. For mapping, we would then allow a filter/servlet to be mapped to something like /foo;bar where ; is treated no differently to any other character. If we really REALLY wanted to do more, then perhaps we could introduce a new mapping like /foo;* which would match any URI like /foo, /foo/, /foo;anything, but not /foo/bar or /foo;anything/bar. We don't support any mid patterns like /foo/*/bar or /foo*/bar already, so I don't see why we should start for parameters.

We also need to decide if the vales in <url-pattern> elements should have reserved characters encoded or not. I think either will work (I haven't thought of a case that does yet) but for consistency my current preference is for encoding reserved characters here. +1 And I'd prefer it to be an error if any other characters are encoded.

I had been thinking that the mode would be per web application. Your comments made think about the practicalities of per servlet or per container. I think a per container approach will create issues as different applications will inevitably update to the newer approach at different rates. With a mapping approach that ignores path parameters, I think we can safely consider mode at the web application level. I think we could implement mode at the Servlet level. I agree with your assessment that this would complicate things for Filters that are used with Servlets in different modes. I think those compications are surrmountable - essentially the servlet needs to expose the mode it is configured to use so the Filter can adjust accordingly. There are also a bunch of other ways applications could address (or just avoid) this issue. From an implementation perspective, I'm not sure per Servlet is that more (any?) more complicated than per web application.

Hmmmmm maybe. But I think this just complicates things like asynchronous threads that call the path methods. Already it is a problem that an async thread looking at a request that is passed to a request dispatcher will see different values in a race with the handling of the request in and out of the dispatcher. If we make this per servlet, then an async thread will see different encoding values in a race. The real fix for this is to get rid of object-identity-wrapping so that async threads can be passed a container-wrapped request and the values they see from the path methods would never change.... but that is another discussion.

Ultimately, it comes down to if this mode is something valid going forward or just some legacy help to get applications ported to the new 6.0 behaviour. I think the later, so I think having a context wide mode is fine. If there are really some filters/servlets that can run in that mode then we can provide utility wrappers that will take a 6.0 request running in the new mode and decode any reserved characters to provide the <6.0 behaviour (in fact I was already thinking of providing this mode with a container provided wrapper prior to the first dispatch - hence not making core code horridly modal).

An advantage of setting the mode per Servlet is that I think that removes the need for the additional methods you proposed. I am rather wary of adding extra methods and a solution that addresses the current ambiguities without the need for extra methods is very tempting. I also like flexibility this provides for larger applications that may need to migrate to the new way of doing things in stages.

I think the new methods I'm proposing can be justified even without this problem. Too frequently are servletPath and pathInfo recombined to a pathInContext. The resource methods really should be usable with URI and/or Path rather than the antique URL. They a good improvements anyway, and I just really a subject in this discussion because they can be specified with the new better behaviour and never need to be modal.

I think the final part of this puzzle is HttpServletMapping. If we choose option 1 for mapping requests then that would mean the current behaviour of HttpServletMapping is unchanged. Or option 5.

I think we have at least mentioned all the areas affected by these changes. Can you think of any we've missed? Once we have the majority of this agreed I plan to start implementing this in Tomcat to confirm that what we come up with is practical to implement. I suspect that will result in a little fine-tuning.

How about prior to implementing in Tomcat we create a branch of the api/spec and a draft PR. We can word craft the spec/javadoc/api changes we are describing above in that PR and that will ensure we are really on the same page (I think we are currently in the same chapter of the same book, so that's good). In parallel we can try out that branch in Tomcat and Jetty.

markt-asf commented 3 years ago

What was your thinking in requiring an IAE if getResource() and friends were called with a path that included a %nn encoded non-reserved character?

The idea is that those methods should take precisely the strings returned from the path methods. Currently they accept decoded paths, so if there is a % character in the arg, it is considered part of the file name. If we start allowing arbitrary characters to be encoded, then we are more likely to break the few apps that do actually pass % characters. If we only accept % encoding for precisely the reserved characters that the path methods leave encoded then that a) forces us to very well define those characters; b) makes minimal changes to how those methods are used today (for the 99.999% case); c) any modes introduced are exactly the sane for what the path methods return and what the resource methods take.

For getResource() and friends, I think the way we map the supplied path to a resource should mirror the way we map a request path to a Servlet. Same normalization, decoding etc. I don't see how an IAE would work with such an approach.

markt-asf commented 3 years ago

How this is handled all depends on where we see the future. Do we want an arbitrary Servlet-9.0 application to be able to continue to choose between the modes for the path methods - ie LEGACY is an entirely valid mode into the future; or is that the future of the path methods is to return the reserve-encoded form that we define in 6.0 and we only offer a legacy mode to ease the transition. I favour the later as carrying legacy modes over multiple major releases will result in the need to version the legacy modes.

So if we say the path methods from 6.0 onwards are intended to return the reserve-encoded form, then I think it is entirely valid to use the existence of a < 6.0 web.xml to trigger the legacy mode. We can then add some verbage to say that a container is free to all the mode to be specifically configured and to override the web.xml. If we go this way, then we don't even need to define named modes a LEGACY and ENCODED. Instead we just define the bahaviour of these methods in 6.0 and note that they behave differently if there is a <6.0 web.xml present.

I'm happy to treat LEGACY mode as a transitional mode. Given that each container probably does something slightly different at the moment, I think there are multiple benefits to treating legacy compatibility - including how it is enabled - as a container specific extension.

For LEGACY mode, I think we are in agreement if by "fully decoded paths" you mean:

extract jsessionid if any, removing it from the URI

remove all remaining path parameters

decode everything apart from %2f

Nope - I think everything is decoded, including %2f. Do you currently leave %2f encoded?

Yes. We have had various requests for this. Most recently: https://bz.apache.org/bugzilla/show_bug.cgi?id=62459 The short version is that there needs to be a way to pass / where it isn't treated as the segment delimiter. %2f is the only way to do that.

But since this is LEGACY mode then assuming that we agree to leave legacy support as container specific then different containers continuing to take different approaches is OK.

Which brings us nicely to the topic of mapping. In LEGACY mode mapping is easy. In ENCODED mode, any of the path components (context path, servlet path and path info) may have path parameters. We need to figure out how to handle this. Possible options are:

ignore path parameters when mapping

introduce some form of URI pattern matching that accounts for path parameters in <url-pattern> elements

only allow path parameters in path info, rejecting requests with a 404 if they appear elsewhere

something else

Spring supports path parameters in the servlet path and path info. If we want to support that (and I think we should) that rules out option 3. Option 2 would mean defining how path parameters work. My concern is that there will be multiple different approaches to this currently in use. Therefore, I like option 1 which treats the path parameters as opqaue data that the application processes as it sees fit. One note on security. The URI must not be normalised after removal of path parameters else segments like /..;/ could be used for a path traversal attack.

What about option 5. Treat ; characters other than as part of a JSESSIONID as just a normal character in a segment? According to the RFCs, path parameters are no longer a thing generally.

I have a very different reading of the RFCs. My reading is that they RFCs have been getting increasingly generic in what is allowed as a path parameter. See the final paragraph of RFC 3986, section 3.3

So we should take out the one for session ID and leave all others in the path. For mapping, we would then allow a filter/servlet to be mapped to something like /foo;bar where ; is treated no differently to any other character. If we really REALLY wanted to do more, then perhaps we could introduce a new mapping like /foo;* which would match any URI like /foo, /foo/, /foo;anything, but not /foo/bar or /foo;anything/bar. We don't support any mid patterns like /foo/*/bar or /foo*/bar already, so I don't see why we should start for parameters.

Path parameters aren't common, but where they are used, typical use cases are more complex than this. For example; /foo;a=1;b=2/bar. The Spring docs have more examples. Given the current usages, I can't see a solution, other than option 1, that would work.

We also need to decide if the vales in <url-pattern> elements should have reserved characters encoded or not. I think either will work (I haven't thought of a case that does yet) but for consistency my current preference is for encoding reserved characters here. +1 And I'd prefer it to be an error if any other characters are encoded.

Agreed.

I had been thinking that the mode would be per web application. Your comments made think about the practicalities of per servlet or per container. I think a per container approach will create issues as different applications will inevitably update to the newer approach at different rates. With a mapping approach that ignores path parameters, I think we can safely consider mode at the web application level. I think we could implement mode at the Servlet level. I agree with your assessment that this would complicate things for Filters that are used with Servlets in different modes. I think those compications are surrmountable - essentially the servlet needs to expose the mode it is configured to use so the Filter can adjust accordingly. There are also a bunch of other ways applications could address (or just avoid) this issue. From an implementation perspective, I'm not sure per Servlet is that more (any?) more complicated than per web application.

Hmmmmm maybe. But I think this just complicates things like asynchronous threads that call the path methods. Already it is a problem that an async thread looking at a request that is passed to a request dispatcher will see different values in a race with the handling of the request in and out of the dispatcher. If we make this per servlet, then an async thread will see different encoding values in a race. The real fix for this is to get rid of object-identity-wrapping so that async threads can be passed a container-wrapped request and the values they see from the path methods would never change.... but that is another discussion.

Ultimately, it comes down to if this mode is something valid going forward or just some legacy help to get applications ported to the new 6.0 behaviour. I think the later, so I think having a context wide mode is fine. If there are really some filters/servlets that can run in that mode then we can provide utility wrappers that will take a 6.0 request running in the new mode and decode any reserved characters to provide the <6.0 behaviour (in fact I was already thinking of providing this mode with a container provided wrapper prior to the first dispatch - hence not making core code horridly modal).

As above, I think making legacy mode support outside the spec container specific simplifies things a lot.

I think the new methods I'm proposing can be justified even without this problem. Too frequently are servletPath and pathInfo recombined to a pathInContext. The resource methods really should be usable with URI and/or Path rather than the antique URL. They a good improvements anyway, and I just really a subject in this discussion because they can be specified with the new better behaviour and never need to be modal.

I'm coming around to the idea of these. I'm not quite convinced yet though as I wonder how much of the requirement for pathInContext is caused by the inconsistencies we are working on solving here.

I think the final part of this puzzle is HttpServletMapping. If we choose option 1 for mapping requests then that would mean the current behaviour of HttpServletMapping is unchanged. Or option 5.

See above.

I think we have at least mentioned all the areas affected by these changes. Can you think of any we've missed? Once we have the majority of this agreed I plan to start implementing this in Tomcat to confirm that what we come up with is practical to implement. I suspect that will result in a little fine-tuning.

How about prior to implementing in Tomcat we create a branch of the api/spec and a draft PR. We can word craft the spec/javadoc/api changes we are describing above in that PR and that will ensure we are really on the same page (I think we are currently in the same chapter of the same book, so that's good). In parallel we can try out that branch in Tomcat and Jetty.

Yes, I think that lines up with what I was thinking. Given the history in this area, making sure we really are in agreement and not interpreting things differently is key.

markt-asf commented 3 years ago

@gregw Does this bring us close enough to start on a PR?

gregw commented 3 years ago

Sure! I'll put something up later today that we can all throw some mud at and see how it comes out! Stand by.....

gregw commented 3 years ago

This comment is an attempt to concisely summarise all the discussions and possible solutions for this issue prior to a meeting to discuss them.

Background The issues discussed here are a result of a combination of factors:

The current URI RFC3986:
- defines dot-segments in 3.3 as only being literally only . or .. and not as the result of any prior decoding or stripping of parameters.
- unlike the obsoleted URI RFC2396 3.3, path parameters are not well defined and are only passingly referenced as possible scheme specific component.
The current servlet specification:
- requires that methods like [getServletPath](https://docs.oracle.com/javaee/6/api/javax/servlet/http/HttpServletRequest.html#getServletPath()) return a full decoded string.
- the spec requires that path parameters are excluded from the path used to match servlets.
- Note however that it does suggest in 3.1 that path parameters should be return from methods like pathInfo, however I believe it is common practise to exclude all parameters from methods like pathInfo.

The Problems The problems that result from the current specification can be roughly grouped into 4 broad categories below :

What URIs are rejected by the container for accessing restricted resources?
The following are examples of URIs that should be rejected by the container, even though a strict reading of RFC3986 might indicate that they are not actually requesting the protected resource:
- /context/foo/%2e%2e/WEB-INF/web.xml
- /context/foo/..;/WEB-INF/web.ml
- /context/WEB-INF%2Fweb.xml
- /context/WEB-INF%00/web.xml
- /context/WEB-INF%5Cweb.xml
- /context/%2e%2e/%2e%2e/%2e%2e/%2e%2e/etc/passwd
- /context/WEB-IN~1.DIR/web.xml
These URIs can be rejected by the container because the identified resource is known to be protected and/or outside of the docroot. However there is nothing in the specification that clearly states these should be rejected or how they should be (404? 400?). This defence can also have issues with reasonable usage of symlinks.
How URIs are matched to security constraint for resources? Consider the following URIs against a forbidden security constraint of /secret/*:
- /context/foo/%2e%2e/secret/private.xml
- /context/foo/..;/secret/private.xml
- /context/secret%2F/private.xml
- /context/secret%00/private.xml
- /context/secret%5Cprivate.xml
- /context/sEcRet/private.xml (maybe out of scope for this issue?)
- /context/SECRET~1.DIR/private.xml (maybe out of scope for this issue?)
These do not match the /secret/* security constraint, but are typically rejected by the containers default servlet because the identified resource is known to by a different canonical name than the one requested. But if the request is not handled by the containers default servlet, then this protection does not apply.
How might common libraries incorrectly interpret the strings returned from getServletPath() and getPathInfo()? If the examples in 2. are not considered to match the security constraint then the strings returned from the request method may be passed to library methods such as File, Path, or other third party libraries. If the URIs are fully decoded, then all of the example URIs contain characters that are likely to be misinterpreted by the library methods:
- URIs with segments like %2e%2e or ..; will be seen as just .. segments and treated as a relative dot-segment by File and Path
- a decoded %2F will look like / and will be treated as a separated by all uri/path methods
- a decoded %5C will be seen as a \ and may be treated as a separate on some file systems
- a decoded %00 or similar special character may be ignored or treated specially by some file systems or libraries that call into native code.
- Some file systems may also have special interpretations of characters like ; or @ that may have similar problems.
How the container methods like getResource(String) or `getRequestDispatcher(String) might interpret strings obtained either from the container, from the application or a combination of both? URIs that are deemed safe by 1., 2. and 3. may still be decoded to paths that are not safe to pass to some servlet-api methods:
- The problematic characters described in 3. may also be problematic for implementations of ServletContext.getResource(String) as the implementation is likely to do further normalization. Some containers protect from some of these cases by making getResource not return a resource if requested by a non-canonical name.
- methods like ServletContext.getRequestDispatcher(String) may be vulnerable to double decode attacks:
  - /context/foo/%252E%252E/WEB-INF/web.xml which is decoded to /foo/%2e%2e/WEB-INF/web.xml and then used to obtain a request dispatcher from getRequestDispatcher(String), where the string is interpreted as allowing characters to be encoded.
  - /context/secret%3fa=1/private.xml will pass the /secret/* security constraint but be decoded to /secret?a=1. (maybe out of scope for this issue?)
  - /context/SECRET~1.DIR/private.xml will pass the /secret/* security constraint but is 8.3 name for /secret directory. (maybe out of scope for this issue?)

Possible Solutions These are some possible solutions, some of which are already implemented by some containers:

Reject requests with ambiguous/suspicious URIs. E.g. any request has %2F or %2e%2e in its URI is rejected. This will make existing applications safe, but could prohibit some applications from validly accessing these URIs. Some containers already do this.
Restrict the existing path methods to only return safe unambiguous paths and to throw otherwise. Applications wanting to access such URIs will need to use either getRequestURI() or new APIs
Restrict the resource handling performed by the default servlet and getResource to only return resources if requested by canonical path. This can protect applications but can cause frustrations with applications that wish to run on case insensitive file systems or to use synlinks. Some containers already do this.
Perform additional normalization in addition to that required by RFC3986. Specifically resolve dot-segments after decoding and parameter removal, but before the application of security constraints. This will make existing applications safe, but could prohibit some applications from validly accessing some URIs. Some containers already do this.
Avoid decoded string representation of paths by provide a new class that represents an unambiguous interpretation of a path. The new class can precisely give access to each segment without ambiguity caused by encoding, case insensitivity, filesystems etc. new resource/dispatcher methods would be provided that take the new class as well as safe methods to convert to/from Strings, URI, Files, Paths etc.
Require containers to calculate a canonical URI for a resource, using the file systems interpretation of every segment (potentially as part of the new class in the point above). Security constraints would then be applied to the canonical URI rather than the requested URI. This could be very safe, but could also be inefficient for request handling that is not intended to interact with the file system

Edit: added case and 8.3 name alias examples. May be out of scope for this issue, but are kind of related as can have some of the same defences.

gregw commented 3 years ago

@markt-asf @stuartwdouglas I've sent an invite for a call time... if that works for you guys, we can open up to all.

With regards to the aims of the changes, I think lots of far ranging ideas have been discussed, from minimal to significant changes. I think we need to focus the discussion on the following:

what should the existing path methods return (if anything) for all the examples of problematic APIs.
how should methods taking paths react if passed those returned values?
how should methods taking paths react if passed application encoded URIs that are problematic
What are the types of applications we will break by those changes and can we mode the container to make them work, or is new API needed?
What is the best-practise way for applications to safely access complex URIs and to pass resulting values to resource/dispatcher methods?

markt-asf commented 3 years ago

That works for me. Thanks for setting it up. I agree we need answers to those questions. As we have found, everything is inter-related so it is hard to discuss any one of those in isolation. I think there are a couple of relatively simple issues we can discuss first that - assuming we agree an approach - should reduce the complexity of what is left. My current thinking is based around looking at how we get from a received URI to mapping a request to a servlet, filters and security constraints. If you think of this process as a series of stages then answers to the questions 1 to 3 above then take the form of "take the URI as it is at stage X and return the relevant segments" or "the method expects a URI in the form expected at stage Y".

gregw commented 3 years ago

The following table might also focus our discussion.

Consider the following URIs with a forbidden security constraint at /secret/*.
The following files exist in the docroot in a case insensitive file system:

WEB-INF/web.xml
secret/private.xml
public/file.txt
public\file.txt (on unix as a file with \ in the name) OR public/file.txt (on windows as a file with / in the name) There is a servlet registered at /dispatch/* that forwards request to the string return from pathInfo()

URI	Container	get[Servlet]Path[Info]	DefaultServlet
/public/file.txt		/public/file.txt	200
/public//file.txt		/public//file.txt	200 or 404 non canonical ? [1]
/PUBLIC/file.txt	/PUBLIC/file.txt	200 or 404 non canonical ?
/public%2Ffile.txt	400? [3]	/public/file.txt or throw [4]	200 which file ?
/public%5Cfile.txt	400? [3]	/public\file.txt	200 which file?
/public%00/file.txt		/public␀/file.txt	404
/public/./file.txt		/public/file.txt	200
/public/.;/file.txt	400? [3]	/public/file.txt	200?
/public/%2e/file.txt	400? [3]	/public/./file.txt	404 non canonical?
/public/%2e;/file.txt	400? [3]	/public/./file.txt	404 non canonical?
/../docroot/public/file.txt	404
/public/dir/../file.txt		/public/file.txt	200
/public//../file.txt		/public/file.txt	200
/public/dir/..;/file.txt	400? [3]	/public/file.txt	404 non canonical?
/public/dir/%2e%2e/file.txt	400? [3]	/public/../file.txt or throw? [4]	404 non canonical?
/public/dir/%2e%2e;/file.txt	400? [3]	/public/../file.txt or throw? [4]	404 non canonical?
/WEB-INF/web.xml	404
/web-inf/web.xml	404
/WEB-IN~1.DIR/web.xml			404 non canonical ?
/WEB-INF;/web.xml	404
/WEB-INF%2Fweb.xml	404???
/WEB-INF%5Cweb.xml		/WEB-INF\web.xml	404 or 404 non canonical?
/WEB-INF%00/web.xml		/WEB-INF␀/web.xml	404 non canonical ?
/WEB-INF/./web.xml	404
/public/../WEB-INF/web.xml	404
/public/..;/WEB-INF/web.xml	404
/public/%2e%2e/WEB-INF/web.xml		/public/../WEB-INF/web.xml	404 non canonical?
/public/%2e%2e;/WEB-INF/web.xml		/public/../WEB-INF/web.xml	404 non canonical?
/secret/private.xml	403
/SeCreT/private.xml		/SeCreT/private.xml	404 non canonical
/SECRET~1.DIR/private.xml			404 non canonical?
/secret;/private.xml	403
/secret%2Fprivate.xml	403? [2] or 400?	/secret/private.xml	???
/secret%5Cprivate.xml	403? [2]	/secret\private.xml	???
/secret%00/private.xml	400?	/secret␀/private.xml	404 non canonical?
/./secret/private.xml	403
/.;/secret/private.xml	403
/%2e/secret/private.xml	400?	/./secret/private.xml or throw?	404 non canonical?
/%2e;/secret/private.xml	400?	/./secret/private.xml or throw?	404 non canonical?
/public/../secret/private.xml	403
/public/..;/secret/private.xml	403
/public/%2e%2e/secret/private.xml	400?	/public/../secret/private.xml or throw?	404 non canonical?
/public/%2e%2e;/secret/private.xml	400?	/public/../secret/private.xml or throw?	404 non canonical?
/dispatch/public/file.txt		/public/file.txt	200
/dispatch/public%2Ffile.txt	400?	/public/file.txt	?
/dispatch/public%5Cfile.txt	400?	/public\file.txt	?
/dispatch/public%252Ffile.txt	400?	/public%2Ffile.txt	?
/dispatch/WEB-INF/web.xml		/WEB-INF/web.xml	200
/dispatch/secret/private.xml		/secret/private/xml	200
/dispatch/%2E%2E/%2E%2E/etc/password	400?	/../../etc/password	500

[1] Should the 404 non canonicals be redirections to the canonical name or is that leaking info? [2] Can we somehow determine the canonical name prior to applying security constraints? [3] For suspect URIs the container can reject with 400. But if container is configured to allow through, then do we need to define behaviour? [4] If suspect URIs are allowed into the container, should we throw if unsafe methods are used to access them?

joakime commented 3 years ago

Considering how the current HTTP/1.1 spec has no support for the concept of a path parameter, should it's role be HTTP version specific? Such as having HTTP/2 and HTTP/3 should not treat the ';' as a path parameter?

gregw commented 3 years ago

@markt-asf @stuartwdouglas I started editing the wiki page but soon realised I have more questions that need to be discussed before I can write proposed text.

I think we are agreed that we wish to define the default behaviour, plus typical deviations from that which may be allowed by some containers. Given that, I believe those configurations should be a on a per context basis, as different webapps on the same server may have different criteria for URI handling. Given that, we need to define exactly when URI canonicalization takes place with regards to mapping to contexts. Moreover, I don't think we can define rejections within the canonicalization process, because some contexts may not reject... however a context may reject a request due to something found during prior canonicalization (as an optimization rather than re-parsing).

In general, I think the process needs to be something like:

Identify what is the URI - as per step 1. in the wiki
Canonicalize there URI - as per wikie steps 2-6, except that step 6 can be configured with a leave-encoded-set of characters, which is by default empty, but may be configured to include / and % (must always contain % if non empty). See examples below.
Map to context (and filters and servlets)
Context may reject with 400 URIs whose encoded form contained specific characters, sequences or segments defined in the reject-set. By default the reject set will include %2f, \, %5c, /.;, /..;, , /%2e, /%2e%2e, perhaps some more. Containers can configure a context's reject-set.
Contexts may be configured with a throw-set of encoded characters, sequences or segments that will cause the getServletPath and getPathInfo methods to throw rather than return. By default this will be identical to the reject-set and thus by default an application will never throw from these methods. Only if a char/sequence/segment is removed from the reject-set, but not removed from the throw-set may these methods throw.

So some examples:

if an application wanted to have %2F returned from getPathInfo() then the container would be configured to add '/' to the leave-encoded-set and the context configured to remove %2f from the reject-set and throw-set of the context that would handle them. Other contexts would still be safe from %2F as by default it is in their reject set.
if an application wants to handle %2F but only from getRequestUri(), then the container would be configured to add '/' to the leave-encoded-set and the context configured to remove %2f from the reject-set, but left in the throw-set.

I don't like having the leave-encoded-set a container rather than context configuration, but I can't see how else to do it.

Also I have a concern/question about getContextPath() and how that is specified to return the non-decoded context. What does that mean for a URI like /foo/../context/servlet/info? Is the context path for that request /context or /foo/../context ?

I need to think a bit more how to best specify the reject-set and throw-set, as they need to deal with complexities like:

case insensitive encodings: /%2e%2e',/%2e%2E', /%2E%2e',/%2E%2E',
sequences that are only dangerous if they are segments (eg %2e is harmless if not a segment)
sequences that are only dangerous with parameters (eg /.. is only dangerous if follows by a ;) It may be best just to describe the set and leave it to implementations how to implement and configure?

markt-asf commented 3 years ago

I wonder if trying to configure this per context we are making things more complex for the large majority to support something only a very small minority will ever need. Few users will need these features. Even fewer will need different settings per web application. Most deployments are one app per Servlet container. Where there is more than one app deployed it is often because the apps are related (and will likely need the same configuration). Would it be better to say for the small number of users that need different configurations, deploy multiple instances and use a reverse proxy?

gregw commented 3 years ago

@markt-asf if we specify normalization in a way that can be applied per context, then there is nothing stopping implementations that wish to deploy only a single context from implementing it per container. However if we specify normalization in a way that is per container, then it makes it more difficult for implementations to apply per context if they wish.

I don't think the cost is significant. We only need to define normalization well enough until the point that a context can be unambiguously identified before we say any requests are rejected. If that is the default behaviour for all contexts, then there is no difference between a container that actually rejects during normalization vs one that waits until identifying the context before doing so. Since we are not intending to define how a container is configured for non-default behaviour, then impls are free to make that configuration at container level or context level, so long as the spec does not mandate container level rejections.

stuartwdouglas commented 3 years ago

IMHO per context configuration is something that should be a container specific option. I don't think there will be much demand for it in the real world, so I don't think it should be part of the spec.

We actually do support this in Undertow, but we do it by mapping the unencoded URL against context paths, once it is dispatched to the context then that particular context can decode it as required (i.e. you need to specify the context path in terms of the raw encoded bytes). The main use case we had for this was to support different URL encodings per application, although thankfully that seems to be a lot less common these days.

gregw commented 3 years ago

@stuartwdouglas I think you are misunderstanding me. I'm not saying that we should define per context configuration in the spec. I'm just saying that in the steps that we do define, we should leave any rejecting-requests steps until after the mapping-to-context step, so that an implementation has the option of implementing it either as per container or per context.

So in my comment above, I describe reject_set and throw_set, but I don't actually say if they are per container or per context.... but the fact that they are not used until after a context is mapped means that an implementation at least has the option of making them per context. [ edit - oh OK I did actually say "Contexts may ..." in my comment.... but I didn't actually need to ]

markt-asf commented 3 years ago

Ah. I think I had misunderstood your meaning as well. How about the following:

Change stage 7 to "Map URI to Context"
Move the "must reject with 400" part of step 2 to a new stage 8
New stage 9 Map URI to Servlet, Filter, security constraint etc

Given the 2nd paragraph, I think this provides enough flexibility for per container and per context configuration.

gregw commented 3 years ago

@markt-asf that works for me.

But also with the language for the new step 8, rather than say:

URIs containing the following sequences must be rejected...

I'd prefer instead it said something like:

URIs containing sequences contained in the reject-set must be rejected.... By default the reject-set contains the following sequences:.... Containers may add or remove sequences from the reject-set.

I know this is a bit more cumbersome, but it allows simpler and more standard deviations with container specific configurations - for example containers can define how to add/remove sequences from the reject-set without then needing to specify new behaviours. I'm not wedded to the actual name/text. I just think that defining a class of behaviour and then defining the specifics is more flexible than defining sequence by sequence behaviours.

stuartwdouglas commented 3 years ago

So just to clarify, would step 2 still reject by default, unless the relevant item was part of the container level leave-encoded-set?

stuartwdouglas commented 3 years ago

In particular, how would you handle the following scenario with two contexts:

/app/public/ configured to allow %2F
/app/private/ configured to disallow %2F

How do we handle a request to /app/public/..%2Fprivate ?

gregw commented 3 years ago

@stuartwdouglas No, step 2 would only decode non-reserved characters. Rejecting would be done in step 8... oh I think I missed @markt-asf numbering... I think we map first (step 8?) and then reject (step 9?).

Eitherway, in your example the URIs would be:

unchanged by step 2 because they don't have encoded non-reserved characters.
mapped to a specific context (step 8?)
then in step 9, they may or may not be rejected. By default they would both be rejected because the encoded form contained %2F. But because a context has already been identified, a specific container could have container specific per context configuration that permits %2F.

So if /public was configured to allow %2F, then /app/public/..%2Fprivate/file.txt would be mapped to /public before any decoding of reserved characters. The servletPath+pathInfo in that context would be /../private/file.txt (or possible /..%2Fprivate/file.txt if container configuration was also used to leave reserved characters encoded) and it would be up to that context to deal with that URI how it likes. I would expect that if it flowed through to the default servlet that a 404 would result because a) non-canonical b) goes outside/above docroot.

markt-asf commented 3 years ago

You at least need to map to Context before you reject. You can map to Filters and Servlets at the point you map to context or after rejection. The description would be simpler if all the mapping happens as once - although implementations still have the option to do things differently if they wish.

The handling of that examples depends on what container specific options are provided / chosen. Keeping Tomcat close to what it does now, if %2F is allowed Tomcat will leave it in encoded form so /app/public/..%2Fprivate will get mapped to the /app/public context. ServletPath + PathInfo would be /..%2Fprivate. What happens after that will depend on Servlet mappings. If it got as far as the default Servlet I'd expect a 404.

stuartwdouglas commented 3 years ago

Ok yea, I did not really think that example through as once it is mapped to a context it can't escape the context.

One issue is does raise is that you end up with a path of /../private/file.txt, which should not really be possible under our normalization rules. It would be easy to think 'we don't need to care about directory traversal attacks, because the container normalisation process handles this for us', but in this case it would not be true.

TBH I am actually having a bit of a hard time following what the current state of the proposal is, because it is spread out over different comments referencing back to the wiki.

gregw commented 3 years ago

@stuartwdouglas @markt-asf I've had a go at update the wiki. I've adding in that we have to convert octet sequences to characters with UTF-8.

I wish there was a way to comment & suggest on the wiki like there is with PRs. Maybe we should move this to another PR so we can do so?

stuartwdouglas commented 3 years ago

I think a PR would be easier. I mostly agree with everything there (including the TODO's that need more work), and I have added an additional TODO: if %2F is allowed, we may now have double '/', '/./' and '/../' segments in the URL, should stage 3 and 4 be re-run if this is allowed?

gregw commented 3 years ago

I'll move to a PR.... stand by.....

gregw commented 3 years ago

I have created #424.
Sorry @markt-asf, I've kind of hacked the wiki contents around a fair bit... and not yet necessarily for the better. Hopefully the PR tools will allow us to iterate and refine.

markt-asf commented 5 months ago

The final PR included an update to the Javadoc for HttpSservletRequest.getContextPath() that stated the return value would be canonicalized. However, this method has always returned exactly what the client sent. No canonicalization. No decoding. This looks like an error on my part when I added the final updates to the Javadoc. Do we want to remove the canonicalization text?

gregw commented 5 months ago

We've always interpreted this as returning the encoded context path, but what is configured on the server rather than what the client sent. I think returning whatever the client sent is a bit non deterministic.

markt-asf commented 5 months ago

We have a narrow window of opportunity to fix this for 6.1 since I need to re-roll the release to fix the Javadoc copyright dates. I intend to simply remove the "The path will be canonicalized as per..." sentence that was added unless we managed to agree an alternative text very quickly (today).

My understanding of the expected behaviour is that for a context path of `/aaa` this method behaves as follows:	Request URI	Context Path
`/aaa/index.html`	`/aaa`
`/a%61a/index.html`	`/a%61a`
`/bbb/../aaa/index.html`	`/bbb/../aaa`

The no decoding (2nd line) is explicit in the original Javadoc.

No canonicalization (3rd line) was implied. From memory that requirement originated from a need to be able to strip the context path from the return value of getRequestURI().

My proposal for a change is to align it with getRequestURI(). ie:

remove the "The path will be canonicalized as per..." sentence
change the first sentence to start "Returns the portion of the request URI from the first line of the HTTP request that ..."

markt-asf commented 5 months ago

I applied the simple fix that removed the "The path will be canonicalized as per..." sentence and re-tagged. We can agree the new wording for 6.2 in slightly slower time.

gregw commented 5 months ago

I'm good with the simple fix for 6.1. I'm still cautious about the idea of returning what the client sent for the context path, least not that it is actually difficult to work out. But that's an ambiguity that had existed before asked nobody had complained that jetty returns /aaa for all your examples. We only return an encoded character if it needs to be encoded in they context path. But agree we can sort that out in 6.2