Open glassfishrobot opened 13 years ago
@glassfishrobot Commented Reported by markt_asf
@glassfishrobot Commented markt_asf said: Add getRequestURL() to the methods to be reviewed
@glassfishrobot Commented This issue was imported from java.net JIRA SERVLET_SPEC-18
An additional point requiring clarification is that RFC 3986 (section 2.2) states that %nn encoded reserved characters are not equivalent to their decoded forms. i.e. "/foo/bar/" and "/foo%2Fbar" are not equivalent. This raises an additional question. When a %nn reserved character appears in any part of the URI should it be decoded or not in the return value of each of the above methods? I'm leaning towards a solution that leaves them in their %nn form and adds a utility method to the API that performs %nn decoding.
the servlet spec in section 3.1 says:
Path parameters that are part of a GET request (as defined by HTTP/1.1) are not exposed by these APIs. They must be parsed from the String values returned by the getRequestURI method or the getPathInfo method.
I think when those words were written URIs were defined by RFC2396 which well defined path parameters in section 3.3:
path = [ abs_path | opaque_part ]
path_segments = segment *( "/" segment )
segment = *pchar *( ";" param )
param = *pchar
pchar = unreserved | escaped |
":" | "@" | "&" | "=" | "+" | "$" | ","
However, that RFC has since been obsoleted by RFC3986 which now no longer has a normative definition for path parameters and instead just says:
Aside from dot-segments in hierarchical paths, a path segment is considered opaque by the generic syntax. URI producing applications often use the reserved characters allowed in a segment to delimit scheme-specific or dereference-handler-specific subcomponents. For example, the semicolon (";") and equals ("=") reserved characters are often used to delimit parameters and parameter values applicable to that segment. The comma (",") reserved character is often used for similar purposes. For example, one URI producer might use a segment such as "name;v=1.1" to indicate a reference to version 1.1 of "name", whereas another might use a segment such as "name,1.1" to indicate the same. Parameter types may be defined by scheme-specific semantics, but in most cases the syntax of a parameter is specific to the implementation of the URI's dereferencing algorithm.
So this gives us two problems. Firstly I now cannot find any current specification of HTTP URL path parameters and how they should be parsed. So I think the servlet spec could be relying on the grandfathered definition from RFC2396. IS this true? Does anybody know of a current specification for them?
Secondly, there are URIs that if RFC3986 is applied to (ie normalisation, then decoding and param handling) then the result is very ambiguous when received as a string. Examples include:
URI | RFC3986 normalised | common error |
---|---|---|
/foo/%2e%2e/bar | /foo/../bar | /bar |
/foo/aaa%2fbbb/../bar | /foo/bar | /foo/aaa/bar |
/foo/..;/bar | /foo/../bar | /bar |
I'd love to see us make some progress on this in the next iteration of the Servlet spec.
There are a lot of edge cases and interdependencies between edge cases as well as places where the meaning RFCs may not be immediately clear and/or has been misinterpreted in the past. We have a wiki so my suggestion is that we draft some text so we have a common, agreed understanding in the wiki, with examples, and then discuss the changes we wish to make to the specification document and/or the Javadoc so that the specification is aligned with that common understanding. I'll get started on that Wiki page and I'll post again here when I have something ready to start discussing.
Wiki page has been created. I think the first thing to tackle is path parameters.
https://github.com/eclipse-ee4j/servlet-api/wiki/HTTP-URIs-and-Servlet-API-methods
I think we also need to consider security issues that are introduced by ill defined path parameters. Jetty recently "improved" our URI handling to be more compliant with RFC3986 and that just resulted in a bunch of CVEs as it created many issues.
For example in the URI /foo/..;/bar
the middle segment is not ..
so by strict RFC3986 it should not be normalized. But once we strip out the path parameter the resulting path has a pure ..
segment, so it is normalized by the file system. So for example a request to /foo/..;/WEB-INF/web.xml
was not blocked as it's path started with /foo
not /WEB-INF
, but when the path was given to the file system, the web.xml was served. There are similar problems with encoded dots and / if you strictly respect RFC3986 (eg /foo/%2e%2e/WEB-INF/web.xml
& /foo%2f../WEB-INF/web.xml
) so we have reverted to not following RFC3986, specifically in that we decode and then normalize. This is better for security, but it is not good that we do not follow the RFC for URI handling.
Fundamentally getServletPath()
and getPathInfo()
are ambiguous when they return strings that may include decoded reserved characters or removed parameters.
Agreed wholeheartedly re security issues. It is also important we clarify these ambiguities and get consistent behaviour between containers because inconsistent behaviour between containers can also be a source of vulnerabilities (if an app coded for one container works securely on that container it may not be secure on a different container). With respecting to %nn encoding, I was planning on looking at that after path parameters. While the two aren't independent, I think we can look at path parameters in isolation first and then, once we have a plan for path parameters, see what impact %nn encoding has on that plan. For path parameters, I've done some testing along these lines:
URI uriC = new URI("file:///home/aaa/..;/mark");
uriB = uriC.normalize();
File fileC = new File(uriC);
System.out.println(uriC.toString());
System.out.println(fileC.isDirectory());
The result is:
file:///home/aaa/..;/mark
false
so my conclusion is that if we retain path parameters, we have to retain them all the way through the processing chain and if we do that the request will fail because the context path doesn't match, the servlet path doesn't match or the file doesn't exist. That is the second option in the wiki page option. The other option is to remove them early on in the processing. I've added a comparison of the options to the wiki.
Despite arguing consistently for option 1 on the Tomcat lists for a number of years (and implementing it) I'm beginning to think that option 2 is the better option.
@markt-asf I'm not entirely sure that we can separate discussion of ;
from issues associated with other difficult characters like %
, #
, ?
, .
, /
and \
. But I'm happy to try primarily focusing on the issues of ;
and using other characters as sanity checks.
I think an exercise that would be good to help clarify how these characters should be handled in received URIs is to better define what path
is in the ServletContext.getResource(String path)
API. It is entirely unclear to me if it is an encoded URI path, a decoded URI path or in the file system space. Defining exactly what space this string is in will help us define getServletPath()
and getPathInfo()
because I think they need to return a string in the same space.
If the path
parameter is a decoded URI path then:
getResource("/foo;oo/bar.txt")
will return a URL of something like file://path/to/webroot/foo%3boo/bar.txt
, ie a file called "bar.txt" in a directory called "foo;oo"/
is a character in the name and not a separator (possible on some, but not all file systems) then that resource cannot be accessed by getResource
as getResource("/foo/bar.txt")
will look for "bar.txt" in "/foo" and getResource("/foo%2fbar.txt")
will look for a file with a %
in its name (equiv to file://path/to/webroot/foo%252fbar.txt
with the % itself % encoded).getResource("foo\bar.txt")
does not find "bar.txt" in directory "foo", but instead looks for a file with the \
in it's name.Fundamentally, the problem with the decoded URI path string space is that once decoded, you can't tell the difference between a URIs /foo/bar
and /foo%2fbar
. Thus if we leave in parameters, we have the same problem that it is impossible to distinguish between URIs /foo;oo/bar
and /foo%3boo/bar
. Thus the strings returned from the path methods and thosed passed to the getResource method are not sufficient to identify all possible resources as they have discarded some information.
If we want to be able to access all resources, the getResource(String)
needs to take an encoded URI path. But then it is different to the path methods, which are documented as returning decoded strings, so we would make applications vulnerable to double decode attacks like /foo/%25./WEB-INF/secret.jsp
. It only makes sense to do this if the path methods also return strings in the same space so that ServletContext.getResource(request.getPathInfo())
is a safe operation. But that is a big change that would break many/most applications that do not expect they have to do their own % encoding.
One way forward I see is:
;
, %
, /
and \
in their names cannot be accessed by normal requests);
/
or \
) is by default rejected with a 400.I agree that path
in ServletContext.getResource(String path)
should be consistent with the values returned for getServletPath()
and getPathInfo()
.
As I have got my head around the evolution of path parameters in the URI RFCs from the very specific in RFC 1808 to the very generic in RFC 3986, I am now in agreement with you that we need to approach path parameters and %nn decoding in combination.
I have a way forward in mind that I think is very similar to the way forward you expressed.
I am assuming that we are starting with the path component of the URI and, apart from the process of extraction, it is unchanged. The steps are as follows:
The original path component is stored for use as the return value for getRequestURI()
. (This value is neither normalized nor decoded.)
Any jsessionid path paraemeter present is parsed, stored and removed from the path. This makes the use of URL re-writing for session tracking mostly transparent to the application.
Unreserved characters are %nn decoded.
Exact "/../"
and "/./"
sequences are normalized. As per the RFC 3986 examples, "/aaa/bbb;c=d/../eee"
will be normalized to "/aaa/eee"
. "/..;/"
would be left as-is.
The normalized path (still containing path parameters and %nn encoded reserved characters) is mapped to provide contextPath
, servletPath
and pathInfo
. These are normalized but encoded.
Three new methods getContextPathEncoded()
, getServletPathEncoded()
and getPathInfoEncoded()
provide access to these normalized, encoded values.
Another three new methods getContextPathDecoded()
, getServletPathDecoded()
and getPathInfoDecoded()
take the respective normalized and encoded values. If an unencoded reserved character is present (i.e. path parameters are present) or if an encoded '/'
is present, null
is returned. Otherwise, %nn decoding of all %nn sequences is performed and the resulting value returned from the method.
The default servlet is expected to use the getXXXDecoded()
methods to map the URI to the file system
path
in ServletContext.getResource(String path)
is equivalent to getXXXDecoded()
The current getContextPath()
, getServletPath()
and getPathInfo()
methods are deprecated with a note that their behaviour was ambiguous regarding normalization and decoding and that users should refer to the documentation for their container to determine the actual behaviour of these methods for that container.
This means files that use any reserved character other than '/'
(which isn't a valid character for a file name on either Linux or Windows) in their names will be accessible via a normal request.
Applications and frameworks that want to use path parameters can do so via the getXXXEncoded()
methods without risk of double decoding (because when we %nn decode we only decode unreserved characters).
What I'd really like to do is to avoid deprecating getContextPath()
, getServletPath()
and getPathInfo()
as they are so fundamental to the API. However, if we use them instead of getXXXDecoded()
, the null
s will likely break stuff. It we use them instead of getXXXEncoded()
the unexpected encoding will likely break stuff.
One possibility would be for the application to declare which reserved characters it wished to use as delimiters. '/'
would always be a delimiter. The steps above then change to:
...
Unreserved characters and reserved characters not declared to be used by the application characters are %nn decoded.
Exact "/../"
and "/./"
sequences are normalized. As per the RFC 2986 examples, "/aaa/bbb;c=d/../eee"
will be normalized to "/aaa/eee"
.
The normalized path (still containing path parameters and %nn encoded reserved characters declared to be used by the application) is mapped to provide contextPath
, servletPath
and pathInfo
. These are normalized but encoded.
Three new methods getContextPathEncoded()
, getServletPathEncoded()
and getPathInfoEncoded()
provide access to these normalized, encoded values.
getContextPath()
, getServletPath()
and getPathInfo()
take the respective normalized and encoded values. If an unencoded reserved character declared to be used by the application is present (i.e. path parameters are present) or if an encoded '/'
is present, null
is returned. Otherwise, %nn decoding of all %nn sequences is performed and the resulting value returned from the method.
The default servlet is expected to use getServletPath()
and getPathInfo()
to map the URI to the file system
path
in ServletContext.getResource(String path)
is equivalent to getPathInfo()
%25
is an edge case in the above that would need some more thinking about in this appraoch to avoid double decoding issues but I think it is solveable.
I agree with a lot you say... but a few points inline...
I am assuming that we are starting with the path component of the URI and, apart from the process of extraction, it is unchanged. The steps are as follows:
- The original path component is stored for use as the return value for
getRequestURI()
. (This value is neither normalized nor decoded.)- Any jsessionid path paraemeter present is parsed, stored and removed from the path. This makes the use of URL re-writing for session tracking mostly transparent to the application.
- Unreserved characters are %nn decoded.
We should also specifically prohibit %uXXXX decoding. This was a rejected proposal, but there are still some usages of it, so we still offer it as a legacy mode. Would be good to kill that completely.
- Exact
"/../"
and"/./"
sequences are normalized. As per the RFC 3986 examples,"/aaa/bbb;c=d/../eee"
will be normalized to"/aaa/eee"
."/..;/"
would be left as-is.- The normalized path (still containing path parameters and %nn encoded reserved characters) is mapped to provide
contextPath
,servletPath
andpathInfo
. These are normalized but encoded.- Three new methods
getContextPathEncoded()
,getServletPathEncoded()
andgetPathInfoEncoded()
provide access to these normalized, encoded values.
The existing getContextPath()
actually returns the encoded context path! (see javadoc that says: "The container does not decode this string"). Which is good for the number of new methods needed, but is going to make naming of new methods a nightmare!
The other thing to consider is do we really need the decoded path broken down into servletPath and pathInfo? I see so much code that is adding those back together. How about we just provide a new getEncodedPathInContext()
method that returns servletPath+pathInfo ? Actually since this is a new method and we already have a mix of encoded and decoded return types, we could just call it getPathInContext()
for simplicity and document that it returns the path with parameters any reserved characters encoded.
- Another three new methods
getContextPathDecoded()
,getServletPathDecoded()
andgetPathInfoDecoded()
take the respective normalized and encoded values. If an unencoded reserved character is present (i.e. path parameters are present) or if an encoded'/'
is present,null
is returned. Otherwise, %nn decoding of all %nn sequences is performed and the resulting value returned from the method.
I don't like even more new methods, as the API is already very fat and many of these methods might need to be replicated in request attributes, which are already too many! So instead I think we just accept that the existing getServletPath()
and getPathInfo()
are decoded. Instead I think we should offer some modal behaviour, where the application can declare how it wants them to work. I see the following modes (names are WIP):
LAX
they just return the decoded path as they do today with params stripped - thus this mode doesn't break anybody (unless they are already broken when hit with ambiguous URIs)STRICT
they return a decoded path IFF there are no encoded reserved characters nor parameters (other than JSESSIONID which we already remove), otherwise null is returned (or perhaps an exception thrown?)SAFE
the container rejects with a 400 any request with a URI that encodes reserved characters or parametersThe default mode could be LAX
if we don't want to break anybody, but I'm happy to allow the container to pick the default mode if a particular webapp does not indicate a preference.
- The default servlet is expected to use the
getXXXDecoded()
methods to map the URI to the file system
Firstly, the default servlet cannot map URIs directly to the file system and must go via a getResource
like API so it can access META-INF/resources
in fragment jars etc. This also gives it the same behaviour as JspServlets etc.
Secondly using the decoded path is not sufficient to resolve the ambiguous URIs like /foo%2fbar
. Instead I think we need to provide a new method URI ServletContext.getUri(String encodedPath)
and the default servlet needs to serve resources equivalent to URI ServletContext.getUri(request.getPathInContext())
. Note this switches to URI
rather than the horrid URL
class, but I guess it could also be done with Path
, or maybe we provide both getPath
and getUri
methods?
path
inServletContext.getResource(String path)
is equivalent togetXXXDecoded()
See above... is equivalent to the string returned byget[Servlet]Path[Info]()
The current
getContextPath()
,getServletPath()
andgetPathInfo()
methods are deprecated with a note that their behaviour was ambiguous regarding normalization and decoding and that users should refer to the documentation for their container to determine the actual behaviour of these methods for that container.
See above. I'd prefer not to deprecate these methods as the vast majority of applications use them and have no need to ever see an path that contains a decoded reserved character. Thus the vast majority of applications will continue to work perfectly well with no code changes even if their container switches to STRICT
or SAFE
mode. The few applications that do need handle encoded reserved characters are probably working with getRequestURI
anyway, but they can be modified to use the new methods if they like.
This means files that use any reserved character other than
'/'
(which isn't a valid character for a file name on either Linux or Windows) in their names will be accessible via a normal request.
Remember that resources can be served from META-INF/resources
in a fragment jar from WEB-INF/lib
. These jars can be produced on one OS and run on another and thus could contain all sorts of nasty characters that are not legal in the local filesystem. Also OS's can mount foreign filesystems, so I just don't think we can assume that a resource name will never contain /
. I admit that such resources should be vanishingly rare and I can't really see a use for them (other than as some kind of exploit bomb droped into somebody elses WEB-INF/lib
. Thus I'm good with an approach that basically just rejects such URIs unless the application declares that it really wants to use them.
Applications and frameworks that want to use path parameters can do so via the
getXXXEncoded()
methods without risk of double decoding (because when we %nn decode we only decode unreserved characters).
But we do need to provide something like my proposed URI ServletContext.getUri(String encodedPath)
and/or a ServletContext.getResourceEncoded(String)
method, that handle exactly only those encoded reserved characters and throws IAE if it starts seeing other % encoded characters. i.e. it accepts exactly the strings returned from the new getXxxEncoded()
method(s).
What I'd really like to do is to avoid deprecating
getContextPath()
,getServletPath()
andgetPathInfo()
as they are so fundamental to the API. However, if we use them instead ofgetXXXDecoded()
, thenull
s will likely break stuff. It we use them instead ofgetXXXEncoded()
the unexpected encoding will likely break stuff.One possibility would be for the application to declare which reserved characters it wished to use as delimiters.
'/'
would always be a delimiter. The steps above then change to:
URI handling should be portable and we really don't need any doubt about what are reserved characters or not, specially as the getResource[En|De]coded
style APIs are expected to be fed from the results of the getPath[En|De]coded
style APIs.
So in summary, I like the decoding approach you've outlined, but think we can achieve the outcome we want with:
String HttpServletRequest.getPathInContext()
and URI ServletContext.getUri(String encodedPath)
(and/or Path ServletContext.getPath(String encodedPath)
). get[Servlet]Path[Info]()
methods so that they either: act as they do today; return null for ambiguous URIs; are not even possible to call for ambiguous URIs. The other good thing about this approach is that the modes, once defined can be applied by a container to existing applications written against the current servlet-api versions and 99.9% of webapps will work perfectly fine in STRICT
or SAFE
mode with no code changes. Yes, I agree with pretty much all of that. It is great to feel that some progress is being made in this area.
I wasn't aware %uXXXX
was even a possibility. I'd be happy to exclude that and any other weirdness. Something like:
The only permitted encoding scheme for URIs is the
pct-encoded
scheme defined in RFC 3986.
I'd forgotten that HttpServletRequest.getContextPath()
returned the encoded path. I'm surprised given the complete pain that was to implement. Maybe I blocked it out ;). I agree that complicates naming.
Regarding splitting servletPath
and pathInfo
, the feedback I have from the Spring framework is that they do need them to be separate. I'm not against new methods like getPathInContext()
but neither am I convinced it is necessary at this point.
I hear your objection regarding new methods and implications for request attributes. That is a good reason to try and minimise the number of new methods.
I agree the default servlet needs to use something like getResource()
. My point was more about what gets passed to that method. I'm aiming for a solution that supports the widest possible set of filenames with the minimum number of special cases that require additional handling. My hope is that we can define behaviour in such a way that as many of the ambiguities as possible are designed out. For example so that using a path parameter (other than jsessionid) with a static resource would result in a 404.
My thinking around allowing applications to effectively remove some characters from the reserved set was to assist with migration with what was going to be a stricter regime. If the consensus is that that would create more ambiguity/problems than it solves, I'm happy to leave that as a potential container specific feature.
I like the idea of allowing the application to operate in different modes. That provides a lot more flexibility without having to add a new set of methods for each mode. Regarding each mode:
LAX
: Maybe call this LEGACY
? I agree this should be the "same as Servlet 5" mode. My only comment is that getContextPath()
won't be decoded in this mode. The behaviour of this mode may be slightly different for different containers depending on how they implemented the methods in prior versions.
DECODED
: new mode. Should be very similar to LAX
(depending on container behaviour in prior versions) but like getServletPath()
and getPathInfo()
, getContextPath()
also has parameters stripped, is decoded and normalized.
STRICT
: I'd prefer returning null
rather than throwing an exception if %nn
and/or parameters are present as I think this comes under "Use exceptions only for exceptional conditions"
SAFE
: I think this is very similar to STRICT
(since I'm assuming the application will treat a 'null' being returned in STRICT
as an error condition) but moves the rejection of the request earlier in the processing. I'd suggest a 404 rather than a 400 since that suggests the request is not a valid request rather than something the server is chossing not to respond to. I've been trying to think of a use case where STRICT
would be chosen over SAFE
and I can't. Did you have a use case in mind?
ENCODED
: new mode. Non-reserved characters %nn
decoded. Exact "/../" and "/./" sequences are normalized. Encoded reserved characters and path parameters remain.
After some further experiments with new File(URI)
, I agree there is a need for something like URI ServletContext.getUri(String encodedPath)
to bridge the gap between the possibly encoded and/or containing parameters path returned by getServletPath()
and getPathInfo()
and methods like ServletContext.getResource(String)
and friends. However, I think it needs to return String
so that can then be passed into the getResource
methods. I'm not sure about the name but something like String ServletContext.getResourcePath(path)
where:
path
contains any reserved characters apart from /
the method returns null
(this effectivly blocks parameters)path
contains %2f
the method returns null
. We could add a boolean parameter to allow this but I can't think of a use case for this but I can think of plenty of security exploits if it were allowed.If the returned string was then used in ServletContext.getResource(String)
and friends, it should mean that files that happen to use reserved characters in their paths would be accessible.
If we included the behaviour of ServletContext.getResource(String)
and friends in the definition of the operating modes discussed above, we could potentially remove the need for the new conversion method by having the container do this conversion inside ServletContext.getResource(String)
and friends. It would only be needed for the new ENCODED
mode.
I think the use-case for a mode like STRICT
vs SAFE
is a webapp that has multiple servlets, some of which that are prepared to deal with encoded paths, but others that have not been updated to do so. In such a mixed mode deployment, the updated servlets could use new methods to access the encoded path, but if any requests with encoded paths arrived at non-updated servlets, then they would either throw or get a null if they used the existing path methods.
Ultimately, this scenario shows the limits of the modal approach, in that the mode really doesn't apply to the whole webapp, but should apply servlet by servlet and filter by filter. However I don't think that fine grained approach would be workable, as what do you do if a request is mapped to a servlet that is ENCODED
mode, but also matches a filter that declares it is in DECODED
mode. Thus I think the modes need to be container wide. But I also think that having a new API to get the resesrved-only-encoded form would allow filters/servlets to be written that will work correctly regardless of mode.
So I like your modes (and their names), but they only apply to the existing path methods and to the interpretation of the existing getResource(String) methods.
We then need a new methods like String HttpServletRequest.getUriPathInContext()
and URI ServletContext.getURI(String uriPathInContext)
that work on unambiguous reserved-only-encoded forms that will work in exactly the same way regardless of the mode. We then can have:
SAFE
is used for 99.99% of applications that don't care about strangely encoded pathsSTRICT
can be used by applications that have some filters/servlets updated to use new APIs but some filters/servlets that are not updated but need to be protected from strange encodings.ENCODED
can be used by applications that have updated all filters/servlets to be able to handle reserved-only-encoded forms.LEGACY
is for the 0.001% of applications that really need the current bad behaviour DECODED
is the one I can't really see a use-case for, but it's existance is kind of symmetric with ENCODED
So we could get away with just SAFE
, STRICT
and LEGACY
if we wanted to minimize the number of modes.
I've been thinking about the modes and I'd really like to keep them to a minimum.
I think we can get by with just a boolean legacy mode in which the getServletPath
and getPathInfo
methods return fully decoded paths and the getResource
and getRealPath
methods expect fully decoded paths as well.
When not in legacy mode then getContextPath
, getServletPath
and getPathInfo
return strings in which only URI reserved characters and ;
are % encoded. The getResource
and getRealPath
methods expect similarly reserved-only encoded strings (any encoded non-reserved character is an IAE).
The behaviours of other modes described above can be obtained by filters or perhaps new security constraint types. To assist with that, it would be good to have a new method boolean HttpServletRequest.isFullyDecoded()
that would return true if there are no encoded reserved characters in the URI (name is WIP).
The I still think we need the new String HttpServletRequest.getPathInContext()
and URI ServletContext.getURI(String pathInContext)
that always work in encoded-non-reserved mode regardless of the legacy mode.
We could even avoid having an explicit legacy mode switch and instead just use the version of the web.xml file (with no web.xml defaulting to 6.0 semantics).
In summary, in a 6.0 webapp, all the following methods would work in only-reserved-characters-encoded mode: getContextPath()
, getServletPath()
, getPathInfo()
, getPathInContext()
, getResource(String)
, getRealPath(String)
, getURI(String)
. In legacy mode, the pre-existing methods work as they always did, but the new methods work in only-reserved-characters-encoded mode.
I've been thinking about the impact this discussion has on mapping requests. But first I'd like to respond to your points.
I agree reducing modes is good. I've no objection to dropping STRICT
and SAFE
. The more I thought about it, the less comfortable I was with the idea of rejecting valid URIs. LEGACY
and DECODED
should be very similar. My thinking in adding DECODED
was to cover the case where a container's current behaviour was closer to ENCODED
. However, I haven't seen any evidence that any container behaves like that so DECODED
could be dropped on the basis the use case for it is purely theoretical. That leaves LEGACY
and ENCODED
as you suggested. I'm not sure about using a boolean to control the mode. I'd like to leave open the possibility of containers adding their own custom mode(s) and/or us adding an additional mode as this discussion evolves. An Enum was my first thought but that isn't extensible. Maybe a string where values that start X-
are reserved for container specific extensions and all other names are reserved for the specification.
For LEGACY
mode, I think we are in agreement if by "fully decoded paths" you mean:
jsessionid
if any, removing it from the URI%2f
For ENCODED
mode, the above becomes:
jsessionid
if any, removing it from the URIfor
%`Which brings us nicely to the topic of mapping. In LEGACY
mode mapping is easy. In ENCODED
mode, any of the path components (context path, servlet path and path info) may have path parameters. We need to figure out how to handle this. Possible options are:
<url-pattern>
elementsSpring supports path parameters in the servlet path and path info. If we want to support that (and I think we should) that rules out option 3. Option 2 would mean defining how path parameters work. My concern is that there will be multiple different approaches to this currently in use. Therefore, I like option 1 which treats the path parameters as opqaue data that the application processes as it sees fit. One note on security. The URI must not be normalised after removal of path parameters else segments like /..;/
could be used for a path traversal attack.
We also need to decide if the vales in <url-pattern>
elements should have reserved characters encoded or not. I think either will work (I haven't thought of a case that does yet) but for consistency my current preference is for encoding reserved characters here.
I had been thinking that the mode would be per web application. Your comments made think about the practicalities of per servlet or per container. I think a per container approach will create issues as different applications will inevitably update to the newer approach at different rates. With a mapping approach that ignores path parameters, I think we can safely consider mode at the web application level. I think we could implement mode at the Servlet level. I agree with your assessment that this would complicate things for Filters that are used with Servlets in different modes. I think those compications are surrmountable - essentially the servlet needs to expose the mode it is configured to use so the Filter can adjust accordingly. There are also a bunch of other ways applications could address (or just avoid) this issue. From an implementation perspective, I'm not sure per Servlet is that more (any?) more complicated than per web application.
An advantage of setting the mode per Servlet is that I think that removes the need for the additional methods you proposed. I am rather wary of adding extra methods and a solution that addresses the current ambiguities without the need for extra methods is very tempting. I also like flexibility this provides for larger applications that may need to migrate to the new way of doing things in stages.
I think the final part of this puzzle is HttpServletMapping
. If we choose option 1 for mapping requests then that would mean the current behaviour of HttpServletMapping
is unchanged.
What was your thinking in requring an IAE if getResource()
and friends were called with a path that included a %nn encoded non-reserved character?
I think we have at least mentioned all the areas affected by these changes. Can you think of any we've missed? Once we have the majority of this agreed I plan to start implementing this in Tomcat to confirm that what we come up with is practical to implement. I suspect that will result in a little fine-tuning.
I'll digest your main responses tomorrow.... but I'll answer the easy one below.
What was your thinking in requring an IAE if
getResource()
and friends were called with a path that included a %nn encoded non-reserved character?
The idea is that those methods should take precisely the strings returned from the path methods. Currently they accept decoded paths, so if there is a % character in the arg, it is considered part of the file name. If we start allowing arbitrary characters to be encoded, then we are more likely to break the few apps that do actually pass % characters. If we only accept % encoding for precisely the reserved characters that the path methods leave encoded then that a) forces us to very well define those characters; b) makes minimal changes to how those methods are used today (for the 99.999% case); c) any modes introduced are exactly the sane for what the path methods return and what the resource methods take.
I've been thinking about the impact this discussion has on mapping requests. But first I'd like to respond to your points.
That leaves
LEGACY
andENCODED
as you suggested. I'm not sure about using a boolean to control the mode. I'd like to leave open the possibility of containers adding their own custom mode(s) and/or us adding an additional mode as this discussion evolves. An Enum was my first thought but that isn't extensible. Maybe a string where values that startX-
are reserved for container specific extensions and all other names are reserved for the specification.
How this is handled all depends on where we see the future. Do we want an arbitrary Servlet-9.0 application to be able to continue to choose between the modes for the path methods - ie LEGACY is an entirely valid mode into the future; or is that the future of the path methods is to return the reserve-encoded form that we define in 6.0 and we only offer a legacy mode to ease the transition. I favour the later as carrying legacy modes over multiple major releases will result in the need to version the legacy modes.
So if we say the path methods from 6.0 onwards are intended to return the reserve-encoded form, then I think it is entirely valid to use the existence of a < 6.0 web.xml to trigger the legacy mode. We can then add some verbage to say that a container is free to all the mode to be specifically configured and to override the web.xml. If we go this way, then we don't even need to define named modes a LEGACY
and ENCODED
. Instead we just define the bahaviour of these methods in 6.0 and note that they behaive differently if there is a <6.0 web.xml present.
For
LEGACY
mode, I think we are in agreement if by "fully decoded paths" you mean:
- extract
jsessionid
if any, removing it from the URI- remove all remaining path parameters
- decode everything apart from
%2f
Nope - I think everything is decoded, including %2f
. Do you currently leave %2f
encoded?
- normalize
- map to Servlet
- determine paths based on the Servlet the request was mapped to
For
ENCODED
mode, the above becomes:
- extract
jsessionid
if any, removing it from the URI- decode all %nn sequences excluding RFC 3986 reserved characters and '%25
for
%`- normalize
- map to Servlet
- determine paths based on the Servlet the request was mapped to
Yep, modulo treatment of ;
below.
Which brings us nicely to the topic of mapping. In
LEGACY
mode mapping is easy. InENCODED
mode, any of the path components (context path, servlet path and path info) may have path parameters. We need to figure out how to handle this. Possible options are:
- ignore path parameters when mapping
- introduce some form of URI pattern matching that accounts for path parameters in
<url-pattern>
elements- only allow path parameters in path info, rejecting requests with a 404 if they appear elsewhere
- soemthing else
Spring supports path parameters in the servlet path and path info. If we want to support that (and I think we should) that rules out option 3. Option 2 would mean defining how path parameters work. My concern is that there will be multiple different approaches to this currently in use. Therefore, I like option 1 which treats the path parameters as opqaue data that the application processes as it sees fit. One note on security. The URI must not be normalised after removal of path parameters else segments like
/..;/
could be used for a path traversal attack.
What about option 5. Treat ; characters other than as part of a JSESSIONID as just a normal character in a segment?
According to the RFCs, path parameters are no longer a thing generally. So we should take out the one for session ID and leave all others in the path. For mapping, we would then allow a filter/servlet to be mapped to something like /foo;bar
where ;
is treated no differently to any other character. If we really REALLY wanted to do more, then perhaps we could introduce a new mapping like /foo;*
which would match any URI like /foo
, /foo/
, /foo;anything
, but not /foo/bar
or /foo;anything/bar
. We don't support any mid patterns like /foo/*/bar
or /foo*/bar
already, so I don't see why we should start for parameters.
We also need to decide if the vales in
<url-pattern>
elements should have reserved characters encoded or not. I think either will work (I haven't thought of a case that does yet) but for consistency my current preference is for encoding reserved characters here. +1 And I'd prefer it to be an error if any other characters are encoded.I had been thinking that the mode would be per web application. Your comments made think about the practicalities of per servlet or per container. I think a per container approach will create issues as different applications will inevitably update to the newer approach at different rates. With a mapping approach that ignores path parameters, I think we can safely consider mode at the web application level. I think we could implement mode at the Servlet level. I agree with your assessment that this would complicate things for Filters that are used with Servlets in different modes. I think those compications are surrmountable - essentially the servlet needs to expose the mode it is configured to use so the Filter can adjust accordingly. There are also a bunch of other ways applications could address (or just avoid) this issue. From an implementation perspective, I'm not sure per Servlet is that more (any?) more complicated than per web application.
Hmmmmm maybe. But I think this just complicates things like asynchronous threads that call the path methods. Already it is a problem that an async thread looking at a request that is passed to a request dispatcher will see different values in a race with the handling of the request in and out of the dispatcher. If we make this per servlet, then an async thread will see different encoding values in a race. The real fix for this is to get rid of object-identity-wrapping so that async threads can be passed a container-wrapped request and the values they see from the path methods would never change.... but that is another discussion.
Ultimately, it comes down to if this mode is something valid going forward or just some legacy help to get applications ported to the new 6.0 behaviour. I think the later, so I think having a context wide mode is fine. If there are really some filters/servlets that can run in that mode then we can provide utility wrappers that will take a 6.0 request running in the new mode and decode any reserved characters to provide the <6.0 behaviour (in fact I was already thinking of providing this mode with a container provided wrapper prior to the first dispatch - hence not making core code horridly modal).
An advantage of setting the mode per Servlet is that I think that removes the need for the additional methods you proposed. I am rather wary of adding extra methods and a solution that addresses the current ambiguities without the need for extra methods is very tempting. I also like flexibility this provides for larger applications that may need to migrate to the new way of doing things in stages.
I think the new methods I'm proposing can be justified even without this problem. Too frequently are servletPath
and pathInfo
recombined to a pathInContext
. The resource methods really should be usable with URI
and/or Path
rather than the antique URL
. They a good improvements anyway, and I just really a subject in this discussion because they can be specified with the new better behaviour and never need to be modal.
I think the final part of this puzzle is
HttpServletMapping
. If we choose option 1 for mapping requests then that would mean the current behaviour ofHttpServletMapping
is unchanged. Or option 5.I think we have at least mentioned all the areas affected by these changes. Can you think of any we've missed? Once we have the majority of this agreed I plan to start implementing this in Tomcat to confirm that what we come up with is practical to implement. I suspect that will result in a little fine-tuning.
How about prior to implementing in Tomcat we create a branch of the api/spec and a draft PR. We can word craft the spec/javadoc/api changes we are describing above in that PR and that will ensure we are really on the same page (I think we are currently in the same chapter of the same book, so that's good). In parallel we can try out that branch in Tomcat and Jetty.
What was your thinking in requiring an IAE if
getResource()
and friends were called with a path that included a %nn encoded non-reserved character?The idea is that those methods should take precisely the strings returned from the path methods. Currently they accept decoded paths, so if there is a % character in the arg, it is considered part of the file name. If we start allowing arbitrary characters to be encoded, then we are more likely to break the few apps that do actually pass % characters. If we only accept % encoding for precisely the reserved characters that the path methods leave encoded then that a) forces us to very well define those characters; b) makes minimal changes to how those methods are used today (for the 99.999% case); c) any modes introduced are exactly the sane for what the path methods return and what the resource methods take.
For getResource()
and friends, I think the way we map the supplied path to a resource should mirror the way we map a request path to a Servlet. Same normalization, decoding etc. I don't see how an IAE would work with such an approach.
How this is handled all depends on where we see the future. Do we want an arbitrary Servlet-9.0 application to be able to continue to choose between the modes for the path methods - ie LEGACY is an entirely valid mode into the future; or is that the future of the path methods is to return the reserve-encoded form that we define in 6.0 and we only offer a legacy mode to ease the transition. I favour the later as carrying legacy modes over multiple major releases will result in the need to version the legacy modes.
So if we say the path methods from 6.0 onwards are intended to return the reserve-encoded form, then I think it is entirely valid to use the existence of a < 6.0 web.xml to trigger the legacy mode. We can then add some verbage to say that a container is free to all the mode to be specifically configured and to override the web.xml. If we go this way, then we don't even need to define named modes a
LEGACY
andENCODED
. Instead we just define the bahaviour of these methods in 6.0 and note that they behave differently if there is a <6.0 web.xml present.
I'm happy to treat LEGACY
mode as a transitional mode. Given that each container probably does something slightly different at the moment, I think there are multiple benefits to treating legacy compatibility - including how it is enabled - as a container specific extension.
For
LEGACY
mode, I think we are in agreement if by "fully decoded paths" you mean:
- extract
jsessionid
if any, removing it from the URI- remove all remaining path parameters
- decode everything apart from
%2f
Nope - I think everything is decoded, including
%2f
. Do you currently leave%2f
encoded?
Yes. We have had various requests for this. Most recently: https://bz.apache.org/bugzilla/show_bug.cgi?id=62459
The short version is that there needs to be a way to pass /
where it isn't treated as the segment delimiter. %2f
is the only way to do that.
But since this is LEGACY
mode then assuming that we agree to leave legacy support as container specific then different containers continuing to take different approaches is OK.
Which brings us nicely to the topic of mapping. In
LEGACY
mode mapping is easy. InENCODED
mode, any of the path components (context path, servlet path and path info) may have path parameters. We need to figure out how to handle this. Possible options are:
- ignore path parameters when mapping
- introduce some form of URI pattern matching that accounts for path parameters in
<url-pattern>
elements- only allow path parameters in path info, rejecting requests with a 404 if they appear elsewhere
- something else
Spring supports path parameters in the servlet path and path info. If we want to support that (and I think we should) that rules out option 3. Option 2 would mean defining how path parameters work. My concern is that there will be multiple different approaches to this currently in use. Therefore, I like option 1 which treats the path parameters as opqaue data that the application processes as it sees fit. One note on security. The URI must not be normalised after removal of path parameters else segments like
/..;/
could be used for a path traversal attack.What about option 5. Treat ; characters other than as part of a JSESSIONID as just a normal character in a segment? According to the RFCs, path parameters are no longer a thing generally.
I have a very different reading of the RFCs. My reading is that they RFCs have been getting increasingly generic in what is allowed as a path parameter. See the final paragraph of RFC 3986, section 3.3
So we should take out the one for session ID and leave all others in the path. For mapping, we would then allow a filter/servlet to be mapped to something like
/foo;bar
where;
is treated no differently to any other character. If we really REALLY wanted to do more, then perhaps we could introduce a new mapping like/foo;*
which would match any URI like/foo
,/foo/
,/foo;anything
, but not/foo/bar
or/foo;anything/bar
. We don't support any mid patterns like/foo/*/bar
or/foo*/bar
already, so I don't see why we should start for parameters.
Path parameters aren't common, but where they are used, typical use cases are more complex than this. For example; /foo;a=1;b=2/bar
. The Spring docs have more examples. Given the current usages, I can't see a solution, other than option 1, that would work.
We also need to decide if the vales in
<url-pattern>
elements should have reserved characters encoded or not. I think either will work (I haven't thought of a case that does yet) but for consistency my current preference is for encoding reserved characters here. +1 And I'd prefer it to be an error if any other characters are encoded.
Agreed.
I had been thinking that the mode would be per web application. Your comments made think about the practicalities of per servlet or per container. I think a per container approach will create issues as different applications will inevitably update to the newer approach at different rates. With a mapping approach that ignores path parameters, I think we can safely consider mode at the web application level. I think we could implement mode at the Servlet level. I agree with your assessment that this would complicate things for Filters that are used with Servlets in different modes. I think those compications are surrmountable - essentially the servlet needs to expose the mode it is configured to use so the Filter can adjust accordingly. There are also a bunch of other ways applications could address (or just avoid) this issue. From an implementation perspective, I'm not sure per Servlet is that more (any?) more complicated than per web application.
Hmmmmm maybe. But I think this just complicates things like asynchronous threads that call the path methods. Already it is a problem that an async thread looking at a request that is passed to a request dispatcher will see different values in a race with the handling of the request in and out of the dispatcher. If we make this per servlet, then an async thread will see different encoding values in a race. The real fix for this is to get rid of object-identity-wrapping so that async threads can be passed a container-wrapped request and the values they see from the path methods would never change.... but that is another discussion.
Ultimately, it comes down to if this mode is something valid going forward or just some legacy help to get applications ported to the new 6.0 behaviour. I think the later, so I think having a context wide mode is fine. If there are really some filters/servlets that can run in that mode then we can provide utility wrappers that will take a 6.0 request running in the new mode and decode any reserved characters to provide the <6.0 behaviour (in fact I was already thinking of providing this mode with a container provided wrapper prior to the first dispatch - hence not making core code horridly modal).
As above, I think making legacy mode support outside the spec container specific simplifies things a lot.
I think the new methods I'm proposing can be justified even without this problem. Too frequently are
servletPath
andpathInfo
recombined to apathInContext
. The resource methods really should be usable withURI
and/orPath
rather than the antiqueURL
. They a good improvements anyway, and I just really a subject in this discussion because they can be specified with the new better behaviour and never need to be modal.
I'm coming around to the idea of these. I'm not quite convinced yet though as I wonder how much of the requirement for pathInContext
is caused by the inconsistencies we are working on solving here.
I think the final part of this puzzle is
HttpServletMapping
. If we choose option 1 for mapping requests then that would mean the current behaviour ofHttpServletMapping
is unchanged. Or option 5.
See above.
I think we have at least mentioned all the areas affected by these changes. Can you think of any we've missed? Once we have the majority of this agreed I plan to start implementing this in Tomcat to confirm that what we come up with is practical to implement. I suspect that will result in a little fine-tuning.
How about prior to implementing in Tomcat we create a branch of the api/spec and a draft PR. We can word craft the spec/javadoc/api changes we are describing above in that PR and that will ensure we are really on the same page (I think we are currently in the same chapter of the same book, so that's good). In parallel we can try out that branch in Tomcat and Jetty.
Yes, I think that lines up with what I was thinking. Given the history in this area, making sure we really are in agreement and not interpreting things differently is key.
@gregw Does this bring us close enough to start on a PR?
Sure! I'll put something up later today that we can all throw some mud at and see how it comes out! Stand by.....
This comment is an attempt to concisely summarise all the discussions and possible solutions for this issue prior to a meeting to discuss them.
Background The issues discussed here are a result of a combination of factors:
.
or ..
and not as the result of any prior decoding or stripping of parameters.pathInfo
, however I believe it is common practise to exclude all parameters from methods like pathInfo
.The Problems The problems that result from the current specification can be roughly grouped into 4 broad categories below :
What URIs are rejected by the container for accessing restricted resources?
The following are examples of URIs that should be rejected by the container, even though a strict reading of RFC3986 might indicate that they are not actually requesting the protected resource:
/context/foo/%2e%2e/WEB-INF/web.xml
/context/foo/..;/WEB-INF/web.ml
/context/WEB-INF%2Fweb.xml
/context/WEB-INF%00/web.xml
/context/WEB-INF%5Cweb.xml
/context/%2e%2e/%2e%2e/%2e%2e/%2e%2e/etc/passwd
/context/WEB-IN~1.DIR/web.xml
These URIs can be rejected by the container because the identified resource is known to be protected and/or outside of the docroot. However there is nothing in the specification that clearly states these should be rejected or how they should be (404? 400?). This defence can also have issues with reasonable usage of symlinks.
How URIs are matched to security constraint for resources?
Consider the following URIs against a forbidden security constraint of /secret/*
:
/context/foo/%2e%2e/secret/private.xml
/context/foo/..;/secret/private.xml
/context/secret%2F/private.xml
/context/secret%00/private.xml
/context/secret%5Cprivate.xml
/context/sEcRet/private.xml
(maybe out of scope for this issue?)/context/SECRET~1.DIR/private.xml
(maybe out of scope for this issue?)These do not match the /secret/*
security constraint, but are typically rejected by the containers default servlet because the identified resource is known to by a different canonical name than the one requested. But if the request is not handled by the containers default servlet, then this protection does not apply.
How might common libraries incorrectly interpret the strings returned from getServletPath()
and getPathInfo()
?
If the examples in 2. are not considered to match the security constraint then the strings returned from the request method may be passed to library methods such as File
, Path
, or other third party libraries. If the URIs are fully decoded, then all of the example URIs contain characters that are likely to be misinterpreted by the library methods:
%2e%2e
or ..;
will be seen as just ..
segments and treated as a relative dot-segment by File
and Path
%2F
will look like /
and will be treated as a separated by all uri/path methods%5C
will be seen as a \
and may be treated as a separate on some file systems%00
or similar special character may be ignored or treated specially by some file systems or libraries that call into native code. ;
or @
that may have similar problems.How the container methods like getResource(String)
or `getRequestDispatcher(String) might interpret strings obtained either from the container, from the application or a combination of both?
URIs that are deemed safe by 1., 2. and 3. may still be decoded to paths that are not safe to pass to some servlet-api methods:
ServletContext.getResource(String)
as the implementation is likely to do further normalization. Some containers protect from some of these cases by making getResource not return a resource if requested by a non-canonical name.ServletContext.getRequestDispatcher(String)
may be vulnerable to double decode attacks:
/context/foo/%252E%252E/WEB-INF/web.xml
which is decoded to /foo/%2e%2e/WEB-INF/web.xml
and then used to obtain a request dispatcher from getRequestDispatcher(String)
, where the string is interpreted as allowing characters to be encoded./context/secret%3fa=1/private.xml
will pass the /secret/*
security constraint but be decoded to /secret?a=1
. (maybe out of scope for this issue?)/context/SECRET~1.DIR/private.xml
will pass the /secret/*
security constraint but is 8.3 name for /secret
directory. (maybe out of scope for this issue?)Possible Solutions These are some possible solutions, some of which are already implemented by some containers:
%2F
or %2e%2e
in its URI is rejected. This will make existing applications safe, but could prohibit some applications from validly accessing these URIs. Some containers already do this.getRequestURI()
or new APIsgetResource
to only return resources if requested by canonical path. This can protect applications but can cause frustrations with applications that wish to run on case insensitive file systems or to use synlinks. Some containers already do this.Edit: added case and 8.3 name alias examples. May be out of scope for this issue, but are kind of related as can have some of the same defences.
@markt-asf @stuartwdouglas I've sent an invite for a call time... if that works for you guys, we can open up to all.
With regards to the aims of the changes, I think lots of far ranging ideas have been discussed, from minimal to significant changes. I think we need to focus the discussion on the following:
That works for me. Thanks for setting it up. I agree we need answers to those questions. As we have found, everything is inter-related so it is hard to discuss any one of those in isolation. I think there are a couple of relatively simple issues we can discuss first that - assuming we agree an approach - should reduce the complexity of what is left. My current thinking is based around looking at how we get from a received URI to mapping a request to a servlet, filters and security constraints. If you think of this process as a series of stages then answers to the questions 1 to 3 above then take the form of "take the URI as it is at stage X and return the relevant segments" or "the method expects a URI in the form expected at stage Y".
The following table might also focus our discussion.
Consider the following URIs with a forbidden security constraint at /secret/*
.
The following files exist in the docroot in a case insensitive file system:
/dispatch/*
that forwards request to the string return from pathInfo()
URI | Container | get[Servlet]Path[Info] | DefaultServlet |
---|---|---|---|
/public/file.txt | /public/file.txt | 200 | |
/public//file.txt | /public//file.txt | 200 or 404 non canonical ? [1] | |
/PUBLIC/file.txt | /PUBLIC/file.txt | 200 or 404 non canonical ? | |
/public%2Ffile.txt | 400? [3] | /public/file.txt or throw [4] | 200 which file ? |
/public%5Cfile.txt | 400? [3] | /public\file.txt | 200 which file? |
/public%00/file.txt | /public␀/file.txt | 404 | |
/public/./file.txt | /public/file.txt | 200 | |
/public/.;/file.txt | 400? [3] | /public/file.txt | 200? |
/public/%2e/file.txt | 400? [3] | /public/./file.txt | 404 non canonical? |
/public/%2e;/file.txt | 400? [3] | /public/./file.txt | 404 non canonical? |
/../docroot/public/file.txt | 404 | ||
/public/dir/../file.txt | /public/file.txt | 200 | |
/public//../file.txt | /public/file.txt | 200 | |
/public/dir/..;/file.txt | 400? [3] | /public/file.txt | 404 non canonical? |
/public/dir/%2e%2e/file.txt | 400? [3] | /public/../file.txt or throw? [4] | 404 non canonical? |
/public/dir/%2e%2e;/file.txt | 400? [3] | /public/../file.txt or throw? [4] | 404 non canonical? |
/WEB-INF/web.xml | 404 | ||
/web-inf/web.xml | 404 | ||
/WEB-IN~1.DIR/web.xml | 404 non canonical ? | ||
/WEB-INF;/web.xml | 404 | ||
/WEB-INF%2Fweb.xml | 404??? | ||
/WEB-INF%5Cweb.xml | /WEB-INF\web.xml | 404 or 404 non canonical? | |
/WEB-INF%00/web.xml | /WEB-INF␀/web.xml | 404 non canonical ? | |
/WEB-INF/./web.xml | 404 | ||
/public/../WEB-INF/web.xml | 404 | ||
/public/..;/WEB-INF/web.xml | 404 | ||
/public/%2e%2e/WEB-INF/web.xml | /public/../WEB-INF/web.xml | 404 non canonical? | |
/public/%2e%2e;/WEB-INF/web.xml | /public/../WEB-INF/web.xml | 404 non canonical? | |
/secret/private.xml | 403 | ||
/SeCreT/private.xml | /SeCreT/private.xml | 404 non canonical | |
/SECRET~1.DIR/private.xml | 404 non canonical? | ||
/secret;/private.xml | 403 | ||
/secret%2Fprivate.xml | 403? [2] or 400? | /secret/private.xml | ??? |
/secret%5Cprivate.xml | 403? [2] | /secret\private.xml | ??? |
/secret%00/private.xml | 400? | /secret␀/private.xml | 404 non canonical? |
/./secret/private.xml | 403 | ||
/.;/secret/private.xml | 403 | ||
/%2e/secret/private.xml | 400? | /./secret/private.xml or throw? | 404 non canonical? |
/%2e;/secret/private.xml | 400? | /./secret/private.xml or throw? | 404 non canonical? |
/public/../secret/private.xml | 403 | ||
/public/..;/secret/private.xml | 403 | ||
/public/%2e%2e/secret/private.xml | 400? | /public/../secret/private.xml or throw? | 404 non canonical? |
/public/%2e%2e;/secret/private.xml | 400? | /public/../secret/private.xml or throw? | 404 non canonical? |
/dispatch/public/file.txt | /public/file.txt | 200 | |
/dispatch/public%2Ffile.txt | 400? | /public/file.txt | ? |
/dispatch/public%5Cfile.txt | 400? | /public\file.txt | ? |
/dispatch/public%252Ffile.txt | 400? | /public%2Ffile.txt | ? |
/dispatch/WEB-INF/web.xml | /WEB-INF/web.xml | 200 | |
/dispatch/secret/private.xml | /secret/private/xml | 200 | |
/dispatch/%2E%2E/%2E%2E/etc/password | 400? | /../../etc/password | 500 |
[1] Should the 404 non canonicals be redirections to the canonical name or is that leaking info? [2] Can we somehow determine the canonical name prior to applying security constraints? [3] For suspect URIs the container can reject with 400. But if container is configured to allow through, then do we need to define behaviour? [4] If suspect URIs are allowed into the container, should we throw if unsafe methods are used to access them?
Considering how the current HTTP/1.1 spec has no support for the concept of a path parameter, should it's role be HTTP version specific? Such as having HTTP/2 and HTTP/3 should not treat the ';' as a path parameter?
@markt-asf @stuartwdouglas I started editing the wiki page but soon realised I have more questions that need to be discussed before I can write proposed text.
I think we are agreed that we wish to define the default behaviour, plus typical deviations from that which may be allowed by some containers. Given that, I believe those configurations should be a on a per context basis, as different webapps on the same server may have different criteria for URI handling. Given that, we need to define exactly when URI canonicalization takes place with regards to mapping to contexts. Moreover, I don't think we can define rejections within the canonicalization process, because some contexts may not reject... however a context may reject a request due to something found during prior canonicalization (as an optimization rather than re-parsing).
In general, I think the process needs to be something like:
/
and %
(must always contain %
if non empty). See examples below.%2f
, \
, %5c
, /.;
, /..;
, , /%2e
, /%2e%2e
, perhaps some more. Containers can configure a context's reject-set.getServletPath
and getPathInfo
methods to throw rather than return. By default this will be identical to the reject-set and thus by default an application will never throw from these methods. Only if a char/sequence/segment is removed from the reject-set, but not removed from the throw-set may these methods throw.So some examples:
%2F
returned from getPathInfo()
then the container would be configured to add '/' to the leave-encoded-set and the context configured to remove %2f
from the reject-set and throw-set of the context that would handle them. Other contexts would still be safe from %2F
as by default it is in their reject set.%2F
but only from getRequestUri()
, then the container would be configured to add '/' to the leave-encoded-set and the context configured to remove %2f
from the reject-set, but left in the throw-set.I don't like having the leave-encoded-set a container rather than context configuration, but I can't see how else to do it.
Also I have a concern/question about getContextPath()
and how that is specified to return the non-decoded context. What does that mean for a URI like /foo/../context/servlet/info
? Is the context path for that request /context
or /foo/../context
?
I need to think a bit more how to best specify the reject-set and throw-set, as they need to deal with complexities like:
/%2e%2e',
/%2e%2E', /%2E%2e',
/%2E%2E', %2e
is harmless if not a segment)/..
is only dangerous if follows by a ;
)
It may be best just to describe the set and leave it to implementations how to implement and configure?I wonder if trying to configure this per context we are making things more complex for the large majority to support something only a very small minority will ever need. Few users will need these features. Even fewer will need different settings per web application. Most deployments are one app per Servlet container. Where there is more than one app deployed it is often because the apps are related (and will likely need the same configuration). Would it be better to say for the small number of users that need different configurations, deploy multiple instances and use a reverse proxy?
@markt-asf if we specify normalization in a way that can be applied per context, then there is nothing stopping implementations that wish to deploy only a single context from implementing it per container. However if we specify normalization in a way that is per container, then it makes it more difficult for implementations to apply per context if they wish.
I don't think the cost is significant. We only need to define normalization well enough until the point that a context can be unambiguously identified before we say any requests are rejected. If that is the default behaviour for all contexts, then there is no difference between a container that actually rejects during normalization vs one that waits until identifying the context before doing so. Since we are not intending to define how a container is configured for non-default behaviour, then impls are free to make that configuration at container level or context level, so long as the spec does not mandate container level rejections.
IMHO per context configuration is something that should be a container specific option. I don't think there will be much demand for it in the real world, so I don't think it should be part of the spec.
We actually do support this in Undertow, but we do it by mapping the unencoded URL against context paths, once it is dispatched to the context then that particular context can decode it as required (i.e. you need to specify the context path in terms of the raw encoded bytes). The main use case we had for this was to support different URL encodings per application, although thankfully that seems to be a lot less common these days.
@stuartwdouglas I think you are misunderstanding me. I'm not saying that we should define per context configuration in the spec. I'm just saying that in the steps that we do define, we should leave any rejecting-requests steps until after the mapping-to-context step, so that an implementation has the option of implementing it either as per container or per context.
So in my comment above, I describe reject_set and throw_set, but I don't actually say if they are per container or per context.... but the fact that they are not used until after a context is mapped means that an implementation at least has the option of making them per context. [ edit - oh OK I did actually say "Contexts may ..." in my comment.... but I didn't actually need to ]
Ah. I think I had misunderstood your meaning as well. How about the following:
Given the 2nd paragraph, I think this provides enough flexibility for per container and per context configuration.
@markt-asf that works for me.
But also with the language for the new step 8, rather than say:
URIs containing the following sequences must be rejected...
I'd prefer instead it said something like:
URIs containing sequences contained in the reject-set must be rejected.... By default the reject-set contains the following sequences:.... Containers may add or remove sequences from the reject-set.
I know this is a bit more cumbersome, but it allows simpler and more standard deviations with container specific configurations - for example containers can define how to add/remove sequences from the reject-set without then needing to specify new behaviours. I'm not wedded to the actual name/text. I just think that defining a class of behaviour and then defining the specifics is more flexible than defining sequence by sequence behaviours.
So just to clarify, would step 2 still reject by default, unless the relevant item was part of the container level leave-encoded-set?
In particular, how would you handle the following scenario with two contexts:
/app/public/
configured to allow %2F/app/private/
configured to disallow %2FHow do we handle a request to /app/public/..%2Fprivate
?
@stuartwdouglas No, step 2 would only decode non-reserved characters. Rejecting would be done in step 8... oh I think I missed @markt-asf numbering... I think we map first (step 8?) and then reject (step 9?).
Eitherway, in your example the URIs would be:
So if /public
was configured to allow %2F, then /app/public/..%2Fprivate/file.txt
would be mapped to /public
before any decoding of reserved characters. The servletPath+pathInfo in that context would be /../private/file.txt
(or possible /..%2Fprivate/file.txt
if container configuration was also used to leave reserved characters encoded) and it would be up to that context to deal with that URI how it likes. I would expect that if it flowed through to the default servlet that a 404 would result because a) non-canonical b) goes outside/above docroot.
You at least need to map to Context before you reject. You can map to Filters and Servlets at the point you map to context or after rejection. The description would be simpler if all the mapping happens as once - although implementations still have the option to do things differently if they wish.
The handling of that examples depends on what container specific options are provided / chosen. Keeping Tomcat close to what it does now, if %2F is allowed Tomcat will leave it in encoded form so /app/public/..%2Fprivate
will get mapped to the /app/public
context. ServletPath + PathInfo would be /..%2Fprivate
. What happens after that will depend on Servlet mappings. If it got as far as the default Servlet I'd expect a 404.
Ok yea, I did not really think that example through as once it is mapped to a context it can't escape the context.
One issue is does raise is that you end up with a path of /../private/file.txt
, which should not really be possible under our normalization rules. It would be easy to think 'we don't need to care about directory traversal attacks, because the container normalisation process handles this for us', but in this case it would not be true.
TBH I am actually having a bit of a hard time following what the current state of the proposal is, because it is spread out over different comments referencing back to the wiki.
@stuartwdouglas @markt-asf I've had a go at update the wiki. I've adding in that we have to convert octet sequences to characters with UTF-8.
I wish there was a way to comment & suggest on the wiki like there is with PRs. Maybe we should move this to another PR so we can do so?
I think a PR would be easier. I mostly agree with everything there (including the TODO's that need more work), and I have added an additional TODO: if %2F is allowed, we may now have double '/', '/./' and '/../' segments in the URL, should stage 3 and 4 be re-run if this is allowed?
I'll move to a PR.... stand by.....
I have created #424.
Sorry @markt-asf, I've kind of hacked the wiki contents around a fair bit... and not yet necessarily for the better. Hopefully the PR tools will allow us to iterate and refine.
The final PR included an update to the Javadoc for HttpSservletRequest.getContextPath() that stated the return value would be canonicalized. However, this method has always returned exactly what the client sent. No canonicalization. No decoding. This looks like an error on my part when I added the final updates to the Javadoc. Do we want to remove the canonicalization text?
We've always interpreted this as returning the encoded context path, but what is configured on the server rather than what the client sent. I think returning whatever the client sent is a bit non deterministic.
We have a narrow window of opportunity to fix this for 6.1 since I need to re-roll the release to fix the Javadoc copyright dates. I intend to simply remove the "The path will be canonicalized as per..." sentence that was added unless we managed to agree an alternative text very quickly (today).
My understanding of the expected behaviour is that for a context path of /aaa this method behaves as follows: |
Request URI | Context Path |
---|---|---|
/aaa/index.html |
/aaa |
|
/a%61a/index.html |
/a%61a |
|
/bbb/../aaa/index.html |
/bbb/../aaa |
The no decoding (2nd line) is explicit in the original Javadoc.
No canonicalization (3rd line) was implied. From memory that requirement originated from a need to be able to strip the context path from the return value of getRequestURI()
.
My proposal for a change is to align it with getRequestURI()
. ie:
I applied the simple fix that removed the "The path will be canonicalized as per..." sentence and re-tagged. We can agree the new wording for 6.2 in slightly slower time.
I'm good with the simple fix for 6.1. I'm still cautious about the idea of returning what the client sent for the context path, least not that it is actually difficult to work out. But that's an ambiguity that had existed before asked nobody had complained that jetty returns /aaa for all your examples. We only return an encoded character if it needs to be encoded in they context path. But agree we can sort that out in 6.2
The specification is unclear for some or all of the above methods how the following should be handled.