Closed jonenst closed 3 years ago
I've edited your comment to improve the formatting. You might want to check out this Mastering Markdown guide for future reference.
Thanks. Sorry for the bad formatting.
I found another place where '{' slips through (and in addition to '}' slipping through, the wrong name must be passed as the key in the uriVariable map to avoid getting an exception):
jshell> UriComponentsBuilder.fromUriString("/{foo{}}").encode().build().expand("bar")
$132 ==> /bar}
jshell> UriComponentsBuilder.fromUriString("/{foo{}}").encode().build().expand(Map.of("foo{","bar"))
$133 ==> /bar}
# the key in the map should be foo{} !
jshell> UriComponentsBuilder.fromUriString("/{foo{}}").encode().build().expand(Map.of("foo{}","bar"))
| Exception java.lang.IllegalArgumentException: Map has no value for 'foo{'
# throws :(
This one is because UriTemplateEncoder chooses the variable name to be "foo{}" by correctly tracking the nesting level and so correctly doesn't encode this variable name, but the expandUriComponent uses Pattern.compile("\{([^/]+?)}") to find names to replace and so stops at the first '}'. Normal regexps can't track arbitrary high nesting levels, so there's no easy fix for this one while still allowing names of variables to contain correctly nested brackets. Note that as soon as a variable name is incorrectly nested, it becomes considered as a literal (with it's surround '{' and '}'):
jshell> UriComponentsBuilder.fromUriString("/{foo{}}").encode().build().expand("bar")
$140 ==> /bar}
# correctly nested
jshell> UriComponentsBuilder.fromUriString("/{foo{}").encode().build().expand("bar")
$141 ==> /%7Bfoo%7B%7D
# incorrectly nested, treated as a literal, bar is not used
jshell> UriComponentsBuilder.fromUriString("/{foo}}").encode().build().expand("bar")
$142 ==> /bar%7D
# incorrectly nested, closes too soon, the remaining '}' is treated as a literal
There are more weird manifestations of the discrepancy the UriTemplateEncoder code and the expandUriComponent code:
jshell> UriComponentsBuilder.fromUriString("/{foo{}{}}").encode().build().expand("bar")
| Exception java.lang.IllegalArgumentException: Not enough variable values available to expand '}'
# this should have worked, a single variable "foo{}{}" gets its value from the positional argument.
# But it throws about a bogus second variable.
jshell> UriComponentsBuilder.fromUriString("/{foo{}{}}").encode().build().expand("bar", "xxx")
$144 ==> /barxxx
# correctly nested, so UriTemplateEncoder treats it as one unencoded variable
# but the regexp in expandUriComponent matches 2 variables. Chaos ensues.
We are aware of the fact that it's impossible to correctly parse a string into a URL using a regex. But I would argue a even stronger case: it's impossible to correctly parse URL strings at all, for a number of reasons:
The typical developers idea of what a URL is differs significantly from the relevant specifications. We have no intention on trying to enforce these specs all of our users, and instead try to be more pragmatic instead. For instance, this leads to us supporting url template placeholders where strictly they are not allowed.
Secondly, there is URL encoding. It is impossible to distinguish between encoded and unencoded components, and even if you could, there are too many encoding choices available to guess correctly 100% of the time. Again, we could have decided to only support the specifications, but instead we try to be more lenient where we can.
Overall, the goal of the UriComponentsBuilder
is to be useful for Spring Framework itself, and its users. The goal is not to strictly conform to URL RFCs, though we try to do so where we can. The parse
methods on the UriComponentsBuilder
are essentially best guesses, with absolutely no guarantee of parsing the given string correctly, because that is impossible anyway—whether using a regex or not.
Given the above, I am not at all surprised that the builder has difficulty parsing strings like \"<>\\^
|][%{}or
/{foo{}{}}`. In fact, I am surprised the former is even accepted at all, because it's not a valid URI. Both simply have too much ambiguity, as we do allow placeholders. If we would resolve this and not allow placeholders in those places, we would break the vast majority of the Spring apps out there. So that is not an option.
However, there is a way to resolve this. As the name of the class suggests, the UriComponentsBuilder
can also be used as a builder, where you can supply individual components (scheme
, host
, path
, etc) one at a time. Used in this way, there is less ambiguity to solve, so the builder generally does a better job.
I think I see where the issue is and it's related to the encode()
at the UriComponentsBuilder
level. That encodes the template while it still contains URI variables, while URI variables themselves are encoded more strongly later (i.e. all URI reserved characters as opposed to just legal ones) when expanded. This means that if you don't expand, you can end up submitting a URI with what was expected to be placeholders.
@poutsma Thanks for taking the time to explain. I used \"<>\\^|][%{}
just to show that all other characters work fine, not to trigger an obscure behavior. using "{}" alone is enough to trigger the bug. All in all, maybe you're right, and your explanation can be added to the docs ?
Thank you all for your time
Another case where '{' slips through is the UriComponentsBuilder.toUriString() shortcut when we forgot a variable (but supplied at least one other variable) :
jshell> UriComponentsBuilder.fromUriString("/{a}/").uriVariables(Map.of("a","X")).toUriString()
$39 ==> "/X/"
# OK
jshell> UriComponentsBuilder.fromUriString("/{a}/").uriVariables(Map.of("b","X")).toUriString()
$36 ==> "/{a}/"
# invallid
On closer look, UriComponents#toUriString()
is nothing but concatenation of the current state of the individual parts:
* Concatenate all URI components to return the fully formed URI String.
* <p>This method does nothing more than a simple concatenation based on
* current values. That means it could produce different results if invoked
* before vs after methods that can change individual values such as
* {@code encode}, {@code expand}, or {@code normalize}.
In other words it is expected that variables may not have been expanded nor encoded and that would be reflected in the resulting String. The toUri()
method on the other hand does validate and reject illegal characters.
You're right there are some corner cases with nested and/or mismatched placeholders but I don't see any easy solutions.
If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.
hi @rstoyanchev , what feedback do you want from me ?
I agree that the behavior of toUriString() is clearly documented as beeing only a concatenation depending on whether you already called .encode() or not, so that's good. But to me the problem are those edge cases where '{' slips through even though you called .encode(). So to me, the solution is either to fix the edge cases, or to document the edge cases. And if the edge cases are documented, then also document that calling .toUri() instead of toUriString() does a final validation (by redecoding and reencoding) which allows to ensure that the user is not in one of the edge cases ?
I see 2 edge cases: unexpanded variables, and weird mismatches between the regex code and the escaping code.
unexpanded variables:
# using encode_template but not calling expand() but the template
# contains curly brackets (maybe the template author didn't think of variables)
# emtpy or not empty matched curly brackets slip through
jshell> UriComponentsBuilder.fromUriString("{}").encode().build().toUriString()
$2 ==> "{}"
jshell> UriComponentsBuilder.fromUriString("{a}").encode().build().toUriString()
$3 ==> "{a}"
# In this case, document that when you don't call expand(),
# you should encode the result instead of encode_template
# ie ".build().encode()" instead of "encode().build()"
# The current docs currently says that .encode().build is almost always better
# but when you are not calling expand() I think it's worse.
# using encode_template and calling expand, but still empty curly brackets slip through
UriComponentsBuilder.fromUriString("{}").encode().build().expand().toUriString()
$5 ==> "{}"
# document this or fix it ?
# The UriComponentsBuilder.toUriString() shortcut uses
# encode_template but allows for missing variables:
jshell> UriComponentsBuilder.fromUriString("/{a}/").uriVariables(Map.of("b","X")).toUriString()
$15 ==> "/{a}/"
# fix it to disallow missing variables ? Or document that missing variables
# will result in invalid chars in the result string ?
# I don't see how a resulting string with missing variables is useful,
# you can't safely use it as a new template
# because it's been partially encoded, so you will either
# have doubly encoded parts or non encoded parts.
weird stuff, unfixable ? Remove the nesting parsing code and document that variable names should not contain '{' and '}' ?
# mismatch between the parsing during encoding and the regexp for the variables
# last '}' slips through
jshell> UriComponentsBuilder.fromUriString("/{foo{}}").encode().build().expand("bar").toUriString()
$7 ==> "/bar}"
Thanks for the feedback.
I suppose we can tighten the encoding of the template, which has to work around placeholder, to disallow empty, nested, or mismatched braces, and expect those to be expanded via URI variables instead if actually needed.
That should only leave cases where there are actual URI variables, or what looks like it, including the case with partial expansion via UriComponentsBuilder#uriVariables
, which by the way is a feature we can't disallow it.
That's probably good enough. I mean if code somehow forgets to do expand and ends up with a String that contains braces, wouldn't it still need to construct a java.net.URI
which is needed at the end in most places? That should then reject it.
I don't know all the possible APIs, but the old school plain java using java.net.URL.openConnection doesn't check:
jshell> URL url = new URL("http://localhost:12345/foo{"); HttpURLConnection urlConnection = (HttpURLConnection) url.openConnection(); urlConnection.getInputStream().readAllBytes() ;
I see
$ nc -l 12345
GET /foo{ HTTP/1.1
User-Agent: Java/11.0.9
Then again, fixing what is easy to fix and documenting the rest seems like a good solution
I've made updates to UriComponentsBuilder#encode
to not treat placeholders without text or with nested curly braces as URI variables and to encode them as a URI template literal instead. This also aligns better with the expanding which stops at the first closing '}'. The documentations is also updated to recommend UriComponents#encode
when no variables are expanded.
Thanks, I think you found the best solution.
Using the
UriComponentsBuilder
, the{
and}
characters can end up in the result if you are not careful (they are the only ones from the invalid printable ascii chars which do this, most probably because they are used for templates, like in{city}
).Using
toUri()
instead oftoUriString()
at least does check and throws an exception in the bad case.Using
toUri()
and removing.encode()
actually makes it encode:With
buildAndExand()
, things are a bit safer, but still there are cases where it lets unencoded chars through.