Closed by sebastian-nagel 1 week ago
From the Javadoc of java.net.URI:
- The single-argument constructor requires any illegal characters in its argument to be quoted and preserves any escaped octets and other characters that are present.
- The multi-argument constructors quote illegal characters as required by the components in which they appear. The percent character ('%') is always quoted by these constructors. Any other characters are preserved.
What I do not understand: if in the multi-argument constructors % is always quoted, they cannot be used in situations where a percent-encoded character is mandatory. For example:
jshell> new URI("https://en.wikipedia.org/w/index.php?title=%26&redirect=no")
$63 ==> https://en.wikipedia.org/w/index.php?title=%26&redirect=no
jshell> new URI("https", "en.wikipedia.org", "/w/index.php", "title=%26&redirect=no", null)
$64 ==> https://en.wikipedia.org/w/index.php?title=%2526&redirect=no
jshell> new URI("https", "en.wikipedia.org", "/w/index.php", "title=&&redirect=no", null)
$65 ==> https://en.wikipedia.org/w/index.php?title=&&redirect=no
The single-argument constructor can do this, and the getRaw... methods give access to the components without decoding any percent-encoded characters:
jshell> URI uri = new URI("https://en.wikipedia.org/w/index.php?title=%26&redirect=no")
uri ==> https://en.wikipedia.org/w/index.php?title=%26&redirect=no
jshell> uri.getPath()
$71 ==> "/w/index.php"
jshell> uri.getQuery()
$72 ==> "title=&&redirect=no"
jshell> uri.getRawQuery()
$73 ==> "title=%26&redirect=no"
Oh, that's pretty bad. I misunderstood the multi-argument constructors as only quoting what needed to be quoted.
I guess the only solution is to use the single-argument constructor and, if that fails, manually quote whatever it failed on.
Fix released as v0.31.1.
I rewrote URIs.parseLeniently to try the single-argument constructor first. If that throws, it percent-encodes only what is necessary, leaving characters that are already percent-encoded alone to avoid double encoding, and then calls URI.create(). It can still throw in some cases, but at least the common scenarios like spaces and square brackets in paths and query strings should be handled.
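For readers who want to see the shape of that fallback, here is a minimal sketch of the idea. It is not the code shipped in v0.31.1: the class name LenientUriSketch, the helper encodeIllegal, and the choice of which characters to keep are assumptions made for illustration, and only java.net.URI calls are used.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the library's implementation.
class LenientUriSketch {
    // Assumption: keep RFC 3986 unreserved and reserved characters as they are,
    // except square brackets, which get encoded. This breaks IPv6 host literals.
    private static final String KEEP = "-._~:/?#@!$&'()*+,;=";
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    static URI parseLeniently(String s) {
        try {
            return new URI(s); // strict parse first, nothing is altered
        } catch (URISyntaxException e) {
            // May still throw IllegalArgumentException, e.g. for a second '#'.
            return URI.create(encodeIllegal(s));
        }
    }

    static String encodeIllegal(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp == '%') {
                if (i + 2 < s.length() && isHex(s.charAt(i + 1)) && isHex(s.charAt(i + 2))) {
                    out.append(s, i, i + 3); // existing escape: copy as-is, no double encoding
                    i += 3;
                } else {
                    out.append("%25");       // stray percent sign
                    i++;
                }
                continue;
            }
            if (isAsciiAlnum(cp) || KEEP.indexOf(cp) >= 0) {
                out.appendCodePoint(cp);
            } else {
                // Everything else is written as percent-encoded UTF-8 bytes.
                byte[] bytes = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    out.append('%').append(HEX[(b >> 4) & 0xF]).append(HEX[b & 0xF]);
                }
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    private static boolean isAsciiAlnum(int cp) {
        return (cp >= '0' && cp <= '9') || (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z');
    }

    public static void main(String[] args) {
        System.out.println(parseLeniently("https://example.org/a b?q=[x]&t=%26"));
        System.out.println(parseLeniently("https://en.wikipedia.org/w/index.php?title=%26&redirect=no"));
    }
}
```

Trying the strict constructor first means a URI that is already valid is never touched, so the fallback only runs on input the parser rejected.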
I've added a note to the javadoc of record.targetURI() encouraging the use of .target() instead unless you really need a URI instance.
Thanks! I've run the new version over the sample of URIs where this issue was uncovered. For all 9 million URIs (from a crawl run during the last two weeks): no parse errors and successful round-tripping.
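A round-trip check of this kind can be sketched as follows. This is not the script used for the crawl sample: the class RoundTripCheck and its methods are hypothetical, and plain java.net.URI stands in for URIs.parseLeniently. The rebuild helper shows how decoded components fed back into the five-argument constructor yield a different URI.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch of a round-trip check using only java.net.URI.
class RoundTripCheck {
    // True if parsing and printing the string gives back exactly the same string.
    static boolean roundTrips(String input) {
        try {
            return new URI(input).toString().equals(input);
        } catch (URISyntaxException e) {
            return false; // a parse error counts as a failed round trip
        }
    }

    // Rebuilds a URI from its *decoded* components with the multi-argument
    // constructor. The result can differ from the original.
    static URI rebuild(URI u) {
        try {
            return new URI(u.getScheme(), u.getAuthority(), u.getPath(), u.getQuery(), u.getFragment());
        } catch (URISyntaxException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String s = "https://en.wikipedia.org/w/index.php?title=%26&redirect=no";
        System.out.println(roundTrips(s));          // single-argument constructor
        System.out.println(rebuild(URI.create(s))); // decoded query, %26 is lost
    }
}
```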
If the path or query component of a URI contains percent-encoded characters, these are modified by URIs.parseLeniently(String uri), resulting in a different URI.