iipc / jwarc

Java library for reading and writing WARC files with a typed API
Apache License 2.0
48 stars 9 forks source link

URIs.parseLeniently modifies URIs containing percent-encoded characters #91

Closed sebastian-nagel closed 1 week ago

sebastian-nagel commented 1 week ago

If the path or query component of a URI contains percent-encoded characters, these are modified by URIs.parseLeniently(String uri), resulting in a different URI:

jshell> import org.netpreserve.jwarc.URIs;

jshell> URI uri1 = new URI("https://dx.doi.org/10.1038%2F35008096")
uri1 ==> https://dx.doi.org/10.1038%2F35008096

jshell> URI uri2 = URIs.parseLeniently("https://dx.doi.org/10.1038%2F35008096")
uri2 ==> https://dx.doi.org/10.1038%252F35008096

jshell> uri1.equals(uri2)
$4 ==> false

More examples:

https://www.prijmeni.cz/Kr%C3%A1kora
https://enfr.dict.cc/?s=c.-%C3%A0-d.
sebastian-nagel commented 1 week ago

From the Javadoc of java.net.URI:

  • The single-argument constructor requires any illegal characters in its argument to be quoted and preserves any escaped octets and other characters that are present.
  • The multi-argument constructors quote illegal characters as required by the components in which they appear. The percent character ('%') is always quoted by these constructors. Any other characters are preserved.

What I do not understand: if in the multi-argument constructors % is always quoted, they cannot be used in situations where a percent-encoded character is mandatory. For example:

jshell> new URI("https://en.wikipedia.org/w/index.php?title=%26&redirect=no")
$63 ==> https://en.wikipedia.org/w/index.php?title=%26&redirect=no

jshell> new URI("https", "en.wikipedia.org", "/w/index.php", "title=%26&redirect=no", null)
$64 ==> https://en.wikipedia.org/w/index.php?title=%2526&redirect=no

jshell> new URI("https", "en.wikipedia.org", "/w/index.php", "title=&&redirect=no", null)
$65 ==> https://en.wikipedia.org/w/index.php?title=&&redirect=no

The single-argument constructor can do this and the getRaw... methods allow to access the components with no percent-encoded characters decoded:

jshell> URI uri = new URI("https://en.wikipedia.org/w/index.php?title=%26&redirect=no")
uri ==> https://en.wikipedia.org/w/index.php?title=%26&redirect=no

jshell> uri.getPath()
$71 ==> "/w/index.php"

jshell> uri.getQuery()
$72 ==> "title=&&redirect=no"

jshell> uri.getRawQuery()
$73 ==> "title=%26&redirect=no"
ato commented 1 week ago

Oh, that's pretty bad. I misunderstood the multiple argument constructor as only quoting what needed to be quoted.

I guess the only solution is to use the single argument constructor and if that fails manually quote whatever it failed on.

ato commented 1 week ago

Fix released as v0.31.1.

I rewrote URIs.parseLeniently to use the single-argument constructor and if that throws percent encode just what's necessary, avoiding double encoding characters that are already percent encoded and then doing a URI.create(). It can still throw in some cases but at least the common scenarios like spaces and square brackets in paths and query strings should be handled.

I've added a note to the javadoc of record.targetURI() encouraging the use of .target() instead unless you really need a URI instance.

sebastian-nagel commented 1 week ago

Thanks! I've run the new version over the sample of URIs where this issue was uncovered. For all 9 million URIs (from a crawl run during the last two weeks): no parse errors and successful round-tripping.