w3c / web-annotation

Web Annotation Working Group repository, see README for links to specs
https://w3c.github.io/web-annotation/
Other
142 stars 30 forks source link

Percent-encoding in fragment identifier syntax #443

Open Treora opened 4 years ago

Treora commented 4 years ago

In the note about selectors and states, section 5:

The values SHOULD be percent encoded rfc3986; the encoding is a MUST for characters that may make the fragment ambiguous, namely:

character   code
space       %20
=           %3D
,           %2C
#           %23

The referenced RFC3986 defines the following grammar for a fragment identifier:

fragment    = *( pchar / "/" / "?" )
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
pct-encoded   = "%" HEXDIG HEXDIG
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

In some of the note’s examples, some characters are not percent-encoded that are not actually valid in a fragment identifier.

Square brackets [ ] are are found in e.g. example 18:

http://example.org/page1.html
    #selector(type=XPathSelector,value=/html/body/p[2]/table/tr[2]/td[3]/span)

Angular brackets < > (which delimit URIs, so cannot be used inside them) are found in e.g. example 17:

http://example.org/page1.html
    #selector(type=CssSelector,value=%23elemid%20>%20.elemclass%20+%20p)

May this be worth a thorough review?

Parentheses

Moreover (feel free to factor out into a separate issue): Regarding the list of characters that must be percent-encoded, I wonder if it has been considered to include ( and ) here. Their presence may not strictly make the output ambiguous, but does make it more complex to parse.

For example, I might quote (but not crazy) in the text this is an artificial (but not crazy) example with a RangeSelector: (line breaks inserted for readability)

#selector(
  type=RangeSelector,
  startSelector=selector(type=TextQuoteSelector,exact=(but),
  endSelector=selector(type=TextQuoteSelector,exact=crazy))
)

Note that the total of three closing parentheses after crazy. I think one cannot decide how many of these are part of the cited string (one), and how many are part of the selector(…) syntax (two), without the parser either backtracking or keeping track of its recursion depth.

The proof of concept converter tool is based on PEG.js, which does not support backtracking, so if I am not mistaken cannot parse this. In fact, that tool does not allow parentheses in the values at all — see the last line of the source, where it defines validchar as any of a-zA-Z0-9<>/[]:%+@.-!$&;*_ (is this list based on a particular spec?).