w3c / sparql-dev

SPARQL dev Community Group
https://w3c.github.io/sparql-dev/

Problems with numeric character escapes #77

Open cygri opened 5 years ago

cygri commented 5 years ago

TL;DR: The SPARQL 1.1 spec as written has three character escape mechanisms. One of them is weird, differs from Turtle, and makes injection attacks very hard to mitigate. Fortunately almost no one quite follows the spec in this case, and actual implementations are often quite close to Turtle and thus easier to secure. The SPARQL spec should be changed to match common implementations or to match Turtle.

Character escaping in SPARQL 1.1

SPARQL has three character escape mechanisms:

  1. String literal escapes: \", \', \n, \r, \t, \b, \f, \\
    • Only allowed in string literals
    • Prevents the special meaning of the escaped character
    • Defined as part of the SPARQL grammar in EBNF
  2. Prefixed name escapes: \ followed by one of ~.-!$&'()*+,;=/?#@%_
    • Only allowed in the local part of a prefixed name
    • Prevents the escaped character from terminating the prefixed name
    • Defined as part of the SPARQL grammar in EBNF
  3. Numeric escape sequences: \uXXXX or \UXXXXXXXX with X being hex digits
    • Allowed anywhere in the query text
    • Does not change the meaning of the escaped character (hence better termed encoding, not escaping)
    • Applied “before parsing by the grammar defined in EBNF”
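To make mechanism 3 concrete, here is a minimal sketch of the pre-parse step as the spec text describes it: every numeric escape in the query text is replaced by its character before the grammar is applied. The helper name decodeNumericEscapes is an assumption made for illustration and is not taken from any implementation.

```javascript
// Sketch of the SPARQL 1.1 pre-parse step: numeric escapes are replaced
// anywhere in the query text, before the grammar sees it.
// decodeNumericEscapes is a hypothetical helper name, not a library function.
// Note: String.fromCodePoint throws a RangeError above U+10FFFF.
const decodeNumericEscapes = query =>
  query.replace(
    /\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})/g,
    (match, hex4, hex8) => String.fromCodePoint(parseInt(hex4 || hex8, 16))
  );

// After decoding, the encoded letter is indistinguishable from a plain "x".
console.log(decodeNumericEscapes('ex:e\\u0078ample')); // ex:example

// The encoded quote becomes a real delimiter, so the literal no longer balances.
console.log(decodeNumericEscapes('"\\u0022"')); // """
```

Because the replacement happens before tokenizing, the decoded character keeps whatever special meaning it has in the grammar, which is why "encoding" describes it better than "escaping".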

Mechanisms 1 and 2 are fine and fairly consistently implemented. Mechanism 3, on the other hand, not only allows obfuscating queries, but also interacts with the other mechanisms in an interesting way. I was not able to find any implementation that actually implements what the spec says.

Differences between SPARQL and Turtle

String escapes and prefixed name escapes work the same in Turtle. But numeric escapes are different in Turtle:

  1. Turtle only allows them in string literals and in the <…> form of IRIs
  2. In Turtle, they remove the special meaning of a character. An encoded double quote \u0022 in SPARQL is still a string delimiter. In Turtle, it is not.

It is desirable to have SPARQL and Turtle syntax closely aligned. (The working groups for RDF 1.1 and SPARQL 1.1 tried to align this more, but this was not possible due to the different WG schedules.)

Some interesting test cases

I compared a number of implementations against the SPARQL 1.1 and Turtle specs.

  1. "\u": Is this a two-character string literal, or a syntax error due to incomplete numeric escape?
    • SPARQL 1.1, Turtle, Jena, Virtuoso, RDF4J, Blazegraph, Comunica, rdflib: syntax error
  2. # \u and # \\u: comments to be ignored, or syntax errors due to incomplete numeric escape?
    • SPARQL 1.1: spec can be read either way
    • Turtle, Virtuoso, Rasqal, Comunica, rdflib: comments to be ignored
    • Jena, Blazegraph, RDF4J: \\u is ignored but \u is syntax error (?!?)
  3. "\x": two-character string literal, or syntax error because no such string escape?
    • SPARQL 1.1, Turtle, Jena, Blazegraph, RDF4J, Virtuoso, Comunica, rdflib: syntax error
    • Rasqal: x (?!? - bug report)
  4. "\u0022": string literal containing a double quote, or syntax error due to unterminated literal?
    • SPARQL 1.1, Jena, Blazegraph, RDF4J, rdflib: syntax error
    • Turtle, Virtuoso, Rasqal, Comunica: "
  5. ex:e\u0078ample: prefixed name “ex:example”, or syntax error because numeric escapes not allowed here?
    • SPARQL 1.1, Jena, Blazegraph, RDF4J, Virtuoso, Rasqal, rdflib: ex:example
    • Turtle, Comunica: syntax error
  6. "\u005ct": two-character string literal “\t”, or one-character string literal consisting of a tab character?
    • SPARQL 1.1, Jena, Blazegraph, RDF4J, rdflib: one character: tab
    • Turtle, Virtuoso, Comunica, Rasqal: two characters: \t
  7. "\\u0074": six-character string literal “\u0074”, or one-character string literal consisting of a tab character?
    • SPARQL 1.1, rdflib: one character: tab
    • Turtle, Jena, Blazegraph, Virtuoso, RDF4J, Rasqal, Comunica: six characters: \u0074
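Test cases 6 and 7 can be reproduced with a rough model of the two readings. The sketch below works only on the body of a string literal (the text between the quotes) and ignores everything else a real parser does; the names turtleValue and sparqlSpecValue are assumptions made for illustration.

```javascript
// Single-character string escapes shared by SPARQL and Turtle.
const STRING_ESCAPES = {
  t: '\t', n: '\n', r: '\r', b: '\b', f: '\f',
  '"': '"', "'": "'", '\\': '\\',
};
const fromHex = hex => String.fromCodePoint(parseInt(hex, 16));
const stringEscape = (match, ch) => {
  if (ch in STRING_ESCAPES) return STRING_ESCAPES[ch];
  throw new SyntaxError(`unknown escape ${match}`);
};

// Turtle-style reading: one left-to-right pass in which numeric escapes
// and string escapes are peers, so a decoded character is never re-read.
const turtleValue = body =>
  body.replace(
    /\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|(.))/gs,
    (match, h4, h8, ch) => (h4 || h8 ? fromHex(h4 || h8) : stringEscape(match, ch))
  );

// SPARQL 1.1 as written: numeric escapes are decoded first, then the
// result is read again for string escapes.
const sparqlSpecValue = body =>
  body
    .replace(/\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})/g, (m, h4, h8) => fromHex(h4 || h8))
    .replace(/\\(.)/gs, stringEscape);

// Test case 6: the literal body is \u005ct
console.log(turtleValue('\\u005ct').length);     // 2
console.log(sparqlSpecValue('\\u005ct').length); // 1

// Test case 7: the literal body is \\u0074
console.log(turtleValue('\\\\u0074').length);     // 6
console.log(sparqlSpecValue('\\\\u0074').length); // 1
```

The only difference between the two functions is whether decoding happens in the same pass as string escapes or in a separate earlier pass, and that alone accounts for both divergent results.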

Implications for SPARQL injection attacks

When an application creates SPARQL query strings by string concatenation, it is potentially vulnerable to SPARQL injection attacks. This attack vector is analogous to SQL injection.

To be safe, an application must apply the appropriate escape sequences to user data before building the query string. How to do this for Turtle is pretty obvious. For example, in JavaScript (ES6), for use with the triple-quote """...""" and '''...''' string literal forms:

const escapeForTurtle = s => s.replace(/(["'\\])/g, '\\$1');

This just finds any single quotes, double quotes, and backslashes, and adds a backslash before them.
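As a usage sketch, the escaped value is spliced between triple quotes when the query string is built. The user input and the query shape below are made up for illustration.

```javascript
// Escape quotes and backslashes, as described above.
const escapeForTurtle = s => s.replace(/(["'\\])/g, '\\$1');

// Hypothetical user input that tries to close the literal early.
const userInput = 'x""" . ?s ?p ?o . # ';

// The escaped value goes inside a triple-quoted (long) string literal.
const query = `SELECT ?s WHERE { ?s ?p """${escapeForTurtle(userInput)}""" }`;

console.log(query);
// SELECT ?s WHERE { ?s ?p """x\"\"\" . ?s ?p ?o . # """ }
```

The triple-quoted form matters because the function leaves line breaks untouched: long string literals may contain raw newlines, whereas the single-quoted "..." and '...' forms may not, so a multi-line value would produce a syntax error there.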

For SPARQL, if one follows the spec, this is much more difficult to get right because backslashes can be “smuggled in” as \u005c. Coming up with an escape function that is not just safe but also keeps user data intact without unintentionally losing or doubling some characters is unreasonably hard. (I tried and gave up.)
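A small sketch of how user data gets altered under the spec-as-written reading: the Turtle-style escape doubles the backslash, but the pre-parse decoding step then pairs the second backslash with the following uXXXX. The helper decodeNumericEscapes is a hypothetical model of that step, not code from any implementation.

```javascript
// Turtle-style escaping of quotes and backslashes.
const escapeForTurtle = s => s.replace(/(["'\\])/g, '\\$1');

// Hypothetical model of the SPARQL 1.1 pre-parse decoding step.
const decodeNumericEscapes = text =>
  text.replace(
    /\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})/g,
    (match, hex4, hex8) => String.fromCodePoint(parseInt(hex4 || hex8, 16))
  );

// User data: the six characters \u0074, meant to be stored verbatim.
const escaped = escapeForTurtle('\\u0074');
console.log(escaped);                       // \\u0074
console.log(decodeNumericEscapes(escaped)); // \t  (the grammar then reads a tab)

// User data: the six characters \u0041.
console.log(decodeNumericEscapes(escapeForTurtle('\\u0041'))); // \A  (not a valid string escape)
```

In the first case the stored value silently becomes a tab; in the second the query is rejected, because \A is not a string escape.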

Fortunately, as seen in test case 7 above, almost no vendor actually implements the spec as it is written, and what most vendors actually do implement is safe to use with escapeForTurtle().

Summary

There is a strong case for changing the SPARQL spec for numeric escapes to either match the Turtle spec, or at least to match the Jena/RDF4J behaviour where \\u0022 is not a numeric escape.

dbooth-boston commented 5 years ago

I noticed that the draft charter says that changes to W3C Recommendations are out of scope. https://w3c.github.io/sparql-12/charter.html But this issue looks like a good candidate for inclusion. Thoughts?

afs commented 5 years ago

The history of \u processing has a certain amount of accident about it. It went in early in SPARQL 1.0 and was not revisited.

I wanted to change it at SPARQL 1.1 but the compatibility goal was considered stronger (this was before RDF 1.1 was active in the area - at least, Turtle went the better way and it was a conscious decision by the RDF 1.1 WG). I'm a bit surprised at the high level of compliance to the SPARQL spec; I expected the results to show more of the Turtle-like behaviour.

We have the opportunity here in the CG to include a wider range of users - normally it is the user-implementers speaking up. If that shows little concern for the compatibility issues, then this would be a good issue/feature to document the change.

cygri commented 5 years ago

I updated the issue description with test results for RDFLib. It actually implements the SPARQL spec to the letter in this regard, the only implementation I found to date to do so. This slightly weakens the argument for changing the spec to match common implementations.

kasei commented 5 years ago

@cygri FWIW, Attean also aligns with the SPARQL 1.1 text for all 7 of your interesting cases.

barremian commented 1 year ago

Hi @cygri, could you please elaborate on the following statement:

For example, in Javascript (ES6), for use with the triple-quote """...""" and '''...''' string literal forms:

Do you mean to say the escaped string must be enclosed in triple-quotes in Javascript? Do the triple-quotes have any significance in relation to preventing injections? Thanks