rdfjs / N3.js

Lightning fast, spec-compatible, streaming RDF for JavaScript
http://rdf.js.org/N3.js/

Unicode escape sequence causes problems in the Solid ecosystem #248

Closed jaxoncreed closed 3 years ago

jaxoncreed commented 3 years ago

TLDR

Is there a reason why N3.js converts the traditional JavaScript escape sequence (i.e. \ud83d\ude0a for 😊) to an 8-digit Unicode escape sequence (i.e. \U0001f60a for 😊)? The replacement happens here: https://github.com/rdfjs/N3.js/blob/main/src/N3Writer.js#L383-L389.

It causes a lot of problems when trying to use N3.js to serialize a string containing emojis. Granted, many of these problems may be the fault of SolidOS or NSS, but if it's all the same, would it be possible to use the JavaScript escape sequence rather than the Unicode one?

The Problems

Problem 1: Strange backslashes when writing to a Pod.

Let's say I want to PATCH the following triple to my POD <https://pod.example/chat1.ttl#message1> <http://rdfs.org/sioc/ns#content> "Here's my emoji: 😊".

Serializing this with N3.js turns it into <https://pod.example/chat1.ttl#message1> <http://rdfs.org/sioc/ns#content> "Here's my emoji: \\U0001f60a". Or at least that's what you would want to write to a Pod.

Running the following code

import { Writer } from "n3";
import type { DatasetCore } from "@rdfjs/types";

function patchToPod(uri: string, dataset: DatasetCore) {
  const writer = new Writer({ format: "N-Triples" });
  for (const quad of dataset) {
    writer.addQuad(quad);
  }
  // end() hands the serialized N-Triples string to the callback.
  writer.end(async (error, parsedString: string) => {
    // parsedString is `<https://pod.example/chat1.ttl#message1> <http://rdfs.org/sioc/ns#content> "Here's my emoji: \U0001f60a".`
    await fetch(uri, {
      method: "PATCH",
      body: `INSERT { ${parsedString} }`,
      headers: { 'content-type': 'application/sparql-update' }
    });
  });
}

causes the following to be written to the Pod (At least on NSS):

@prefix : <#>.
@prefix ch: <https://pod.example/chat1.ttl#>.
@prefix n: <http://rdfs.org/sioc/ns#>.

ch:message1 n:content "Here's my emoji: \uf60a".

Notice that instead of \U0001F60A it's \uf60a. I'm not 100% sure why this is because I'm not super versed in escape codes, but adding an additional slash to the parsed string in the code above fixes it:

fetch(uri, {
  method: "PATCH",
  body: `INSERT { ${parsedString.replace(`\\U`, `\\\\U`)} }`,
  headers: { 'content-type': 'application/sparql-update' }
})

So, this is fixable, it's just kinda weird.

Problem 2: Solid clients don't understand 8-digit Unicode escape sequences

Almost every client for Solid is written in JavaScript, so they understand the JavaScript escape sequence, but even when a unicode escape sequence is correctly formatted in the Pod, clients are unable to understand it. For example, here's what a message in the SolidOS chat looks like with unicode escape:

:fd0071f5-01f9-416d-aa01-24eb6666d618
    n:content "Emoji3 \\U0001f60a";

Causes: image

:fd0071f5-01f9-416d-aa01-24eb6666d618
    n:content "Emoji3 \U0001f60a";

Causes: image

But,

:fd0071f5-01f9-416d-aa01-24eb6666d618
    n:content "Emoji3 \ud83d\ude0a";

Causes: image

So, even if I were to build functionality into my app to parse unicode, it wouldn't be interoperable with the apps that do not.

Possible Solution: Use the JavaScript Escape Character

I'm open to any solution from someone who is more knowledgeable, but it seems that a very simple one is to just not convert emojis and other characters to unicode. The JavaScript escape works fine.

RubenVerborgh commented 3 years ago

Hi @jaxoncreed,

Thanks for the detailed investigation. TL;DR: I expect this to be an rdflib.js bug.

What we're talking about here are surrogate pairs, which are interpreted differently in UCS-2 and UTF-16. JavaScript needs to represent 🙂 as two UTF-16 code units (a surrogate pair), whereas Turtle offers us the possibility of representing this as a single escape sequence.
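To make the surrogate-pair point concrete, here is a minimal TypeScript sketch that lists the UTF-16 code units of a string. The helper name `utf16CodeUnits` is made up for illustration; it is not part of N3.js or rdflib.js.

```typescript
// Hypothetical helper: returns each UTF-16 code unit of `text` as 4-digit hex.
function utf16CodeUnits(text: string): string[] {
  const units: string[] = [];
  for (let i = 0; i < text.length; i++) {
    units.push(("0000" + text.charCodeAt(i).toString(16)).slice(-4));
  }
  return units;
}

// U+1F60A (the emoji from this issue) lies outside the BMP,
// so a JavaScript string stores it as two code units.
const smiley = "\ud83d\ude0a";
console.log(smiley.length);          // 2
console.log(utf16CodeUnits(smiley)); // two entries: "d83d" and "de0a"
```

A Turtle or N-Triples document can instead name the code point directly with a single \U escape followed by eight hex digits.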

The most complete answer is probably that these things are in the Turtle test suite, and we test whether we can parse and reserialize correctly.

Is there a reason why n3 converts the traditional javascript escape character

I'm not an expert; it just seems more convenient to have the actual Unicode value of the character to be encoded rather than surrogate pairs. Also note that we do the inverse when parsing: we will create a surrogate pair when necessary.
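As a rough illustration of the writer-side conversion described here, the sketch below replaces each surrogate pair with the \U form of its code point. It is an assumption-laden approximation, not the actual N3Writer code: the function name `escapeAstral` is hypothetical, and it handles only well-formed pairs.

```typescript
// Hypothetical sketch: rewrite every surrogate pair as \UXXXXXXXX.
// Lone (unpaired) surrogates are left untouched in this simplified version.
function escapeAstral(text: string): string {
  return text.replace(/[\ud800-\udbff][\udc00-\udfff]/g, (pair: string) => {
    const high = pair.charCodeAt(0);
    const low = pair.charCodeAt(1);
    // Standard UTF-16 decoding: combine the two 10-bit halves, then add 0x10000.
    const codePoint = (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
    const hex = codePoint.toString(16);
    return "\\U" + ("00000000" + hex).slice(-8);
  });
}

const escaped = escapeAstral("Here's my emoji: \ud83d\ude0a");
console.log(escaped); // Here's my emoji: \U0001f60a
```

The inverse step on the parsing side takes the eight hex digits and, for code points above U+FFFF, emits the corresponding surrogate pair again.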

Problem 1

Notice that instead of \U0001F60A it's \uf60a.

The SPARQL query is correct. Sounds like an rdflib.js bug.

adding an additional slash to the parsed string in the code above fixes it

Sounds like a second bug. Because the correct thing to be inserted then would be <BACKSLASH> U 0 0 0 1 F 6 0 A.

Problem 2

It's up to the parser to break down characters outside the BMP into surrogate pairs.

Even if you wouldn't receive the input from N3.js as in this case, the string is still valid Turtle, regardless of who generated it. So the parser needs to be able to deal with it in any case.

It's important here that

:fd0071f5-01f9-416d-aa01-24eb6666d618
    n:content "Emoji3 \U0001f60a";

and

:fd0071f5-01f9-416d-aa01-24eb6666d618
    n:content "Emoji3 \ud83d\ude0a";

are equivalent when seen through a Unicode lens. I do not know whether they are also equivalent RDF graphs. That would be for the mailing lists to answer.
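The Unicode-level equivalence can be checked with a small decoder for the two numeric escape forms. This is a simplified sketch under stated assumptions: `decodeNumericEscapes` is a hypothetical name, it ignores the other string escapes (such as \n or an escaped backslash), and it does not validate that the code point is at most U+10FFFF. It only shows that both spellings decode to the same JavaScript string; whether the resulting RDF graphs count as equal is the open question above.

```typescript
// Hypothetical sketch: decode \uXXXX and \UXXXXXXXX escapes in a literal's lexical form.
function decodeNumericEscapes(literal: string): string {
  return literal.replace(
    /\\u([0-9a-fA-F]{4})|\\U([0-9a-fA-F]{8})/g,
    (_match: string, short: string | undefined, long: string | undefined) => {
      const codePoint = parseInt(short !== undefined ? short : (long as string), 16);
      if (codePoint < 0x10000) {
        // BMP code point (or one half of a surrogate pair): a single code unit.
        return String.fromCharCode(codePoint);
      }
      // Outside the BMP: split into a high and a low surrogate.
      const offset = codePoint - 0x10000;
      return String.fromCharCode(0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff));
    }
  );
}

const fromLong = decodeNumericEscapes("Emoji3 \\U0001f60a");
const fromPair = decodeNumericEscapes("Emoji3 \\ud83d\\ude0a");
console.log(fromLong === fromPair); // true
```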

That's of course where your question comes from as to whether N3.js cannot generate the other. My question is instead: why can't rdflib.js parse it? 🙂 Because if I adjust the serializer for rdflib.js, some other library that does not follow another part of the spec might be in trouble.

Possible Solution

I think we need to fix rdflib.js to handle all valid Turtle inputs.

rubensworks commented 3 years ago

For reference, tests were recently added to the SPARQL test suite for handling such cases: https://github.com/w3c/rdf-tests/pull/65. SPARQL.js and Comunica pass these tests, so this might be a good reference on how this should be handled.

jaxoncreed commented 3 years ago

Okay, sounds good. I've opened an issue on rdflib here: https://github.com/linkeddata/rdflib.js/issues/494. I'll close this one in favor of that one.