w3c / rdf-star

RDF-star specification
https://w3c.github.io/rdf-star/

(tracking) need explicit notes that `<<` and `>>` are not any of the visually similar characters #282

Open TallTed opened 11 months ago

TallTed commented 11 months ago

This is necessary for at least Turtle, N-Triples, and N-Quads. I have not created distinct issues.

The syntax discussions and EBNFs all use the right characters, but I have located nothing that explicitly states that << and >> are not any of the several visually similar characters. (Degree of similarity varies with font, among other things.)

Correct
<< two LESS-THAN SIGN
>> two GREATER-THAN SIGN
Some of the incorrect
« one LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
» one RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
‹‹ two SINGLE LEFT-POINTING ANGLE QUOTATION MARK
›› two SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
❮❮ two HEAVY LEFT-POINTING ANGLE QUOTATION MARK ORNAMENT
❯❯ two HEAVY RIGHT-POINTING ANGLE QUOTATION MARK ORNAMENT
〈〈 two LEFT ANGLE BRACKET
〉〉 two RIGHT ANGLE BRACKET
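
The distinction in the list above can be checked mechanically rather than by eye. Below is a minimal sketch using Python's standard `unicodedata` module to print the code point and formal Unicode name of each character. The `CANDIDATES` list and the `describe` helper are names introduced here for illustration, and the escapes assume that the characters shown in the list are the code points their names indicate.

```python
import unicodedata

# Delimiter characters from the list above, written as escapes so that
# fonts and editors cannot silently change them (assumed code points).
CANDIDATES = [
    "\u003C", "\u003E",  # the correct characters, used doubled
    "\u00AB", "\u00BB",
    "\u2039", "\u203A",
    "\u276E", "\u276F",
    "\u3008", "\u3009",
]


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in CANDIDATES:
    print(describe(ch))
```

Only the first two entries are valid in the `<<` and `>>` delimiters; the rest differ in code point however similar they look in a given font.
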

gkellogg commented 11 months ago

It's not much different than "<" and ">" for IRIREF, which is long-standing. If you look at the grammar, quotedTriple uses specific Unicode characters for "<<" and ">>". We could add something when they are introduced in 2.2 Quoted Triples that clarifies this, but we never felt the need to do so for "<", and where would such over descriptiveness end?

I could imagine that some editors might automatically replace "<<" with "«" when typing, much as '"' is often replaced with '‟' (DOUBLE HIGH-REVERSED-9 QUOTATION MARK) or '〞' (DOUBLE PRIME QUOTATION MARK), but that's a different problem.

TallTed commented 11 months ago

Whether this is "over descriptive" depends on the reader and writer, in my opinion. Where "such over descriptiveness [would] end" seems logically to be the Unicode code point of such characters. Specifying the characters which reveal that Unicode code point would certainly feel like over-specification, akin to double-escaping of URL characters.

Of course, participants in this WG ought not need this degree of specificity — but some number of us have confusingly referred (and I daresay will refer in future) to the double-{less|greater}-than as "chevrons".

Auto-replacement by editors is another potential headache, and seems to me to be worthy of a note in one or more documents, as this would not have been an issue with the single-{less|greater}-than characters used in IRIREF constructs.
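
One way to catch editor auto-replacement before a document reaches a parser is a small lint pass. The sketch below is illustrative only: `LOOKALIKES` and `find_lookalikes` are hypothetical names, the mapping is not exhaustive, and the scan deliberately ignores context, so it would also flag these characters inside string literals, where they are legitimate.

```python
# Hypothetical mapping from look-alike characters to the ASCII sequence
# that was probably intended; not an exhaustive list.
LOOKALIKES = {
    "\u00AB": "<<",  # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
    "\u00BB": ">>",  # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    "\u2039": "<",   # SINGLE LEFT-POINTING ANGLE QUOTATION MARK
    "\u203A": ">",   # SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
    "\u276E": "<",   # HEAVY LEFT-POINTING ANGLE QUOTATION MARK ORNAMENT
    "\u276F": ">",   # HEAVY RIGHT-POINTING ANGLE QUOTATION MARK ORNAMENT
    "\u3008": "<",   # LEFT ANGLE BRACKET
    "\u3009": ">",   # RIGHT ANGLE BRACKET
}


def find_lookalikes(text: str) -> list:
    """Return (line, column, code point, suggested ASCII) per look-alike found."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in LOOKALIKES:
                findings.append((line_no, col, f"U+{ord(ch):04X}", LOOKALIKES[ch]))
    return findings


# A quoted triple whose delimiters were auto-replaced by an editor.
sample = "\u00AB :s :p :o \u00BB :q :z ."
for finding in find_lookalikes(sample):
    print(finding)
```

Run against the same statement written with two U+003C and two U+003E characters, `find_lookalikes` returns an empty list.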

It might be sufficient to make this explicit in the EBNFs, where a number of characters are already explicitly identified by their Unicode code points ... and where these characters are not now explicitly identified, though the ntriples-bnf.html source does use &lt; and &gt; entities in quotedTriple--

    <tr id="grammar-production-quotedTriple">
      <td>[7]</td>
      <td><code>quotedTriple</code></td>
      <td>::=</td>
      <td>"<code class="grammar-literal">&lt;&lt;</code>" <a href="#grammar-production-subject">subject</a> <a href="#grammar-production-predicate">predicate</a> <a href="#grammar-production-object">object</a> "<code class="grammar-literal">&gt;&gt;</code>"</td>
    </tr>

Where humans would reasonably be expected to read these productions, the specific Unicode characters are not now made explicit, and the HTML markup is not visible.

I've created PR#36 on rdf-n-triples for this. If that works for all, it can be echoed on rdf-n-quads and rdf-turtle.

afs commented 11 months ago

Unicode Character Names (the formal names) are just one set of names for characters. I'm surprised how many of these names are misaligned with common community usage and practice, so readers may not have heard of the Unicode Character Name.

We can avoid further confusion by using the Unicode code point value, which is succinct, unambiguous to the reader, and has an established way to write it.

TallTed commented 11 months ago

I think this is not far removed from the LANGTAG specification where ^^ are (now) clearly documented in the body text, but are not yet so in the EBNFs.

I firmly believe that all of ^, <, and >, both singular and doubled, should be explicitly identified in all of the EBNFs.

afs commented 11 months ago

We should avoid the name, which is confusing because there are several alternatives for each character. The Unicode formal name is not always the one people use; it is just one amongst many.

Just give the codepoint U+.... which is what matters.

EBNF is a formal format that has no way to put in commentary. But maybe you mean something else by "EBNFs"?

TallTed commented 11 months ago

No, I believe I mean the same EBNF as you think I do, e.g., ntriples.bnf.

I don't mean to add commentary there, but simply to explicitly identify the Unicode characters we've been discussing in the relevant places. For instance --

    quotedTriple      ::= '<<' subject predicate object '>>'

-- might become --

    quotedTriple      ::= #x003C#x003C subject predicate object #x003E#x003E

I do suggest that we do this for all instances of these characters through the EBNFs, which I know some will consider overkill, but visual identification of characters like these is simply not reliable, unlike characters like #, $, %, etc. (That said, if you know of visually confusing alternative characters for these, then I would do the same Unicode-based identification for these as well.)
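
To make concrete that the two forms of the production denote the same character sequence, here is a minimal Python sketch that expands `#xN` hexadecimal terminals into the characters they name. `HEX_TERMINAL` and `expand_hex_terminals` are hypothetical helper names, and the sketch handles only this one notation, not EBNF in general.

```python
import re

# Matches one hexadecimal character reference in the #xN notation.
HEX_TERMINAL = re.compile(r"#x([0-9A-Fa-f]+)")


def expand_hex_terminals(terminal: str) -> str:
    """Replace each #xN with the character at that code point."""
    return HEX_TERMINAL.sub(lambda m: chr(int(m.group(1), 16)), terminal)


# The hex form and the quoted-literal form name the same strings.
assert expand_hex_terminals("#x003C#x003C") == "<<"
assert expand_hex_terminals("#x003E#x003E") == ">>"
# The look-alike U+00AB is a different string from two U+003C.
assert expand_hex_terminals("#x00AB") != "<<"

print(expand_hex_terminals("#x003C#x003C"), expand_hex_terminals("#x003E#x003E"))
```

The hex form changes nothing for a parser; its only effect is on a human reader, who no longer depends on how a font renders the glyphs.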

I can live with leaving the character names out of the body text, though I think it would be better to include them (and to address your concern about multiple and/or similar names, we could be more explicit, e.g., instead of circumflex accent, we could say, Unicode CIRCUMFLEX ACCENT, <span class="codepoint">U+005E</span>)

afs commented 11 months ago

The outcome of using code points here, and all the other places you now propose, means that readers have a harder time relating examples they see on the web (material outside the control of this WG), and the examples in our specs, with the grammars.

Being clear at the point of definition is enough.

gkellogg commented 11 months ago

I agree with @afs; this will make the reader's job more challenging, and it is unnecessary IMHO.

The character sequence '<<' is unambiguous when processed for the purpose of parsing input as the Unicode characters effectively represent themselves, which is the whole point of the EBNF. It is only when reading (really just reading a printed page) that it could be ambiguous if '<<' is intended, or some alternative sequence of similar looking character(s). Adding U+003C in the narrative discussion may help clarify this, but there's really no evidence that this has proved to be a problem with this or similar sequences in the past.

TallTed commented 11 months ago

I have witnessed the confusion. Unfortunately, I don't recall whether it was in email, IRC, or otherwise, and I am not finding it easy to locate in my local logs. I will live with '<<', '>>', etc., in the EBNF, and brief mention of the Unicode code points in the narrative, until the confusion documentably resurfaces.