Round tripping language tags case

ericprud commented 6 years ago

Related: #71

Apart from value set values with language tags, ShExC, ShExJ and ShExR can be exactly round-tripped, cf. the schema tests. Because language-tagged literals are expressed as JSON-LD object literals and RDF parsers are not responsible for preserving upper/lower case in literal language tags, a ShExC schema:

<vs1> ["flat"@en-GB]

would be translated to ShExR:

[] a sx:Schema ; sx:shapes <http://a.example/vs1> .
<http://a.example/vs1> a sx:NodeConstraint ;
  sx:values ( "flat"@en-GB ) .

An RDF parser is allowed to parse that as en-gb so it would round-trip to ShExC:

<vs1> ["flat"@en-gb]

This doesn't affect semantics of validation but it can be a pain for folks who like to follow ISO language code rules where regions should be upper case, i.e. en-GB. (This has little impact as no one uses ShExR anyways.) Round-tripping between ShExC and ShExJ (as JSON) is unaffected by this.
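A minimal Python sketch of the effect described above, assuming a parser that lowercases language tags. The names `LangLiteral`, `lowercasing_parse`, `same_literal` and `to_shexc` are hypothetical and are not part of any ShEx or RDF library; the serializer does not escape quotes in the value.

```python
from typing import NamedTuple


class LangLiteral(NamedTuple):
    """Hypothetical stand-in for a language-tagged literal in a value set."""
    value: str
    lang: str


def lowercasing_parse(lit: LangLiteral) -> LangLiteral:
    # Assumption: mimics an RDF parser that normalizes tags to lower case.
    return LangLiteral(lit.value, lit.lang.lower())


def same_literal(a: LangLiteral, b: LangLiteral) -> bool:
    # Language tags are compared case-insensitively; values are compared exactly.
    return a.value == b.value and a.lang.lower() == b.lang.lower()


def to_shexc(lit: LangLiteral) -> str:
    # Simplified ShExC-style rendering of one value set entry.
    return f'"{lit.value}"@{lit.lang}'


original = LangLiteral("flat", "en-GB")
round_tripped = lowercasing_parse(original)

print(to_shexc(original))       # text of the value before the ShExR round trip
print(to_shexc(round_tripped))  # text of the value after the ShExR round trip
print(same_literal(original, round_tripped))
```

The two serializations differ only in the case of the tag, while the case-insensitive comparison still treats them as the same literal, which is why validation is unaffected.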

PROPOSE:

add a note in the spec documenting this as a round-tripping deficiency and stating that if this is a problem for users, future versions of ShExJ will not use JSON-LD object literals for value set values.
adopt https://github.com/shexSpec/shexTest/pull/25 which has additional schemas which differ only in language tag case.

ericprud commented 6 years ago

Alternate choice: leave no expectation of case-preservation when converting ShExC<->ShExJ

PROPOSE:

add a note in the spec documenting that no round trips preserve case and stating that if this is a problem for users, future versions of ShExJ will not use JSON-LD object literals for value set values.
reject https://github.com/shexSpec/shexTest/pull/25 and add pair-wise mixed-case tests which demonstrate that both "ab"@en and "ab"@EN parse to each other.

VladimirAlexiev commented 6 years ago

"ISO": indeed, e.g. ru-Cyrl-RU is preferable to ru-cyrl-ru. Can you say that converting back to ShExC converts tags using "BCP normalization"? I posted to the RDF mailing list https://lists.w3.org/Archives/Public/public-rdf-comments/2014Jan/0011.html including an overview of what different frameworks do (e.g. Sesame lowercases, Jena preserves), and an implementation https://rt.cpan.org/Public/Ticket/Attachment/1267147/670949/lang_normalize.pl.

This won't help round-tripping but at least will enforce determinism.

ericprud commented 6 years ago

If I understand, the proposal is to:

UPPERCASE any two-letter sequence following a sequence of two or more letters.
Titlecase any four-letter sequence following a sequence of two or more letters.

e.g. mn-Cyrl-MN. I am motivated to improve RDF conformance with BCP47 rather than continue to propagate lazy short-cuts.

Does this derive from the BCP47 grammar or some other text in the doc? Diving deeper into BCP47 than I ever wanted, I see a finite list of irregular tags with the comment "most are deprecated". What would be best here, ignore them or reference them from the spec (and thus stick them in every impl)?

Can you make a PR on the spec value constraints section (and maybe value set parsing) to make this concrete? I'd propose @gkellogg and @ericprud as reviewers.

VladimirAlexiev commented 6 years ago

Your description is correct (but there's also a dash between the two sequences). The script quotes verbatim from http://tools.ietf.org/html/bcp47#section-2.1.1. Irregular tags will be normalized to what is given in the spec. You don't need to specify them separately.

I don't think this normalization has any bearing on validation, since validation must be case-insensitive. Cheers!
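The casing convention discussed in this thread can be sketched in Python as follows. This is an illustration under stated assumptions, not the referenced Perl script and not spec text: `normalize_langtag` is a hypothetical name, the input is assumed to be a well-formed tag, and irregular (grandfathered) tags get no special treatment.

```python
def normalize_langtag(tag: str) -> str:
    """Apply the casing convention described above to a language tag.

    Everything is lowercased first. A two-character subtag is then
    uppercased, and a four-character subtag titlecased, when it is not
    the first subtag and the subtag before it is longer than one
    character (that is, it does not directly follow a singleton such
    as 'x').
    """
    subtags = tag.lower().split("-")
    result = []
    for i, sub in enumerate(subtags):
        follows_multi = i > 0 and len(subtags[i - 1]) >= 2
        if follows_multi and len(sub) == 2:
            sub = sub.upper()        # e.g. region: gb -> GB
        elif follows_multi and len(sub) == 4:
            sub = sub.capitalize()   # e.g. script: cyrl -> Cyrl
        result.append(sub)
    return "-".join(result)


for example in ["en-gb", "MN-CYRL-mn", "ru-cyrl-ru", "en-ca-x-ca"]:
    print(example, "->", normalize_langtag(example))
```

Because the function lowercases before re-casing, it gives the same output for any case variant of the same tag, which is the determinism mentioned earlier in the thread.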

ericprud commented 6 years ago

I wasn't worried about the validation, just exactly how to specify the canonical form. I guess you have something in mind like:

When emitting a ShEx schema, language tags in that schema SHOULD be in the canonical language tag form in order to comply with [[!BCP47]] section @@!. A language tag is in canonical language tag form if the language tag is split on '-' into a set of sequences and the following rules are applied before it is joined again on '-':

Each two-letter sequence following a sequence of two or more letters is in uppercase, e.g. ab-CD-EF-ghi

Each four-letter sequence following a sequence of two or more letters is in title case. e.g. ab-Cdef-Ghij

Where in BCP47 do the capitalization rules come from? Can we justify the rules above?

VladimirAlexiev commented 6 years ago

before it is joined again

This is wrong. This would be ambiguous for e.g. x-whatever. The rules require capitalization in e.g. x-what-Ever and x-what-EV but require nothing in x-whatever or x-what-everything or x-what-eve.

Just refer to sec 2.1.1. IMHO you don't need to restate the rules, just give some examples

VladimirAlexiev commented 6 years ago

I think I misread what "joined" means. I still think you don't need to restate the rules, but if you want to do it, please change "set of sequences" to "sequence of strings"

shexSpec / shex

Round tripping language tags case #73