language stem should respect langMatches semantics

VladimirAlexiev commented 6 years ago

The following shape: :SpanishProduct { schema:label [ @es~ ] } declares that products must have a label in Spanish or any variant of it (e.g. es-ES vs es-AR).

But LanguageStem is defined as simple prefix match (http://shex.io/shex-semantics/#nodeIn):

s is a LanguageStem and n is a language-tagged string with a language tag l and fn:starts-with(l, st)

It has these defects:

it will match language "Carro"@ese where ese is Ese Ejja, and I don't think those people got cars ;-)
it won't match "Carro"@ES but lang tags are defined to be case-insensitive.
(also, the definition says st where it should refer to s)

Instead of simple prefix match, it should comply with https://www.w3.org/TR/sparql11-query/#func-langMatches semantics. RFC4647 defines tags for lang, script, dialect, region etc.; and that it's case-insensitive. Assuming s doesn't end in - and assuming . represents concat, it can be defined e.g. as: regex (l, "(^".s."$)|(^".s."-)", "i") Note: a simpler regex would be "^".s."($|-)" but I don't believe the last part of it is valid.
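To make the difference concrete, here is a minimal Python sketch comparing the simple prefix test with the stricter match proposed above. The function names (prefix_match, stem_match, stem_match_regex) are hypothetical and only for illustration; this is not the shex.js implementation, and the regex uses Python syntax rather than XPath syntax.

```python
import re


def prefix_match(stem: str, tag: str) -> bool:
    """The simple prefix test, i.e. fn:starts-with(l, s); case-sensitive."""
    return tag.startswith(stem)


def stem_match(stem: str, tag: str) -> bool:
    """Case-insensitive test: the tag equals the stem or continues with '-'."""
    s, l = stem.lower(), tag.lower()
    return l == s or l.startswith(s + "-")


def stem_match_regex(stem: str, tag: str) -> bool:
    """The same test as one regex, mirroring the ^s($|-) form.

    Python's $ also matches just before a trailing newline, so \\Z is used
    to anchor strictly at the end of the string.
    """
    pattern = "^" + re.escape(stem) + r"(?:-|\Z)"
    return re.search(pattern, tag, re.IGNORECASE) is not None


if __name__ == "__main__":
    # Tags taken from the defects listed above and from the test URL.
    for stem, tag in [("es", "ese"), ("es", "ES"), ("es", "es-AR"),
                      ("aa", "aa"), ("aa", "aa-ES"), ("aa", "aaa-ES")]:
        print(stem, tag, prefix_match(stem, tag), stem_match(stem, tag))
```

With the prefix test, "es" accepts "ese" and rejects "ES"; with the stricter test it is the other way round, which is the behaviour asked for here.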

Aside: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry is a bit unreadable. The script https://gist.github.com/VladimirAlexiev/8733439 turns it into this more readable google sheet

TEST: @ericprud gave this example URL. For me, it doesn't load the test on first load (or control-shift-R) but loads it on second refresh (control-R): http://rawgit.com/shexSpec/shex.js/master/doc/shex-simple.html?schema=%3CS%3E%20%7B%20%3Cp%3E%20%5B%40aa~%5D%20%7D&data=%3Cexact%3E%20%3Cp%3E%20%22exact%22%40aa%20.%0A%3Csub%3E%20%3Cp%3E%20%22sub%22%40aa-ES%20.%0A%3CshouldFail%3E%20%3Cp%3E%20%22shouldFail%22%40aaa-ES%20.%0A&shape-map=%7BFOCUS%20%3Cp%3E%20_%7D%40%3CS%3E

jimkont commented 6 years ago

Resolved with 20170915 meeting

Resolution: change language tag matching to follow RFC4647 per

voted by: Andra, Kat, ericP, tom

ericprud commented 6 years ago

See ~ LanguageStem follows rfc4647

ericprud commented 6 years ago

Need feedback from @VladimirAlexiev on spec changes and tests before closing. Note that the issue demo fails on master (<shouldFail> passes because the test doesn't respect rfc4647) but passes on the LanguageStem-rfc4647 branch.

VladimirAlexiev commented 6 years ago

Spec sounds good, I like the ref to https://tools.ietf.org/html/rfc4647#section-3.3.1. Maybe say that * is not allowed, and what happens if I give an incomplete lang tag eg @e~ (answer: won't match any value).

Tests look correct, but:

feel a bit uncomfortable about using unregistered sublang tags like @fr-bel
Maybe do some case variation (the matching should be case-insensitive)

Cheers @ericprud !

ericprud commented 6 years ago

I was going to do a separate PR to add "*" to the grammar a la

[55] languageRange ::= (LANGTAG | '*') ('~' languageExclusion*)?

I tried to find two region codes where one was a substring of the other. Do you know where I can find the canonical list of regions? I picked a valid three-letter ISO region code ("bel"). I guess I could switch from FR to DE and use the example from RFC4647 basic match.

Re: case variation, true. Early on, I had data files like spo@fr.ttl and spo@FR.ttl but I think some case-insensitive file system ate them long ago. Will re-add tests for that and for shex files matching @FR, . - ~@FR and @FR~ - ~FR-BE.

VladimirAlexiev commented 6 years ago

Regions: https://docs.google.com/spreadsheets/d/1M1yv9aBUmc-NyCJX69vOLUmH2uIglSwmDwgRgByI1AI/edit#gid=2001354273 and filter by type=region. These are 2-letter country codes and 3-digit continent-like codes. So there are no "substring of another".

But if there were, the matching is still the same: next should come dash or end of string. I.e. @en-G~ will match neither @en-GB nor @en-GR.

What do you want with *? Eg @*-GB to match any language spoken in Great Britain?

I think this falls under "extended matching" https://tools.ietf.org/html/rfc4647#section-3.3.2. And you can put the star in any position, so the above [55] is not enough.
Check whether langMatches() supports it, I'm a bit doubtful.

!!!!! Because Cyrl is the default script for ru, ru is the same as ru-Cyrl. This means that ru-RU~ should match ru-Cyrl-RU. My oh my.

And the star would add more complications
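For comparison, here is a rough Python sketch of the extended filtering steps as I read them in RFC 4647 section 3.3.2. It is only an illustration of what a star in any position would involve: the function name extended_filter_matches is hypothetical, and nothing here is what ShEx or langMatches() specify.

```python
def extended_filter_matches(language_range: str, tag: str) -> bool:
    """Sketch of RFC 4647 section 3.3.2 extended filtering (case-insensitive)."""
    range_subtags = language_range.lower().split("-")
    tag_subtags = tag.lower().split("-")

    # The first subtags must match, unless the range starts with the wildcard.
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False

    r, t = 1, 1
    while r < len(range_subtags):
        if range_subtags[r] == "*":
            r += 1            # wildcard in the range: skip it
        elif t >= len(tag_subtags):
            return False      # range wants more, but the tag is exhausted
        elif range_subtags[r] == tag_subtags[t]:
            r += 1            # subtags agree: advance both lists
            t += 1
        elif len(tag_subtags[t]) == 1:
            return False      # singleton (e.g. 'x') in the tag blocks skipping
        else:
            t += 1            # skip an intervening subtag such as a script
    return True


if __name__ == "__main__":
    for rng, tag in [("*-GB", "en-GB"), ("*-GB", "en-GR"),
                     ("ru-RU", "ru-Cyrl-RU"), ("en-G", "en-GB")]:
        print(rng, tag, extended_filter_matches(rng, tag))
```

Under this scheme ru-RU does accept ru-Cyrl-RU, because the intervening script subtag is skipped, whereas the basic "dash or end of string" test would reject it. An incomplete stem like en-G still matches nothing.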

ericprud commented 6 years ago

Re case sensitivity, I varied the case in the data and the schema. The latter raised a round-tripping issue to RDF. I invite you to review those PRs.

ericprud commented 5 years ago

It is our belief that the semantics in ShEx 2.1 § 5.4.6 Values Constraint address this. Please close this issue if you agree.

VladimirAlexiev commented 5 years ago

I've read the section and I think it addresses this by reference to other standards. In particular I like: st is a basic language range per Matching of Language Tags [rfc4647] section 2.1 and l matches st per the basic filtering scheme defined in [rfc4647] section 3.3.1.

In other words, one is not supposed to use an incomplete stem like en-G~

shexSpec / shex

language stem should respect langMatches semantics #71