Closed VladimirAlexiev closed 5 years ago
Resolved with 20170915 meeting
Resolution: change language tag matching to follow RFC4647 per
voted by: Andra, Kat, ericP, tom
Need feedback from @VladimirAlexiev on spec changes and tests before closing. Note that the issue demo fails on master (<shouldFail> passes because the test doesn't respect rfc4647) but passes on the LanguageStem-rfc4647 branch.
Spec sounds good, I like the ref to https://tools.ietf.org/html/rfc4647#section-3.3.1. Maybe say that * is not allowed, and what happens if I give an incomplete lang tag eg @e~ (answer: won't match any value).
Tests look correct, but:
@fr-bel
Cheers @ericprud !
I was going to do a separate PR to add "*" to the grammar a la
[55] languageRange ::= (LANGTAG | '*') ('~' languageExclusion*)?
I tried to find two region codes where one was a substring of the other. Do you know where I can find the canonical list of regions? I picked a valid three-letter ISO region code ("bel"). I guess I could switch from FR to DE and use the example from RFC4647 basic match.
Re: case variation, true. Early on, I had data files like spo@fr.ttl and spo@FR.ttl but I think some case-insensitive file system ate them long ago. Will re-add tests for that and for shex files matching @FR, . - ~@FR and @FR~ - ~FR-BE.
Regions: https://docs.google.com/spreadsheets/d/1M1yv9aBUmc-NyCJX69vOLUmH2uIglSwmDwgRgByI1AI/edit#gid=2001354273 and filter by type=region. These are 2-letter country codes and 3-digit continent-like codes, so no region code is a substring of another.
But if there were, the matching is still the same: after the stem, the next character must be a dash or the end of the string. I.e. @en-G~ will not match @en-GB and @en-GR.
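To make the "dash or end of string" rule concrete, here is a minimal Python sketch that compares whole subtags instead of raw characters. The function name `stem_matches` is made up for illustration and is not part of shex.js or the spec; it only models basic filtering on a stem without the trailing `~`, and it does not handle `*`.

```python
def stem_matches(stem: str, tag: str) -> bool:
    """Hypothetical helper: does a language stem match a language tag?

    Both values are split on '-' and compared subtag by subtag,
    case-insensitively, so a stem can only end on a subtag boundary.
    """
    stem_parts = stem.lower().split("-")
    tag_parts = tag.lower().split("-")
    # The tag must start with exactly the stem's subtags.
    return tag_parts[:len(stem_parts)] == stem_parts


if __name__ == "__main__":
    for stem, tag in [("en", "en-GB"), ("en-G", "en-GB"), ("en-G", "en-GR"),
                      ("fr", "FR-BE"), ("es", "ese")]:
        print(stem, tag, stem_matches(stem, tag))
```

With this rule an incomplete stem such as en-G simply never matches, which is the behaviour described above.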
What do you want with *? Eg @*-GB to match any language spoken in Great Britain?
[55] is not enough. Whether langMatches() supports it, I'm a bit doubtful. Because Cyrl is the default script for ru, ru is the same as ru-Cyrl. This means that ru-RU~ should match ru-Cyrl-RU. My oh my.
And the star would add more complications.
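For comparison only: RFC 4647 also defines an extended filtering scheme (section 3.3.2) in which `*` may stand for any subtag and non-singleton subtags in the tag may be skipped. The thread settled on basic filtering, so the sketch below is not what ShEx specifies; it just illustrates how a range like *-GB or ru-RU would behave under that other scheme. The function name `extended_match` is hypothetical.

```python
def extended_match(lang_range: str, tag: str) -> bool:
    """Sketch of RFC 4647 extended filtering (section 3.3.2)."""
    range_parts = lang_range.lower().split("-")
    tag_parts = tag.lower().split("-")

    # The first subtags must match, unless the range starts with '*'.
    if range_parts[0] != "*" and range_parts[0] != tag_parts[0]:
        return False

    r, t = 1, 1
    while r < len(range_parts):
        if range_parts[r] == "*":
            r += 1                      # wildcard: move on in the range only
        elif t >= len(tag_parts):
            return False                # range has subtags left, tag does not
        elif range_parts[r] == tag_parts[t]:
            r += 1                      # subtags agree: advance both
            t += 1
        elif len(tag_parts[t]) == 1:
            return False                # singleton (e.g. 'x') may not be skipped
        else:
            t += 1                      # skip an intervening tag subtag
    return True


if __name__ == "__main__":
    print(extended_match("ru-RU", "ru-Cyrl-RU"))
    print(extended_match("*-GB", "en-GB"))
    print(extended_match("de-DE", "de-x-DE"))
```

Under this scheme ru-RU matches ru-Cyrl-RU because the script subtag is skipped, not because of any knowledge that Cyrl is the default script for ru.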
Re case sensitivity, I varied the case in the data and the schema. The latter raised a round-tripping issue to RDF. I invite you to review those PRs.
It is our belief that the semantics in ShEx 2.1 § 5.4.6 Values Constraint address this. Please close this issue if you agree.
I've read the section and I think it addresses this by reference to other standards. In particular I like: st is a basic language range per Matching of Language Tags [rfc4647] section 2.1 and l matches st per the basic filtering scheme defined in [rfc4647] section 3.3.1.
In other words, one is not supposed to use an incomplete stem like en-G~.
The following shape:
:SpanishProduct { schema:label [ @es~ ] }
declares that products must have a label in Spanish or any variant of it (eg es-ES vs es-AR).
But LanguageStem is defined as simple prefix match (http://shex.io/shex-semantics/#nodeIn):
It has these defects:
- It matches "Carro"@ese, where ese is Ese Ejja, and I don't think those people got cars ;-)
- It does not match "Carro"@ES, but lang tags are defined to be case-insensitive.
- st should refer to s)
Instead of simple prefix match, it should comply with https://www.w3.org/TR/sparql11-query/#func-langMatches semantics. RFC4647 defines tags for lang, script, dialect, region etc etc; and that it's case-insensitive. Assuming s doesn't end in - and assuming . represents concat, it can be defined eg like:
regex (l, "(^".s."$)|(^".s."-)", "i")
Note: a simpler regex would be "^".s."($|-)" but I don't believe the last part of it is valid.
Aside: https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry is a bit unreadable. The script https://gist.github.com/VladimirAlexiev/8733439 turns it into this more readable google sheet
TEST: @ericprud gave this example URL. For me, it doesn't load the test on first load (or control-shift-R) but loads it on second refresh (control-R): http://rawgit.com/shexSpec/shex.js/master/doc/shex-simple.html?schema=%3CS%3E%20%7B%20%3Cp%3E%20%5B%40aa~%5D%20%7D&data=%3Cexact%3E%20%3Cp%3E%20%22exact%22%40aa%20.%0A%3Csub%3E%20%3Cp%3E%20%22sub%22%40aa-ES%20.%0A%3CshouldFail%3E%20%3Cp%3E%20%22shouldFail%22%40aaa-ES%20.%0A&shape-map=%7BFOCUS%20%3Cp%3E%20_%7D%40%3CS%3E