shexSpec / spec

ShEx specification
https://shexspec.github.io/spec/

ShExC valueSetValue/exclusions examples inconsistent with example #42

Open gkellogg opened 2 years ago

gkellogg commented 2 years ago

As noted in https://lists.w3.org/Archives/Public/public-shex/2021Aug/0001.html:

In ShEx 2.0, the productions were defined as follows:

[49]    valueSetValue         ::= iriRange | literalRange | languageRange | '.' exclusion+
[50]    exclusion             ::= '-' (iri | literal | LANGTAG) '~'?

In ShEx 2.1, they were updated to the following:

[49]    valueSetValue         ::= iriRange | literalRange | languageRange | exclusion+
[50]    exclusion             ::= '.' '-' (iri | literal | LANGTAG) '~'?

But the note on [49] still says "If '.' matches and exclusion matches one or more times", which doesn't make sense in this context. Also, the third ValuesConstraint example has a '.' only at the beginning:

ex:EmployeeShape {
  foaf:mbox [ . - <mailto:engineering->~ - <mailto:sales->~ ]
}

Looks like the changes were made in error? Certainly, the new grammar is not backward-compatible with 2.0: a value set written for 2.0, like the example above, no longer parses.
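To make the mismatch concrete, here is a minimal sketch that approximates the exclusion part of each grammar version with a regular expression and runs the spec's example through both. The `TERM` pattern and the `accepts` helper are illustrative assumptions, not part of any ShEx implementation, and the real ShExC terminals are considerably richer than this.

```python
import re

# Simplified term pattern (assumption): an IRI in <...>, a single-quoted
# literal, or an @-prefixed language tag. Real ShExC terminals are richer.
TERM = r"(?:<[^<>\s]*>|'[^']*'|@[A-Za-z]+(?:-[A-Za-z0-9]+)*)"

# ShEx 2.0 shape: '.' exclusion+   with   exclusion ::= '-' term '~'?
SHEX_2_0 = re.compile(rf"\.\s*(?:-\s*{TERM}\s*~?\s*)+")

# ShEx 2.1 shape as quoted: exclusion+   with   exclusion ::= '.' '-' term '~'?
SHEX_2_1 = re.compile(rf"(?:\.\s*-\s*{TERM}\s*~?\s*)+")


def accepts(grammar, text):
    """Return True if the whole text is matched by the given pattern."""
    return grammar.fullmatch(text.strip()) is not None


# The body of the third ValuesConstraint example: a single leading '.'.
spec_example = ". - <mailto:engineering->~ - <mailto:sales->~"

# What the 2.1 productions would require instead: a '.' before every '-'.
dotted_form = ". - <mailto:engineering->~ . - <mailto:sales->~"

print(accepts(SHEX_2_0, spec_example), accepts(SHEX_2_1, spec_example))
print(accepts(SHEX_2_0, dotted_form), accepts(SHEX_2_1, dotted_form))
```

Under these simplified patterns, the spec's own example is accepted only by the 2.0 shape, and the form with a repeated '.' is accepted only by the 2.1 shape.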

ericprud commented 2 years ago

While we're dealing with this, I think we can have a bit more sanity checking by saying that the exclusions have to be homogeneous. As a counterexample, consider

  foo:code [. # any RDF term...
    - 'a'~ - 'e'~  # ... except strings starting with 'a' or 'e'
    - @en-UK~ - @fr~ # ... or British or French RDF langStrings (regardless of region, script, etc.)
  ]

Would it permit this?:

<s> foo:code <http://a.example> .

The grammar would imply that it does, but in ShExJ we see that exclusions are typed, e.g. LiteralStemRange and LanguageStemRange in:

      { "type": "TripleConstraint",
        "predicate": "...code",
        "valueExpr": {
          "type": "NodeConstraint",
          "values": [
            { "type": "LiteralStemRange",
              "stem": { "type": "Wildcard" },
              "exclusions": [
                { "type": "LiteralStem", "stem": "a" },
                { "type": "LiteralStem", "stem": "e" }
              ] },
            { "type": "LanguageStemRange",
              "stem": { "type": "Wildcard" },
              "exclusions": [
                "en-UK",
                "fr"
              ] }
          ] } }
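A rough sketch of how such a typed value set might be evaluated shows why the IRI above would not be accepted. The node representation (`kind`, `value`, `language`) and the function names are hypothetical. Matching is simplified to case-sensitive prefix tests, and the literal range is assumed to test the lexical form of any literal, so this should not be read as the spec's exact semantics.

```python
# The "values" array from the ShExJ above, copied as Python data.
VALUES = [
    {"type": "LiteralStemRange",
     "stem": {"type": "Wildcard"},
     "exclusions": [{"type": "LiteralStem", "stem": "a"},
                    {"type": "LiteralStem", "stem": "e"}]},
    {"type": "LanguageStemRange",
     "stem": {"type": "Wildcard"},
     "exclusions": ["en-UK", "fr"]},
]


def is_excluded(text, exclusions):
    """Stem objects exclude by prefix; plain strings exclude by equality."""
    for ex in exclusions:
        if isinstance(ex, dict):
            if text.startswith(ex["stem"]):
                return True
        elif text == ex:
            return True
    return False


def matches_value(node, value):
    """Test one node against one typed stem range (simplified)."""
    if node["kind"] != "literal":
        return False  # both range types here only ever match literals
    if value["type"] == "LiteralStemRange":
        text = node["value"]
    elif value["type"] == "LanguageStemRange":
        text = node.get("language")
        if not text:
            return False  # no language tag, so a language range cannot match
    else:
        raise ValueError("unsupported value type: " + value["type"])
    stem = value["stem"]
    is_wildcard = isinstance(stem, dict) and stem.get("type") == "Wildcard"
    if not is_wildcard and not text.startswith(stem):
        return False
    return not is_excluded(text, value.get("exclusions", []))


def in_value_set(node, values):
    """A node is in the value set if any single value matches it."""
    return any(matches_value(node, v) for v in values)


iri = {"kind": "iri", "value": "http://a.example"}
print(in_value_set(iri, VALUES))
```

Because each range is typed, the IRI matches neither entry. A literal is accepted as soon as either range accepts it.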

With homogeneous exclusions, we can reflect the ShExJ. You could still state the above, but you'd need two terms in the valueSet:

  foo:code [
    . -'a'~ -'e'~ # any string, except one starting with 'a' or 'e'
    . -@en-UK~ -@fr~ # none of them Britishisms, and nothing French
  ]

Here's the grammar that ShExJS uses (which passes the tests):

valueSetValue: iriRange | literalRange | languageRange
    | '.' (iriExclusion+ | literalExclusion+ | languageExclusion+)

iriRange: iri ('~' iriExclusion*)?

iriExclusion: '-' iri '~'?

literalRange: literal ('~' literalExclusion*)?

literalExclusion: '-' literal '~'?

languageRange:
      LANGTAG ('~' languageExclusion*)?
    | '@' '~' languageExclusion*

languageExclusion: '-' LANGTAG '~'?

Which lines up with https://github.com/shexSpec/grammar/blob/master/ShExDoc.g4#L149-L161.
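As an illustration of what the `'.' (iriExclusion+ | literalExclusion+ | languageExclusion+)` alternative enforces, the following sketch classifies each exclusion after the leading '.' and checks that they are all of one kind. The regular expression and helper names are assumptions made for this example. It is not how ShExJS or the ANTLR grammar is implemented, and it ignores comments and most ShExC terminal forms.

```python
import re

# One exclusion: '-' followed by an IRI, a single-quoted literal, or a
# language tag, with an optional trailing '~'. Simplified terminals.
EXCLUSION = re.compile(
    r"-\s*(?:(?P<iri><[^<>\s]*>)"
    r"|(?P<literal>'[^']*')"
    r"|(?P<language>@[A-Za-z]+(?:-[A-Za-z0-9]+)*))\s*~?"
)


def exclusion_kinds(text):
    """Return the kind of each exclusion following the leading '.'."""
    text = text.strip()
    if not text.startswith("."):
        raise ValueError("expected a leading '.'")
    rest = text[1:]
    kinds = []
    pos = 0
    while True:
        # Skip whitespace between exclusions.
        while pos < len(rest) and rest[pos].isspace():
            pos += 1
        if pos == len(rest):
            break
        match = EXCLUSION.match(rest, pos)
        if match is None:
            raise ValueError("unexpected input at offset %d" % pos)
        kinds.append(match.lastgroup)  # name of the alternative that matched
        pos = match.end()
    if not kinds:
        raise ValueError("expected at least one exclusion")
    return kinds


def is_homogeneous(text):
    """True if every exclusion is of the same kind (all IRI, literal, or language)."""
    return len(set(exclusion_kinds(text))) == 1


print(is_homogeneous(". -'a'~ -'e'~"))
print(is_homogeneous(". - 'a'~ - 'e'~ - @en-UK~ - @fr~"))
```

The mixed literal and language example from earlier in this comment is rejected by this check, while each of the two split terms would pass on its own.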

PROPOSE: adopt the ANTLR productions for valueSetValue.

gkellogg commented 2 years ago

That seems reasonable, although I'll need to implement it for myself to be sure.