globalwordnet / schemas

WordNet-LMF formats
https://globalwordnet.github.io/schemas/

Validation Schema (dc + foreign keys + namespaces + sensekeys) #5

Closed 1313ou closed 4 years ago

1313ou commented 5 years ago

The bad news

Dublin Core defines the dc: terms as elements, not attributes, so any attempt to validate the dc: attributes against this external reference will fail. How can you see this? The namespace URL http://purl.org/dc/elements/1.1/ redirects to http://dublincore.org/specifications/dublin-core/dcmi-terms/2012-06-14/?v=elements

This page maps the URL to a schema location and contains the following:

Target namespace: http://purl.org/dc/elements/1.1/
Schema location: http://dublincore.org/schemas/xmls/qdc/2008/02/11/dc.xsd

The latter, dc.xsd, can be downloaded and defines:

  <xs:element name="title" substitutionGroup="any"/>
  <xs:element name="creator" substitutionGroup="any"/>
  <xs:element name="subject" substitutionGroup="any"/>
  <xs:element name="description" substitutionGroup="any"/>
  <xs:element name="publisher" substitutionGroup="any"/>
  <xs:element name="contributor" substitutionGroup="any"/>
  <xs:element name="date" substitutionGroup="any"/>
  <xs:element name="type" substitutionGroup="any"/>
  <xs:element name="format" substitutionGroup="any"/>
  <xs:element name="identifier" substitutionGroup="any"/>
  <xs:element name="source" substitutionGroup="any"/>
  <xs:element name="language" substitutionGroup="any"/>
  <xs:element name="relation" substitutionGroup="any"/>
  <xs:element name="coverage" substitutionGroup="any"/>
  <xs:element name="rights" substitutionGroup="any"/>

All elements are declared as substitutable for the abstract element any, which means that the default type for all elements is dc:SimpleLiteral.
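To make the mismatch concrete, here is a minimal illustration as two separate fragments. The first shows a Dublin Core name used as an attribute, the way a WN-LMF file carries it; the second shows the element form that dc.xsd actually declares. The ids, the values, and the `metadata` wrapper element are made up for illustration, and other WN-LMF attributes are omitted.

```xml
<!-- Fragment 1: WN-LMF style, Dublin Core name used as an attribute.
     The id and value are illustrative; other attributes are omitted. -->
<Synset xmlns:dc="http://purl.org/dc/elements/1.1/"
        id="example-synset"
        dc:source="illustrative value"/>

<!-- Fragment 2: what dc.xsd declares, Dublin Core name used as an element.
     The metadata wrapper is hypothetical. -->
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:source>illustrative value</dc:source>
</metadata>
```

A validator that loads the official dc.xsd finds no attribute declaration named `dc:source`, only an element declaration, which is why the first form cannot be validated against it.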

The good news

Some documents specify the schema they expect to be validated against, typically using xsi:noNamespaceSchemaLocation and/or xsi:schemaLocation attributes.
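For reference, such producer-side hints would look roughly like this on the root element. This is only a sketch: the relative file locations are assumptions, and it assumes the WN-LMF elements themselves are in no namespace.

```xml
<!-- Root element of a hypothetical WN-LMF file carrying producer-side hints -->
<LexicalResource
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="WN-LMF-1.1.xsd"
    xsi:schemaLocation="http://purl.org/dc/elements/1.1/ dc.xsd">
  <!-- Lexicon elements omitted -->
</LexicalResource>
```

`xsi:noNamespaceSchemaLocation` names the schema for the unqualified elements, while `xsi:schemaLocation` takes pairs of namespace URI and schema location.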

However, normally this isn't what you want. Usually the document consumer should choose the schema, not the document producer. That is what I did here: the validation data is split by namespace between dc.xsd and WN-LMF-1.1.xsd.
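As a rough sketch of what such a consumer-side split could look like, a local dc.xsd can declare the Dublin Core names as attributes in the dc namespace. This is not necessarily the content of the actual file: the choice of attributes, their `xs:string` type, and the placement of the `Meta` group in this file are all assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical local dc.xsd: declares dc names as attributes, not elements -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dc="http://purl.org/dc/elements/1.1/"
           targetNamespace="http://purl.org/dc/elements/1.1/">

  <!-- Global attribute declarations are qualified by the target namespace -->
  <xs:attribute name="source" type="xs:string"/>
  <xs:attribute name="date" type="xs:string"/>
  <xs:attribute name="description" type="xs:string"/>

  <!-- Attribute group bundling the metadata attributes (all optional) -->
  <xs:attributeGroup name="Meta">
    <xs:attribute ref="dc:source"/>
    <xs:attribute ref="dc:date"/>
    <xs:attribute ref="dc:description"/>
  </xs:attributeGroup>
</xs:schema>
```

The main schema would then pull this in with `<xs:import namespace="http://purl.org/dc/elements/1.1/" schemaLocation="dc.xsd"/>` and reference the group with `<xs:attributeGroup ref="dc:Meta"/>` inside each complex type that accepts metadata.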

Foreign-key attributes in the schema should have their own namespaces

dc: attributes are obviously metadata (I have grouped them into a Meta attribute group) except

Need for a 'sensekey' attribute

This leaves the problem of sensekeys that would have a meaning within the current database. If they have such a meaning, they are generated, not copied. Incidentally, each version of WordNet generates its own sensekeys; the grinder tool does that. It turns out the sensekeys can be generated. I have worked on an XSLT-based transformer tool that does just that declaratively (as an XML-to-XML XSLT transformation) and is to be found here. More on this later. The transformer adds a sensekey attribute, which it would make sense to use in a standalone generation of index.sense.

Besides being pointers, generated sensekeys have also been considered a measure of stability between successive versions of the WordNet database (if two versions generate the same sensekeys, it's highly likely that nothing has changed in the distribution of senses). They can be used as such (and given significant weight) by the relaxmapper, which is meant to find mappings between hierarchies of data. Again, this makes sense only if the sensekeys are generated, not copied.

Need for a 'lexfile' attribute

See the Foreign-key attributes section above. Another option is to put it in the top element (either LexicalResource or Lexicon) of the XML lexical file, where tools can easily access it. But the problem will remain when merging.

Factor out SyntacticBehaviour

Allowing it would avoid considerable redundancy ("The banks %s the check" is repeated 7433 times!)

Like this:

<LexicalResource>
  <Lexicon>
...
        <SyntacticBehaviour id='svo1' subcategorizationFrame="The banks %s the check" />
        <SyntacticBehaviour id='sv1'  subcategorizationFrame="The coins %s "/>
        <SyntacticBehaviour id='svo2' subcategorizationFrame="They %s the bags on the table" />
        <SyntacticBehaviour id='svo3' subcategorizationFrame="They %s the coin " />
...
        <LexicalEntry id="ewn-inoculate-v" >
            <Lemma writtenForm="inoculate" partOfSpeech="v" />
            <Sense id="ewn-inoculate
            <SyntacticBehaviour idref="svo1" senses="ewn-inoculate-v-00086587-03"/>
            <SyntacticBehaviour idref="sv1" senses="ewn-inoculate-v-00053234-01 ewn-inoculate-v-00055835-01 ewn-inoculate-v-00188584-01"/>
            <SyntacticBehaviour idref="svo2" senses="ewn-inoculate-v-00188584-01"/>
            <SyntacticBehaviour idref="svo3" senses="ewn-inoculate-v-00086587-03 ewn-inoculate-v-00834278-01" />
...
        </LexicalEntry>
  </Lexicon>
</LexicalResource>

Note I left out the problem of naming these frames.
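If frames were factored out this way, a schema could also check that every idref points at a declared frame. The following is only a sketch using XSD identity constraints, not part of any released WN-LMF schema: the constraint names `frameKey` and `frameRef` are hypothetical, and the content model of Lexicon is left out, so this would have to be merged into the real element declaration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Content model of Lexicon is omitted here (it defaults to xs:anyType);
       only the identity constraints are shown. -->
  <xs:element name="Lexicon">
    <!-- Each lexicon-level frame declaration must carry a unique id. -->
    <xs:key name="frameKey">
      <xs:selector xpath="SyntacticBehaviour"/>
      <xs:field xpath="@id"/>
    </xs:key>
    <!-- Each entry-level reference must point at a declared frame id. -->
    <xs:keyref name="frameRef" refer="frameKey">
      <xs:selector xpath="LexicalEntry/SyntacticBehaviour"/>
      <xs:field xpath="@idref"/>
    </xs:keyref>
  </xs:element>
</xs:schema>
```

With constraints like these, a reference to an id that no lexicon-level SyntacticBehaviour declares is reported as a validation error instead of passing silently.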

arademaker commented 4 years ago

Can you summarize here the differences between 1.1, 2.0 and 1.10? Can you also explain the reason for keeping all versions in the repo instead of only one? The change from DTD to XSD seems reasonable, but the abstract model is the most important point.

1313ou commented 4 years ago

It's documented here

goodmami commented 4 years ago

Now that I'm looking a bit more closely at these attributes, I'm mostly in agreement with Bernard's points here. The DC metadata don't follow the actual Dublin Core schema, so using the dc: namespace seems a little problematic. Furthermore, only 3 of these attributes are actually used in EWN 2020:

I think only dc:source is used in a general metadata way. The others have a specific WordNet-internal interpretation and would probably be better with their own dedicated attribute (as was done for language; see #22).

I don't have a strong opinion about the namespaces for these attributes, nor about the less redundant SyntacticBehaviour.

fcbond commented 4 years ago

I agree that we should probably try to have fixed names for attributes that we will use with specific WordNet-internal interpretations.

jmccrae commented 4 years ago

Rather than discussing four issues here, I am closing this issue. Please discuss under:

#8 - Syntactic Behaviour

#24 - Dublin Core

#25 - Sense Keys

#26 - Lexicographer files