globalwordnet / schemas

WordNet-LMF formats
https://globalwordnet.github.io/schemas/

2.10 #13

Closed 1313ou closed 4 years ago

1313ou commented 4 years ago

This completes the evolution. Please refer to the global README and the specific README.

EWN would look like this:

N O U N S

         <Sense 
                id="ewn-compulsion-n-09206152-01"
                n="1"
                senseidx="1"   <-- index in lexical entry
                lexid="2"   <-- generated: 2nd sense of many
                synset="ewn-09206152-n"
                sensekey="compulsion%1:16:02::"   <-- generated
                pwn:sensekey="compulsion%1:16:00::"   <-- PWN31
                tagCnt="1">  <-- tag count in PWN31

         </Sense>

V E R B S

      <!-- FACTORED-OUT VERB FRAMES -->
      <SyntacticBehaviour id="ewn-sb-9" subcategorizationFrame="Somebody ----s somebody"/>
      <SyntacticBehaviour id="ewn-sb-10" subcategorizationFrame="Something ----s somebody"/>

      <!-- FACTORED-OUT VERB TEMPLATES -->
      <SyntacticBehaviour id="ewn-st-15" sentenceTemplate="Sam cannot %s Sue"/>
      <SyntacticBehaviour id="ewn-st-59" sentenceTemplate="They %s money on their grandchild"/>
      <SyntacticBehaviour id="ewn-st-143" sentenceTemplate="Did he %s his foot?"/>
      <SyntacticBehaviour id="ewn-st-97" sentenceTemplate="They %s"/>
      <SyntacticBehaviour id="ewn-st-98" sentenceTemplate="They %s themselves"/>

         <Sense 
                id="ewn-fatigue-v-00074774-12"
                n="1"
                senseidx="0"
                lexid="1"   <-- generated: 1st sense of many
                synset="ewn-00074774-v"
                sensekey="fatigue%2:29:01::"   <-- generated: 1st sense of many
                pwn:sensekey="fatigue%2:29:00::"   <-- sks diverge here -->
                verbFrames="ewn-sb-10 ewn-sb-9"   <-- multiple refs to frame declarations above -->
                verbTemplates="ewn-st-15"   <-- ref to template declaration above -->
                >

         </Sense>

         <Sense 
                id="ewn-bandage-v-00082877-01"
                n="1"
                senseidx="0"
                lexid="0"   <-- generated: unique 1st sense -->
                synset="ewn-00082877-v"
                sensekey="bandage%2:29:00::"
                pwn:sensekey="bandage%2:29:00::"
                verbFrames="ewn-sb-8"
                verbTemplates="ewn-st-143"
                >

         </Sense>

         <Sense 
                id="ewn-shower-v-00035252-01"
                n="2"
                senseidx="0"
                lexid="0"
                synset="ewn-00035252-v"
                sensekey="shower%2:29:00::"
                pwn:sensekey="shower%2:29:00::"
                verbFrames="ewn-sb-2"
                verbTemplates="ewn-st-97 ewn-st-98"
                tagCnt="2">   <-- tag count from PWN31

         </Sense>

A D J S

        <Sense 
                id="ewn-galore-ip-s-00014377-02"
                n="1"
                senseidx="0"
                lexid="1"
                adjPosition="ip"   <-- immediate postnominal: errors galore
                synset="ewn-00014377-s"
                sensekey="galore%5:00:01:abundant:00"
                pwn:sensekey="galore%5:00:00:abundant:00"
                />

jmccrae commented 4 years ago

I am going to close this PR as it does not fix any reported issues.

Some comments:

1313ou commented 4 years ago

Sorry to bring out a few things tucked under the carpet. Here are a few comments on your comments.

It should be encoded within the id attribute

Principle 1: XML IDs should not be parsed for information. They should be opaque to machine processing even though they can help troubleshooting. It can be considered bad practice, not to say a hack, to make sense of them. Information should reside in non-ID attributes and element text, not in ID attributes. To push the argument further;

Principle 2: It's best to keep processing local. You don't want to process a sense by having to explore its siblings, by accessing its parent, and then iterating the parent's children... If you must do it repeatedly, annotate senses.

Principle 3: Some information is lost when merging. Prevent this by making it explicit and immune to merging.

SENSEIDX

In the merged file (SenseRelations dropped for clarity), you'll find this:

 <LexicalEntry id="ewn-abandon-v">
      <Lemma partOfSpeech="v" writtenForm="abandon" />
      <Sense id="ewn-abandon-v-02232813-01" synset="ewn-02232813-v" dc:identifier="abandon%2:40:00::"/>
      <Sense id="ewn-abandon-v-02232523-01" synset="ewn-02232523-v" dc:identifier="abandon%2:40:01::"/>
      <Sense id="ewn-abandon-v-02080923-03" synset="ewn-02080923-v" dc:identifier="abandon%2:38:00::"/>
      <Sense id="ewn-abandon-v-00614907-01" synset="ewn-00614907-v" dc:identifier="abandon%2:31:01::"/>
      <Sense id="ewn-abandon-v-00615748-01" synset="ewn-00615748-v" dc:identifier="abandon%2:31:00::"/>
    </LexicalEntry>

How do you retrieve the rank of ewn-00615748-v within its lexical file (2nd sense) if you don't interpret IDs? Here you have three options. Either you parse the dc:identifier; or you jump to the LexicalEntry parent, iterate over its Sense children, dereference each child's synset attribute to get the Synset, read its dc:subject value to check whether the child sense belongs to the lexfile, increment the senseidx counter accordingly, and stop when the child sense is the target sense; or you annotate before the information is lost.
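The non-local lookup described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Lexicon` wrapper, the Dublin Core namespace URI bound to `dc:`, and the `dc:subject` values on the synsets (inferred from the lexfile numbers 40, 38 and 31 in the sense keys) are not taken from the actual merged file, and `rank_in_lexfile` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"  # assumed URI for the dc: prefix

# Trimmed-down merged file; the Synset elements and their dc:subject
# values are illustrative assumptions.
MERGED = f"""
<Lexicon xmlns:dc="{DC}">
  <LexicalEntry id="ewn-abandon-v">
    <Lemma partOfSpeech="v" writtenForm="abandon"/>
    <Sense id="ewn-abandon-v-02232813-01" synset="ewn-02232813-v"/>
    <Sense id="ewn-abandon-v-02232523-01" synset="ewn-02232523-v"/>
    <Sense id="ewn-abandon-v-02080923-03" synset="ewn-02080923-v"/>
    <Sense id="ewn-abandon-v-00614907-01" synset="ewn-00614907-v"/>
    <Sense id="ewn-abandon-v-00615748-01" synset="ewn-00615748-v"/>
  </LexicalEntry>
  <Synset id="ewn-02232813-v" dc:subject="verb.possession"/>
  <Synset id="ewn-02232523-v" dc:subject="verb.possession"/>
  <Synset id="ewn-02080923-v" dc:subject="verb.motion"/>
  <Synset id="ewn-00614907-v" dc:subject="verb.cognition"/>
  <Synset id="ewn-00615748-v" dc:subject="verb.cognition"/>
</Lexicon>
"""

def rank_in_lexfile(root, sense_id):
    """Return the 1-based rank of a sense among its sibling senses
    that belong to the same lexfile, without parsing any ID."""
    synsets = {s.get("id"): s for s in root.iter("Synset")}
    subject = f"{{{DC}}}subject"  # ElementTree's {uri}name form
    for entry in root.iter("LexicalEntry"):
        senses = entry.findall("Sense")
        target = next((s for s in senses if s.get("id") == sense_id), None)
        if target is None:
            continue
        lexfile = synsets[target.get("synset")].get(subject)
        rank = 0
        for s in senses:
            # Dereference each sibling's synset to learn its lexfile.
            if synsets[s.get("synset")].get(subject) == lexfile:
                rank += 1
            if s is target:
                return rank
    raise KeyError(sense_id)

root = ET.fromstring(MERGED)
print(rank_in_lexfile(root, "ewn-abandon-v-00615748-01"))
```

With an annotated file, the same information would be a single local attribute read on the Sense element, which is the point of Principle 2.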

LEXID

The lexid is used in generating the sensekey. It makes explicit what enters into sensekey generation. Not strictly needed.
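As a rough illustration of how the lexid enters the key, here is a minimal Python sketch that assembles a sense key in the Princeton WordNet layout `lemma%ss_type:lex_filenum:lex_id:head_word:head_id`. The function name `make_sensekey` and its parameters are hypothetical, and this is not necessarily how the pipeline in this PR generates keys.

```python
def make_sensekey(lemma, ss_type, lex_filenum, lex_id, head_word="", head_id=None):
    """Assemble a PWN-style sense key.

    ss_type: 1 noun, 2 verb, 3 adjective, 4 adverb, 5 adjective satellite.
    head_word and head_id are only filled in for adjective satellites.
    """
    head = f"{head_id:02d}" if head_id is not None else ""
    return f"{lemma}%{ss_type}:{lex_filenum:02d}:{lex_id:02d}:{head_word}:{head}"

# Reproduces the generated keys shown in the examples earlier in the thread.
print(make_sensekey("compulsion", 1, 16, 2))
print(make_sensekey("fatigue", 2, 29, 1))
print(make_sensekey("galore", 5, 0, 1, "abundant", 0))
```

The three calls correspond to the noun, verb and adjective examples, whose lexid attributes are 2, 1 and 1 respectively.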

SENSEKEYS

sensekey and pwn:sensekey: This makes no sense at all... sense keys are the IDs of sense in one project (Princeton WordNet), there is no need to have three different attributes for this.

Sorry, it does make some sense. dc:identifier is out (that one does not make sense), renamed to pwn:sensekey. That leaves us with two sensekeys, not three: pwn:sensekey to embody the foreign key, and sensekey to implement the inner key.

GENERATED/ ANNOTATED vs NATIVE

It didn't escape you that there are two sets of XSDs. The 1.1 and 2.0 series are for lexicographer files. The 1.10 and 2.10 series are for augmented/annotated files.

lexfile, lexid, senseidx, sensekey are generated. They augment the information but they are computable. They are annotations produced by the pipeline. As such they don't have to be in the lexicographer files. They belong in the 1.10 and 2.10 schemas.

adjPosition and tagCnt are native. They are just imported here because they are present in PWN and EWN lacks them. They belong in the 1.1 and 2.0 schemas.

TAGCNT

Imported from PWN. Very useful when you want to swiftly sort senses by usage.
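A minimal Python sketch of that sort, assuming the tag count is carried in a `tagCnt` attribute as in the examples and that a missing attribute means a count of zero. The entry, the sense IDs and the helper name `senses_by_usage` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry; only the tagCnt attribute name comes from the examples.
ENTRY = """
<LexicalEntry id="ewn-example-v">
  <Sense id="s1" tagCnt="2"/>
  <Sense id="s2"/>
  <Sense id="s3" tagCnt="7"/>
</LexicalEntry>
"""

def senses_by_usage(entry):
    """Return the entry's senses, most frequently tagged first.
    Senses without tagCnt are treated as having a count of 0."""
    return sorted(
        entry.findall("Sense"),
        key=lambda s: int(s.get("tagCnt", "0")),
        reverse=True,
    )

entry = ET.fromstring(ENTRY)
print([s.get("id") for s in senses_by_usage(entry)])
```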

REDUNDANCY

While we are at it.

Why a "dc:subject" when this is the name of the file? What's the point of repeating "noun.object" in all the synsets in wn-noun.xml? Wouldn't that be better handled by the merge processor?

Why repeat verb frame text throughout the files? Better to declare them once and then reference them.
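On the consumer side, declare-then-reference costs one dictionary lookup per reference. The Python sketch below resolves the `verbFrames` and `verbTemplates` references of the fatigue example. The `Lexicon` wrapper, the `ewn-fatigue-v` entry ID and the helper name `resolve` are assumptions made for the sake of a self-contained example.

```python
import xml.etree.ElementTree as ET

LEXICON = """
<Lexicon>
  <SyntacticBehaviour id="ewn-sb-9" subcategorizationFrame="Somebody ----s somebody"/>
  <SyntacticBehaviour id="ewn-sb-10" subcategorizationFrame="Something ----s somebody"/>
  <SyntacticBehaviour id="ewn-st-15" sentenceTemplate="Sam cannot %s Sue"/>
  <LexicalEntry id="ewn-fatigue-v">
    <Lemma partOfSpeech="v" writtenForm="fatigue"/>
    <Sense id="ewn-fatigue-v-00074774-12" synset="ewn-00074774-v"
           verbFrames="ewn-sb-10 ewn-sb-9" verbTemplates="ewn-st-15"/>
  </LexicalEntry>
</Lexicon>
"""

def resolve(root, sense, ref_attr, text_attr):
    """Follow the space-separated ID references in ref_attr and return
    the text stored in text_attr on each declaration, in reference order."""
    declared = {
        sb.get("id"): sb.get(text_attr)
        for sb in root.iter("SyntacticBehaviour")
    }
    return [declared[ref] for ref in sense.get(ref_attr, "").split()]

root = ET.fromstring(LEXICON)
entry = root.find("LexicalEntry")
sense = entry.find("Sense")
lemma = entry.find("Lemma").get("writtenForm")

frames = resolve(root, sense, "verbFrames", "subcategorizationFrame")
# Templates carry a %s slot that is filled with the lemma.
sentences = [t % lemma for t in resolve(root, sense, "verbTemplates", "sentenceTemplate")]
print(frames)
print(sentences)
```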

ISSUES

I am going to close this PR as it does not fix any reported issues.

Well I have reported a severe issue. See Issue #5. The files in EWN do not validate against the dc: namespace they mention in their header. Element defs are mistaken for attribute definitions.