Closed 1313ou closed 4 years ago
I am going to close this PR as it does not fix any reported issues.
Some comments:
senseidx: This is unnecessary, the <Sense> tags are already ordered within the <LexicalEntry>
lexid: This is a unique structural feature of Princeton WordNet that most wordnets do not support. It should be encoded within the id attribute
sensekey and pwn:sensekey: This makes no sense at all... sense keys are the IDs of sense in one project (Princeton WordNet), there is no need to have three different attributes for this.
tagCnt: The representation of frequency information would be useful, but this is a very basic way to do it. An issue needs to be created to discuss this properly.
adjPosition: This ignores an existing pull request (#9) to handle this.
Sorry to bring out a few things tucked under the carpet. Here are a few comments to your comments.
It should be encoded within the id attribute
Principle 1: XML IDs should not be parsed for information. They should be opaque to machine processing even though they can help troubleshooting. It can be considered bad practice, not to say a hack, to make sense of them. Information should reside in non-ID attributes and element text, not in ID attributes. To push the argument further:
Principle 2: It's best to keep processing local. You don't want to process a sense by having to explore its siblings, by accessing its parent, and then iterating the parent's children... If you must do it repeatedly, annotate senses.
Principle 3: Some information is lost when merging. Prevent this by making it explicit and immune to merging.
SENSEIDX
In the merged file (SenseRelations dropped for clarity), you'll find this:
<LexicalEntry id="ewn-abandon-v">
<Lemma partOfSpeech="v" writtenForm="abandon" />
<Sense id="ewn-abandon-v-02232813-01" synset="ewn-02232813-v" dc:identifier="abandon%2:40:00::"/>
<Sense id="ewn-abandon-v-02232523-01" synset="ewn-02232523-v" dc:identifier="abandon%2:40:01::"/>
<Sense id="ewn-abandon-v-02080923-03" synset="ewn-02080923-v" dc:identifier="abandon%2:38:00::"/>
<Sense id="ewn-abandon-v-00614907-01" synset="ewn-00614907-v" dc:identifier="abandon%2:31:01::"/>
<Sense id="ewn-abandon-v-00615748-01" synset="ewn-00615748-v" dc:identifier="abandon%2:31:00::"/>
</LexicalEntry>
How do you retrieve the rank of ewn-00615748-v in its lexical file (2nd sense) if you don't interpret IDs? Either you parse the dc:identifier, OR you jump to the LexicalEntry parent, iterate over the Sense children, dereference each child's synset attribute to get the Synset, read its dc:subject value to check whether the child sense belongs to the lexfile, increment the senseidx counter accordingly, and stop when the child sense is the target sense, OR you annotate before the information is lost.
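To show how much non-local work the second route involves, here is a minimal Python sketch of it. It assumes a trimmed-down merged file: the wrapping Lexicon element, the bare Synset elements, and their dc:subject values (verb.possession, verb.motion, verb.cognition) are filled in for illustration and are not copied from the actual EWN files. The function name sense_index_in_lexfile is hypothetical.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core namespace bound to dc:

# Trimmed sample; Synset elements and their dc:subject values are assumptions.
SAMPLE = """<Lexicon xmlns:dc="http://purl.org/dc/elements/1.1/">
  <LexicalEntry id="ewn-abandon-v">
    <Lemma partOfSpeech="v" writtenForm="abandon"/>
    <Sense id="ewn-abandon-v-02232813-01" synset="ewn-02232813-v" dc:identifier="abandon%2:40:00::"/>
    <Sense id="ewn-abandon-v-02232523-01" synset="ewn-02232523-v" dc:identifier="abandon%2:40:01::"/>
    <Sense id="ewn-abandon-v-02080923-03" synset="ewn-02080923-v" dc:identifier="abandon%2:38:00::"/>
    <Sense id="ewn-abandon-v-00614907-01" synset="ewn-00614907-v" dc:identifier="abandon%2:31:01::"/>
    <Sense id="ewn-abandon-v-00615748-01" synset="ewn-00615748-v" dc:identifier="abandon%2:31:00::"/>
  </LexicalEntry>
  <Synset id="ewn-02232813-v" dc:subject="verb.possession"/>
  <Synset id="ewn-02232523-v" dc:subject="verb.possession"/>
  <Synset id="ewn-02080923-v" dc:subject="verb.motion"/>
  <Synset id="ewn-00614907-v" dc:subject="verb.cognition"/>
  <Synset id="ewn-00615748-v" dc:subject="verb.cognition"/>
</Lexicon>"""


def sense_index_in_lexfile(root, sense_id):
    """Return the 1-based rank of a sense among its siblings from the same lexfile."""
    # Dereferencing synsets requires a document-wide lookup table.
    subject_of = {s.get("id"): s.get(f"{{{DC}}}subject") for s in root.iter("Synset")}
    for entry in root.iter("LexicalEntry"):
        senses = entry.findall("Sense")
        target = next((s for s in senses if s.get("id") == sense_id), None)
        if target is None:
            continue
        lexfile = subject_of[target.get("synset")]
        index = 0
        # Walk the siblings in document order, counting those in the same lexfile.
        for sense in senses:
            if subject_of[sense.get("synset")] == lexfile:
                index += 1
            if sense is target:
                return index
    raise KeyError(sense_id)


root = ET.fromstring(SAMPLE)
print(sense_index_in_lexfile(root, "ewn-abandon-v-00615748-01"))
```

With the assumed dc:subject values, the last sense comes out as rank 2 within its lexfile, yet it takes a parent lookup, a sibling scan, and one synset dereference per sibling to get there. An explicit senseidx attribute would reduce this to a single attribute read.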
LEXID
The lexid is used in generating the sensekey. It makes explicit what enters into sensekey generation. Not strictly needed.
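As a rough illustration of where lexid enters, here is a sketch that assembles a key in the Princeton WordNet sense key layout, lemma%ss_type:lex_filenum:lex_id:head_word:head_id. The function name make_sensekey and its parameters are hypothetical, and whether the pipeline in this PR builds keys exactly this way is an assumption.

```python
def make_sensekey(lemma, ss_type, lex_filenum, lex_id, head_word="", head_id=None):
    """Assemble a PWN-style sense key from its components.

    head_word and head_id are only filled in for adjective satellites;
    they stay empty otherwise, which yields the trailing '::'.
    """
    head_id_text = "" if head_id is None else f"{head_id:02d}"
    normalized = lemma.lower().replace(" ", "_")
    return f"{normalized}%{ss_type}:{lex_filenum:02d}:{lex_id:02d}:{head_word}:{head_id_text}"


# lex_id is the third numeric field: it distinguishes senses of the
# same lemma that live in the same lexicographer file.
print(make_sensekey("abandon", 2, 40, 0))
print(make_sensekey("abandon", 2, 40, 1))
```

The two calls reproduce the first two dc:identifier values of the abandon entry, which differ only in the lexid field.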
SENSEKEYS
sensekey and pwn:sensekey: This makes no sense at all... sense keys are the IDs of sense in one project (Princeton WordNet), there is no need to have three different attributes for this.
Sorry, it does make some sense. dc:identifier is out (that one does not make sense), renamed to pwn:sensekey. That leaves us with two sensekeys, not three: pwn:sensekey to embody the foreign key, and sensekey to implement the inner key.
GENERATED/ANNOTATED vs NATIVE
It won't have escaped you that there are two sets of XSDs. The 1.1 and 2.0 series are for lexicographer files. The 1.10 and 2.10 series are for augmented/annotated files.
lexfile, lexid, senseidx, sensekey are generated. They augment the information but they are computable. They are annotations produced by the pipeline. As such they don't have to be in the lexicographer files. They belong in the 1.10 and 2.10 schemas.
adjposition and tagcnt are native. They are just imported here because they are present in PWN and EWN lacks them. They belong in the 1.1 and 2.0 schemas.
TAGCNT
Imported from PWN. Very useful when you want to swiftly sort senses by usage.
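A minimal sketch of that sort, assuming the count is carried as a tagcnt attribute on each Sense (the attribute name follows this PR; the counts below are made up for illustration and are not PWN figures):

```python
import xml.etree.ElementTree as ET

# Illustrative entry; the tagcnt values are invented for this example.
ENTRY = """<LexicalEntry id="ewn-abandon-v">
  <Sense id="ewn-abandon-v-02232813-01" tagcnt="3"/>
  <Sense id="ewn-abandon-v-02232523-01" tagcnt="10"/>
  <Sense id="ewn-abandon-v-02080923-03"/>
</LexicalEntry>"""


def senses_by_usage(entry):
    """Return the entry's senses, most frequently tagged first.

    A missing tagcnt is treated as zero; ties keep document order
    because sorted() is stable.
    """
    return sorted(
        entry.findall("Sense"),
        key=lambda sense: int(sense.get("tagcnt", "0")),
        reverse=True,
    )


for sense in senses_by_usage(ET.fromstring(ENTRY)):
    print(sense.get("id"), sense.get("tagcnt", "0"))
```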
REDUNDANCY
While we are at it.
Why a "dc:subject" when this is the name of the file? What's the point of repeating "noun.object" in all the synsets in wn-noun.xml? Wouldn't that be better handled by the merge processor?
Why repeat verb frame text throughout the files? Better to declare the frames once, and then reference them.
ISSUES
I am going to close this PR as it does not fix any reported issues.
Well, I have reported a severe issue. See Issue #5. The files in EWN do not validate against the dc: namespace they mention in their header. Element definitions are mistaken for attribute definitions.
This completes the evolution. Please refer to: global README, specific README
EWN would look like this:
N O U N S
V E R B S
A D J S