omwn / omw-data

This packages up data for the Open Multilingual Wordnet

WN30: sense_number info needed for NLTK-style synset names #12

Closed goodmami closed 3 years ago

goodmami commented 3 years ago

The NLTK uses names like spartan.a.01 for convenient synset lookup. To facilitate compatibility with the NLTK when using a WN-LMF lexicon, I'd like to be able to recreate or store these names. The lemma part is the first lemma listed in a synset's entry in a WNDB data file, lowercased, e.g.:

$ grep -Pi ' spartan .*\|' data.*
data.adj:00009618 00 s 04 ascetic 0 ascetical 0 austere 0 spartan 0 004 & 00009046 a 0000 + 04881998 n 0301 + 09758173 n 0202 + 09758173 n 0102 | practicing great self-denial; "Be systematically ascetic...do...something for no other reason than that you would rather not do it"- William James; "a desert nomad's austere life"; "a spartan diet"; "a spartan existence"  
data.adj:01301316 00 s 02 severe 2 spartan 0 003 & 01299888 a 0000 + 04639732 n 0102 + 04639732 n 0101 | unsparing and uncompromising in discipline or judgment; "a parent severe to the pitch of hostility"- H.G.Wells; "a hefty six-footer with a rather severe mien"; "a strict disciplinarian"; "a Spartan upbringing"  
data.adj:01991462 00 s 01 spartan 0 001 & 01989669 a 0000 | resolute in the face of pain or danger or adversity; "spartan courage"  
data.adj:02972690 01 a 01 Spartan 0 002 + 08787240 n 0101 \ 08787240 n 0101 | of or relating to or characteristic of Sparta or its people  
data.noun:09711661 18 n 01 Spartan 0 002 @ 09710164 n 0000 #m 08787240 n 0000 | a resident of Sparta  
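As a rough sketch of the lemma part, the first word of a data file entry can be pulled out by splitting the line on whitespace. The function name `first_lemma` is hypothetical, and the normalization (lowercasing, and dropping an adjective marker such as `(p)`) is an assumption based on names like spartan.n.01, not a statement of what the NLTK does internally.

```python
import re

def first_lemma(data_line):
    """Return (offset, ss_type, name_lemma) for one WNDB data.* line."""
    # Everything before the first "|" is structured fields; the rest is the gloss.
    fields = data_line.split("|", 1)[0].split()
    offset, _lex_filenum, ss_type, w_cnt = fields[:4]
    if int(w_cnt, 16) < 1:  # w_cnt is a two-digit hexadecimal count
        raise ValueError("synset entry lists no words")
    word = fields[4]  # the first word; its lex_id follows in fields[5]
    # Assumed normalization: strip a trailing adjective marker, then lowercase.
    word = re.sub(r"\((a|p|ip)\)$", "", word)
    return offset, ss_type, word.lower()

line = ("09711661 18 n 01 Spartan 0 002 @ 09710164 n 0000 "
        "#m 08787240 n 0000 | a resident of Sparta")
print(first_lemma(line))  # ('09711661', 'n', 'spartan')
```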

Compare to:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('spartan')
[Synset('spartan.n.01'), Synset('spartan.a.01'), Synset('spartan.s.02'), Synset('severe.s.04'), Synset('ascetic.s.02')]

These lemmas can be obtained by looking at the first sense in the members attribute of a <Synset> in a WN-LMF 1.1 lexicon.

The numeric portion of the name is the 1-based index of the synset offset for that lemma in the index file (e.g., see that the 01301316 sense of "severe" is the 4th offset in the index, hence severe.s.04):

$ grep -P '^(spartan [an] |ascetic a |severe a )' index.*
index.adj:ascetic a 2 3 & \ + 2 0 02644177 00009618  
index.adj:severe a 6 2 & + 6 5 01513050 02322512 01792387 01301316 00651039 01129185  
index.adj:spartan a 4 3 & \ + 4 0 02972690 01991462 01301316 00009618  
index.noun:spartan n 1 2 @ #m 1 0 09711661  
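The numeric part can be sketched the same way from an index line: skip the pointer symbols and the two counts that follow them, then take the 1-based position of the synset offset. The helper names `index_offsets` and `nltk_style_name` are hypothetical, and the sketch assumes well-formed index lines like the ones above.

```python
def index_offsets(index_line):
    """Return (lemma, pos, offsets) for one WNDB index.* line."""
    fields = index_line.split()
    lemma, pos = fields[0], fields[1]
    synset_cnt, p_cnt = int(fields[2]), int(fields[3])
    # Skip the p_cnt pointer symbols, then sense_cnt and tagsense_cnt.
    start = 4 + p_cnt + 2
    return lemma, pos, fields[start:start + synset_cnt]

def nltk_style_name(lemma, ss_type, offset, offsets):
    """Build a name like 'severe.s.04' from the offset's position in the index."""
    sense_number = offsets.index(offset) + 1  # 1-based
    return f"{lemma}.{ss_type}.{sense_number:02d}"

severe = "severe a 6 2 & + 6 5 01513050 02322512 01792387 01301316 00651039 01129185"
lemma, _pos, offsets = index_offsets(severe)
print(nltk_style_name(lemma, "s", "01301316", offsets))  # severe.s.04
```

Note that the ss_type comes from the data file entry rather than from the index line, since the index lists satellite adjectives (s) under a.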

However, this index information is not present in a WN-LMF file. It's not sufficient to look at the index of the <Sense> in a <LexicalEntry>, because WNDB's index files group senses under case-insensitive lemmas, whereas WN-LMF keeps a separate <LexicalEntry> for each written form:

$ grep 'writtenForm="[Ss]partan"' -B1 omw-data/release/pwn30/wn30.xml 
    <LexicalEntry id="pwn-spartan-s" >
      <Lemma writtenForm="spartan" partOfSpeech="s" />
--
    <LexicalEntry id="pwn-Spartan-n" >
      <Lemma writtenForm="Spartan" partOfSpeech="n" />
--
    <LexicalEntry id="pwn-Spartan-a" >
      <Lemma writtenForm="Spartan" partOfSpeech="a" />
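To make the mismatch concrete, here is a small sketch that groups WN-LMF entries the way a WNDB index line would: by lowercased lemma, with satellite adjectives (s) folded into a. The XML is a hand-made fragment mirroring the grep output above (closing tags added so it parses), and `group_like_index` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

FRAGMENT = """
<Lexicon>
  <LexicalEntry id="pwn-spartan-s"><Lemma writtenForm="spartan" partOfSpeech="s" /></LexicalEntry>
  <LexicalEntry id="pwn-Spartan-n"><Lemma writtenForm="Spartan" partOfSpeech="n" /></LexicalEntry>
  <LexicalEntry id="pwn-Spartan-a"><Lemma writtenForm="Spartan" partOfSpeech="a" /></LexicalEntry>
</Lexicon>
"""

def group_like_index(root):
    """Group entry IDs by (lowercased lemma, index-style part of speech)."""
    groups = defaultdict(list)
    for entry in root.iter("LexicalEntry"):
        lemma = entry.find("Lemma")
        pos = lemma.get("partOfSpeech")
        key = (lemma.get("writtenForm").lower(), "a" if pos == "s" else pos)
        groups[key].append(entry.get("id"))
    return dict(groups)

groups = group_like_index(ET.fromstring(FRAGMENT))
print(groups[("spartan", "a")])  # ['pwn-spartan-s', 'pwn-Spartan-a']
```

The single "spartan a" index line therefore spans two entries here, so a sense's position within either entry cannot be assumed to match its position in the index.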

Proposals

  1. The sense_number of the first lemma could be encoded in the Synset ID (e.g., wn30-01301316-04-s) or the ID could just be the NLTK-style name (wn30-severe.s.04). The problem with this is that it becomes difficult or impossible to look up synsets only by offset and ss_type, which is necessary for reading Information Content (IC) files. We might also need to parse the IDs to get the name again, when IDs should be opaque.
  2. We could use the n attribute on <Sense> in the WN-LMF 1.1 "relaxed" schema to encode the sense_number. The problem with this is it legitimizes a crutch to keep compatibility with WNDB databases, while I feel that WN-LMF should stand on its own. Compatibility is the reason for this issue, but at the same time I'd like to remain forward-thinking. I'm also not sure that's the intended use of the attribute (I thought it was just for sense ranking in synsets; that is, the members attribute on <Synset> obviated the need for n on <Sense>).
  3. We could use something like dc:identifier to encode the NLTK-style name, e.g., <Synset ... dc:identifier="severe.s.04">. The problem with this is it's assigning an ad hoc interpretation to a generic attribute.

Of these, I have a slight preference for 3 over 2, while 1 is just a non-proposal to illustrate why it's a bad idea.
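For illustration only, a consumer of proposal 3 might read the name as below. Everything in the fragment is hypothetical: the synset IDs, the use of dc:identifier for this purpose, and the namespace URI bound to the dc prefix are all assumptions made for the sketch.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the dc prefix; check the lexicon's own declaration.
DC = "https://globalwordnet.github.io/schemas/dc/"

FRAGMENT = f"""
<Lexicon xmlns:dc="{DC}">
  <Synset id="wn30-01301316-s" dc:identifier="severe.s.04" />
  <Synset id="wn30-00009618-s" dc:identifier="ascetic.s.02" />
</Lexicon>
"""

root = ET.fromstring(FRAGMENT)
# ElementTree exposes namespaced attributes as "{uri}localname".
names = {ss.get("id"): ss.get(f"{{{DC}}}identifier") for ss in root.iter("Synset")}
print(names["wn30-01301316-s"])  # severe.s.04
```

The IDs stay opaque in this arrangement, and the name is stored rather than recomputed.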