The NLTK uses names like spartan.a.01 for convenient synset lookup. To facilitate compatibility with the NLTK when using a WN-LMF lexicon, I'd like to be able to recreate or store these names. The lemma part is the first lemma listed in a WNDB data file (synset) entry, e.g.:
$ grep -Pi ' spartan .*\|' data.*
data.adj:00009618 00 s 04 ascetic 0 ascetical 0 austere 0 spartan 0 004 & 00009046 a 0000 + 04881998 n 0301 + 09758173 n 0202 + 09758173 n 0102 | practicing great self-denial; "Be systematically ascetic...do...something for no other reason than that you would rather not do it"- William James; "a desert nomad's austere life"; "a spartan diet"; "a spartan existence"
data.adj:01301316 00 s 02 severe 2 spartan 0 003 & 01299888 a 0000 + 04639732 n 0102 + 04639732 n 0101 | unsparing and uncompromising in discipline or judgment; "a parent severe to the pitch of hostility"- H.G.Wells; "a hefty six-footer with a rather severe mien"; "a strict disciplinarian"; "a Spartan upbringing"
data.adj:01991462 00 s 01 spartan 0 001 & 01989669 a 0000 | resolute in the face of pain or danger or adversity; "spartan courage"
data.adj:02972690 01 a 01 Spartan 0 002 + 08787240 n 0101 \ 08787240 n 0101 | of or relating to or characteristic of Sparta or its people
data.noun:09711661 18 n 01 Spartan 0 002 @ 09710164 n 0000 #m 08787240 n 0000 | a resident of Sparta
Compare to:
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('spartan')
[Synset('spartan.n.01'), Synset('spartan.a.01'), Synset('spartan.s.02'), Synset('severe.s.04'), Synset('ascetic.s.02')]
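As a rough sketch of how the lemma part can be read from a WNDB data line like those in the grep output above: the words alternate with their lex_id values, and w_cnt is a hexadecimal count. The function name `parse_data_line` is made up for illustration, the `data.adj:` prefix added by grep is assumed to be stripped already, and the gloss is abbreviated.

```python
def parse_data_line(line):
    """Split a WNDB data line into (offset, ss_type, lemmas)."""
    # Everything before the first '|' is structured data; the rest is the gloss.
    head = line.split("|", 1)[0].split()
    offset, _lex_filenum, ss_type, w_cnt = head[:4]
    n_words = int(w_cnt, 16)  # w_cnt is a two-digit hexadecimal number
    # Words alternate with their lex_id: word lex_id word lex_id ...
    lemmas = head[4 : 4 + 2 * n_words : 2]
    return offset, ss_type, lemmas


line = (
    "01301316 00 s 02 severe 2 spartan 0 003 & 01299888 a 0000 "
    "+ 04639732 n 0102 + 04639732 n 0101 | unsparing and uncompromising"
)
offset, ss_type, lemmas = parse_data_line(line)
print(offset, ss_type, lemmas[0])  # 01301316 s severe
```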
These lemmas can be obtained by looking at the first sense in the members attribute of a <Synset> in a WN-LMF 1.1 lexicon.
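A minimal sketch of that lookup, assuming (as described above) that members is a space-separated list of sense IDs in order. The XML fragment is not a complete or valid lexicon, and every ID in it is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; all IDs are invented for illustration only.
XML = """<Lexicon id="wn30">
  <LexicalEntry id="wn30-severe-s">
    <Lemma writtenForm="severe" partOfSpeech="s"/>
    <Sense id="wn30-severe-s-01301316" synset="wn30-01301316-s"/>
  </LexicalEntry>
  <LexicalEntry id="wn30-spartan-s">
    <Lemma writtenForm="spartan" partOfSpeech="s"/>
    <Sense id="wn30-spartan-s-01301316" synset="wn30-01301316-s"/>
  </LexicalEntry>
  <Synset id="wn30-01301316-s" partOfSpeech="s"
          members="wn30-severe-s-01301316 wn30-spartan-s-01301316"/>
</Lexicon>"""


def first_lemmas(lexicon):
    """Map each synset ID to the written form of its first member sense."""
    # Build a lookup from sense ID to the lemma of its lexical entry.
    sense_to_lemma = {}
    for entry in lexicon.iter("LexicalEntry"):
        form = entry.find("Lemma").get("writtenForm")
        for sense in entry.iter("Sense"):
            sense_to_lemma[sense.get("id")] = form
    result = {}
    for synset in lexicon.iter("Synset"):
        members = synset.get("members", "").split()
        if members:  # skip synsets that do not declare members
            result[synset.get("id")] = sense_to_lemma[members[0]]
    return result


lexicon = ET.fromstring(XML)
print(first_lemmas(lexicon))  # {'wn30-01301316-s': 'severe'}
```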
The numeric portion of the name is the 1-based index of the synset offset for that lemma in the index file (e.g., see that the 01301316 sense of "severe" is the 4th offset in the index, hence severe.s.04):
$ grep -P '^(spartan [an] |ascetic a |severe a )' index.*
index.adj:ascetic a 2 3 & \ + 2 0 02644177 00009618
index.adj:severe a 6 2 & + 6 5 01513050 02322512 01792387 01301316 00651039 01129185
index.adj:spartan a 4 3 & \ + 4 0 02972690 01991462 01301316 00009618
index.noun:spartan n 1 2 @ #m 1 0 09711661
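To make the numbering concrete, here is a small sketch that derives an NLTK-style name from an index line. It assumes the `index.adj:` prefix from grep has been removed and that the ss_type comes from the data file, since the index only records `a` for both `a` and `s` synsets. The helper names are hypothetical.

```python
def parse_index_line(line):
    """Return (lemma, pos, offsets) from a WNDB index line."""
    fields = line.split()
    lemma, pos, synset_cnt = fields[0], fields[1], int(fields[2])
    # The synset offsets are the last synset_cnt fields on the line.
    offsets = fields[-synset_cnt:]
    return lemma, pos, offsets


def nltk_style_name(index_line, offset, ss_type):
    """Build a name like severe.s.04 from the offset's 1-based position."""
    lemma, _pos, offsets = parse_index_line(index_line)
    sense_number = offsets.index(offset) + 1
    return f"{lemma}.{ss_type}.{sense_number:02d}"


severe = "severe a 6 2 & + 6 5 01513050 02322512 01792387 01301316 00651039 01129185"
ascetic = r"ascetic a 2 3 & \ + 2 0 02644177 00009618"
print(nltk_style_name(severe, "01301316", "s"))   # severe.s.04
print(nltk_style_name(ascetic, "00009618", "s"))  # ascetic.s.02
```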
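However, this index information is not present in a WN-LMF file. It is not sufficient to look at the position of the <Sense> within its <LexicalEntry>, because WNDB's index files group senses by case-insensitive lemma, whereas WN-LMF keeps differently cased lemmas in separate entries. For example, the index.adj line for spartan above also includes 02972690, whose lemma is the capitalized Spartan.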
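The mismatch can be illustrated with the adjective offsets from the index line above. The per-entry layout below is an assumption about how a WN-LMF lexicon might split the senses (one entry per written form, senses in the same relative order as the index); it is not taken from a real file.

```python
# Hypothetical WN-LMF-like layout: one list of synset offsets per written form.
lmf_entries = {
    "Spartan": ["02972690"],
    "spartan": ["01991462", "01301316", "00009618"],
}

# Offsets from the WNDB index.adj line for "spartan" (case-insensitive lemma).
wndb_index = ["02972690", "01991462", "01301316", "00009618"]


def per_entry_number(entries, written_form, offset):
    """1-based position of a sense within its own lexical entry."""
    return entries[written_form].index(offset) + 1


def wndb_number(index_offsets, offset):
    """1-based position of the offset in the merged WNDB index line."""
    return index_offsets.index(offset) + 1


offset = "01991462"  # the synset NLTK calls spartan.s.02
print(per_entry_number(lmf_entries, "spartan", offset))  # 1
print(wndb_number(wndb_index, offset))                   # 2
```

The per-entry position would yield spartan.s.01, which does not match the name NLTK reports.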
Proposals

1. The sense_number of the first lemma could be encoded in the synset ID (e.g., wn30-01301316-04-s), or the ID could simply be the NLTK-style name (wn30-severe.s.04). The problem is that it becomes difficult or impossible to look up synsets by offset and ss_type alone, which is necessary for reading Information Content (IC) files. We might also need to parse the IDs to recover the name, when IDs should be opaque.
2. We could use the n attribute on <Sense> in the WN-LMF 1.1 "relaxed" schema to encode the sense_number. The problem is that this legitimizes a crutch for compatibility with WNDB databases, while I feel that WN-LMF should stand on its own. Compatibility is the reason for this issue, but I'd also like to remain forward-thinking. I'm also not sure that's the intended use of the attribute (I thought it was just for sense ranking in synsets; that is, the members attribute on <Synset> obviated the need for n on <Sense>).
3. We could use something like dc:identifier to encode the NLTK-style name, e.g., <Synset ... dc:identifier="severe.s.04">. The problem is that this assigns an ad hoc interpretation to a generic attribute.
Of these, I have a slight preference for 3 over 2, while 1 is just a non-proposal to illustrate why it's a bad idea.
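For reference, a consumer of proposal 3 might build a name-to-ID table along these lines. This is only a sketch: the namespace URI bound to the dc prefix and the synset IDs are assumptions, and the fragment is not a complete lexicon.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the dc prefix; adjust to whatever the lexicon declares.
DC = "https://globalwordnet.github.io/schemas/dc/"

# Hypothetical fragment; synset IDs are invented for illustration only.
XML = f"""<Lexicon xmlns:dc="{DC}" id="wn30">
  <Synset id="wn30-01301316-s" partOfSpeech="s" dc:identifier="severe.s.04"/>
  <Synset id="wn30-01991462-s" partOfSpeech="s" dc:identifier="spartan.s.02"/>
</Lexicon>"""


def build_name_index(lexicon):
    """Map NLTK-style names stored in dc:identifier to synset IDs."""
    index = {}
    for synset in lexicon.iter("Synset"):
        # ElementTree exposes namespaced attributes as {uri}localname.
        name = synset.get(f"{{{DC}}}identifier")
        if name is not None:
            index[name] = synset.get("id")
    return index


names = build_name_index(ET.fromstring(XML))
print(names["severe.s.04"])  # wn30-01301316-s
```

The synset IDs stay opaque here, and lookup by offset and ss_type remains possible through whatever scheme the IDs already follow.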