nltk / wordnet

Stand-alone WordNet API

Specify language when instantiating WordNet object #19

Closed by goodmami 3 years ago

goodmami commented 4 years ago

As I understand it, creating a WordNet object always loads the English data, and calling a method with lang=xyz, where xyz is not 'eng', also loads the data for that language.

Why not make lang a parameter of the WordNet class, so that it only loads the data for that language, and then remove the parameter from its methods? This might also help avoid some if lang == 'eng' checks within the functions. Working with multiple wordnets would then just be a matter of instantiating one WordNet object per language:

>>> pwn = WordNet(lang='eng')  # or just WordNet(), perhaps
>>> pwn.synsets('dog')
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.02'), Synset('frank.n.01'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]
>>> jwn = WordNet(lang='jpn')
>>> jwn.synsets('犬')
[Synset('dog.n.01'), Synset('spy.n.01')]

(Note: this example is illustrative; I cannot actually query for '犬' in Japanese because of #20.)

Of course, this would break some backward compatibility.

alvations commented 4 years ago

IIRC, WordNet concepts are not meant to have "languages"; the language comes in only at the lemma level. Synsets are considered technically "universal", and lemmas are realizations of synsets in specific languages.

But I'm not good with the WordNet philosophy. Maybe @fcbond has a better idea.

goodmami commented 4 years ago

My proposal was more for practical purposes. Currently, when creating a WordNet object it loads the English data, even if the user never wanted to use English. Only when they try something like ss.lemmas(lang=...) does the language get loaded. This seems inefficient to me.
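
To make the loading behavior concrete, here is a minimal sketch of the pattern as I understand it. The class, the loader callable, and the data are all made up for illustration; none of this is the actual nltk code.

```python
class LazyWordNet:
    """Toy stand-in for a WordNet reader (hypothetical, not the nltk API)."""

    def __init__(self, loader):
        self._loader = loader   # hypothetical callable: lang -> {synset_id: lemmas}
        self._data = {}         # per-language data loaded so far
        self._load('eng')       # English is always loaded at construction

    def _load(self, lang):
        # Load a language on first use and cache it afterwards.
        if lang not in self._data:
            self._data[lang] = self._loader(lang)
        return self._data[lang]

    def lemmas(self, synset_id, lang='eng'):
        # Any non-English language is only loaded when first requested.
        return self._load(lang).get(synset_id, [])

    def loaded_languages(self):
        return sorted(self._data)


# Made-up in-memory data standing in for the real corpus files.
FAKE_DATA = {
    'eng': {'dog.n.01': ['dog', 'domestic_dog']},
    'jpn': {'dog.n.01': ['犬']},
}

reader = LazyWordNet(FAKE_DATA.__getitem__)
print(reader.loaded_languages())               # ['eng']
print(reader.lemmas('dog.n.01', lang='jpn'))   # ['犬']
print(reader.loaded_languages())               # ['eng', 'jpn']
```

In this sketch a user who only wants Japanese still pays for the English load in the constructor, which is the cost I am asking about.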

If someone cares about using the universality to, e.g., look up a synset from a wordnet in one language and then list lemmas in another language, it seems clearer to do something like this:

>>> pwn = WordNet(lang='eng')
>>> jwn = WordNet(lang='jpn')
>>> for ss in pwn.synsets('dog'):
...     for lemma in jwn.synset(ss).lemmas():
...         print(lemma)

But I, too, would like to get Francis's take here. @fcbond, care to comment?

fcbond commented 4 years ago

In OMW 1.0, the structure comes entirely from PWN, so without the English wordnet there are no semantic relations and no synset nodes to attach the lemmas to. So we have to load English first.
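
A rough sketch of that dependency, using a toy dictionary rather than the real data files. The layout, the function name, and the rows are assumptions for illustration only.

```python
# Toy model of the OMW 1.0 layout (hypothetical structure, not real files).
# Synset nodes and their relations come from the English wordnet (PWN) only.
PWN_SYNSETS = {
    'dog.n.01': {'hypernyms': ['canine.n.02'], 'lemmas': {'eng': ['dog']}},
    'canine.n.02': {'hypernyms': [], 'lemmas': {'eng': ['canine']}},
}


def attach_lemmas(synsets, lang, rows):
    """Attach (synset_id, lemma) rows for `lang` to existing synset nodes.

    Returns the rows that could not be attached because no node exists.
    """
    skipped = []
    for synset_id, lemma in rows:
        node = synsets.get(synset_id)
        if node is None:
            # Without the PWN node there is nothing to hang the lemma on.
            skipped.append((synset_id, lemma))
            continue
        node['lemmas'].setdefault(lang, []).append(lemma)
    return skipped


jpn_rows = [('dog.n.01', '犬')]

# English not loaded: every row is skipped.
print(attach_lemmas({}, 'jpn', jpn_rows))           # [('dog.n.01', '犬')]

# English loaded first: the lemma attaches to the existing node.
print(attach_lemmas(PWN_SYNSETS, 'jpn', jpn_rows))  # []
print(PWN_SYNSETS['dog.n.01']['lemmas'])            # {'eng': ['dog'], 'jpn': ['犬']}
```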

-- Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University

goodmami commented 4 years ago

Ok that makes sense then. Thanks for explaining.

It seems like we could still use the API I proposed above, in anticipation of a world where the shared concept structure is detached from the PWN. It would let us perform operations for other languages without having to specify lang every time, but it wouldn't gain anything in, e.g., space efficiency.
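
Something like the following is what I have in mind: a thin view that pre-fills lang on top of one shared object. ToyWordNet and LanguageView are hypothetical names and the index is made-up data; this is only a sketch of the idea, not a patch against the current code.

```python
import functools


class ToyWordNet:
    """Hypothetical stand-in whose methods take lang=, like the current API."""

    _index = {
        'eng': {'dog': ['dog.n.01']},
        'jpn': {'犬': ['dog.n.01']},
    }

    def synsets(self, word, lang='eng'):
        return self._index[lang].get(word, [])


class LanguageView:
    """Binds one language so callers need not repeat lang=... (hypothetical)."""

    def __init__(self, wordnet, lang):
        self._wordnet = wordnet
        self.lang = lang

    def __getattr__(self, name):
        # Only called for names not found on the view itself; forwards the
        # method to the shared object with lang pre-filled. Assumes every
        # forwarded attribute is a method that accepts a lang keyword.
        method = getattr(self._wordnet, name)
        return functools.partial(method, lang=self.lang)


shared = ToyWordNet()              # one underlying data store
pwn = LanguageView(shared, 'eng')
jwn = LanguageView(shared, 'jpn')

print(pwn.synsets('dog'))          # ['dog.n.01']
print(jwn.synsets('犬'))           # ['dog.n.01']
```

Both views point at the same underlying object, so this only saves typing and does not reduce what gets loaded.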

goodmami commented 3 years ago

I'm closing this as it's handled by https://github.com/goodmami/wn

>>> import wn
>>> wn.words('chat')  # returns both French and English
[Word('ewn-chat-n'), Word('ewn-chat-v'), Word('frawn-lex14803'), Word('frawn-lex21897')]
>>> ewn = wn.WordNet(lgcode='en')
>>> ewn.words('chat')  # only returns English
[Word('ewn-chat-n'), Word('ewn-chat-v')]