nltk / wordnet

Stand-alone WordNet API

Errors when requesting synsets for other languages #20

Closed goodmami closed 4 years ago

goodmami commented 4 years ago

When I try to request a synset in another language (I've tried jpn and spa), I get the following error:

>>> from wn import WordNet
>>> wn = WordNet()
>>> wn.synsets('犬', lang='jpn')
Traceback (most recent call last):
  File "/home/mwg/repos/wordnet/wn/__init__.py", line 113, in synset_from_pos_and_offset
    return _synset_offset_cache[pos][offset]
KeyError: 10641755

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mwg/repos/wordnet/wn/__init__.py", line 165, in synsets
    for p, offset in set(list_of_offsets)]
  File "/home/mwg/repos/wordnet/wn/__init__.py", line 165, in <listcomp>
    for p, offset in set(list_of_offsets)]
  File "/home/mwg/repos/wordnet/wn/__init__.py", line 119, in synset_from_pos_and_offset
    raise WordNetError('Part-of-Speech and Offset combination not found in WordNet: {} + {}'.format(pos, offset))
wn.utils.WordNetError: Part-of-Speech and Offset combination not found in WordNet: n + 10641755

I confirmed that 10641755-n does exist in the corresponding .tab file and that the line is read in, but I wasn't able to figure out why it doesn't make it to the internal dictionaries. I thought it might be a failure to cast the offset to an int in some place, but that didn't seem to be the case. The abuse of __builtins__ made the code a bit hard to follow.

alvations commented 4 years ago

This is awkward... The wn library is using Princeton WordNet 3.3 but OMW lemmas seem to be indexed with 3.0 =(

Okay, I've confirmed that the synset offsets in Princeton 3.0 and 3.3 differ for wn.synset('dog.n.01').

>>> from wn import WordNet
>>> wn = WordNet()
>>> wn.synset('dog.n.01').offset()
2088569

>>> from nltk.corpus import wordnet as nltk_wn
>>> nltk_wn.synset('dog.n.01').offset()
2084071

And in raw files:

~/git-stuff/wordnet/wn/data$ grep "2088569" wordnet-3.3/index.noun 
canis_familiaris n 1 4 @ #m ~ %p 1 0 02088569
dog n 7 5 ~ %p #m @ #p 7 0 02088569 10143371 10052157 09915066 07696464 03910262 02714792
domestic_dog n 1 4 @ #m ~ %p 1 0 02088569

~/git-stuff/wordnet/wn/data$ grep "2084071" wordnet-3.0/index.noun 
canis_familiaris n 1 4 @ ~ #m %p 1 0 02084071  
dog n 7 5 @ ~ #m #p %p 7 1 02084071 10114209 10023039 09886220 07676602 03901548 02710044  
domestic_dog n 1 4 @ ~ #m %p 1 0 02084071 
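The same comparison can be made programmatically. Below is a minimal sketch, assuming the standard WordNet index line layout (`lemma pos synset_cnt p_cnt [ptr_symbol...] sense_cnt tagsense_cnt synset_offset...`); `IndexEntry` and `parse_index_line` are hypothetical helper names, not part of the wn library. It parses the two `dog` lines quoted above and checks whether any offset is shared between the two files.

```python
from typing import NamedTuple, Tuple


class IndexEntry(NamedTuple):
    lemma: str
    pos: str
    offsets: Tuple[int, ...]


def parse_index_line(line: str) -> IndexEntry:
    """Parse one line of an index.<pos> file into lemma, pos and synset offsets."""
    fields = line.split()
    lemma, pos = fields[0], fields[1]
    synset_cnt = int(fields[2])
    p_cnt = int(fields[3])
    # Skip the p_cnt pointer symbols, then sense_cnt and tagsense_cnt.
    first_offset = 4 + p_cnt + 2
    offsets = tuple(int(f) for f in fields[first_offset:first_offset + synset_cnt])
    return IndexEntry(lemma, pos, offsets)


# The two "dog" lines from the grep output above.
line_33 = "dog n 7 5 ~ %p #m @ #p 7 0 02088569 10143371 10052157 09915066 07696464 03910262 02714792"
line_30 = "dog n 7 5 @ ~ #m #p %p 7 1 02084071 10114209 10023039 09886220 07676602 03901548 02710044"

new = parse_index_line(line_33)
old = parse_index_line(line_30)

# Offsets present in both files for this lemma.
shared = set(new.offsets) & set(old.offsets)
print(new.offsets[0], old.offsets[0], shared)
```

For this lemma the two files have no offset in common, so any resource keyed on one file's offsets cannot be looked up in the other.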

goodmami commented 4 years ago

Yes, at some point I think OMW plans to update to a newer index, but I'm not sure exactly when or for which version. @fcbond should know.

alvations commented 4 years ago

Any idea why the indices of synsets would be moved?

And what's the behavior of the indexing system? I.e., if a synset is assigned a new offset, is the old one obsolete? Are there any docs on how these indices/offsets are organized? @jmccrae @fcbond

Also, is there an official mapping from 3.0 -> 3.3 ?

alvations commented 4 years ago

Maybe backing off to WN 3.0 is sound for now, esp. since the info content data are based on the 3.0 indices.

jmccrae commented 4 years ago

Not sure I understand enough about this project to comment too much. There is no Princeton WordNet 3.3 (or 3.2 for that matter). The last version is 3.1. There is an English WordNet release that has built on Princeton WordNet, so maybe this is what you are referring to.

At any rate, by design each new version of Princeton WordNet changes all the synset identifier numbers (not a good design).
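Since offsets are only meaningful within one release, one way to catch this kind of mismatch early is to record the release alongside the offset cache. The following is only a sketch under that assumption; `SynsetIndex`, its `lookup` method, and the version labels are hypothetical and do not reflect how the wn library is actually implemented.

```python
class SynsetIndex:
    """Hypothetical offset cache that remembers which release it was built from."""

    def __init__(self, version, offsets_by_pos):
        self.version = version
        self._cache = offsets_by_pos  # {pos: {offset: synset name}}

    def lookup(self, pos, offset, source_version):
        # Refuse offsets that were indexed against a different release,
        # instead of failing later with an opaque KeyError.
        if source_version != self.version:
            raise ValueError(
                "offset {}-{} comes from version {!r}, but this index was "
                "built from {!r}".format(offset, pos, source_version, self.version)
            )
        return self._cache[pos][offset]


# Offset for dog.n.01 as reported for WordNet 3.0 earlier in this thread.
index = SynsetIndex("3.0", {"n": {2084071: "dog.n.01"}})
name = index.lookup("n", 2084071, source_version="3.0")

try:
    index.lookup("n", 2084071, source_version="ewn-dev")  # made-up label
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```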

alvations commented 4 years ago

Thanks @jmccrae for the clarification!!!

Oh, https://github.com/globalwordnet/english-wordnet is not Princeton? I was using the link from https://github.com/globalwordnet/english-wordnet that pointed to (broken now though) http://server1.nlp.insight-centre.org/enwordnet-update/english-wordnet-3.3.zip , assuming that that is Princeton @_@

jmccrae commented 4 years ago

English WordNet is not Princeton (it is a fork, as described in the README). The link you are using points to a live (and continuously changing) development version of the English WordNet. The link should be back in a few minutes, as it is auto-generated from commits.

alvations commented 4 years ago

Got it @jmccrae. Thank you again for helping to clarify!

@goodmami Backing off all versions to Princeton 3.0 first so that compatibility with OMW is kept, and we'll try to plan for incorporating other wordnets, e.g. English WordNet, in the future.

alvations commented 4 years ago

Resolved by #21; the latest version, wn==0.0.23, should fix this.

Thank you @goodmami @jmccrae !!