goodmami closed this 4 years ago
This is awkward... The wn library is using Princeton WordNet 3.3, but OMW lemmas seem to be indexed with 3.0 =(
Okay, I've confirmed that the synset offset for wn.synset('dog.n.01') differs between Princeton 3.0 and 3.3.
>>> from wn import WordNet
>>> wn = WordNet()
>>> wn.synset('dog.n.01').offset()
2088569
>>> from nltk.corpus import wordnet as nltk_wn
>>> nltk_wn.synset('dog.n.01').offset()
2084071
And in raw files:
~/git-stuff/wordnet/wn/data$ grep "2088569" wordnet-3.3/index.noun
canis_familiaris n 1 4 @ #m ~ %p 1 0 02088569
dog n 7 5 ~ %p #m @ #p 7 0 02088569 10143371 10052157 09915066 07696464 03910262 02714792
domestic_dog n 1 4 @ #m ~ %p 1 0 02088569
~/git-stuff/wordnet/wn/data$ grep "2084071" wordnet-3.0/index.noun
canis_familiaris n 1 4 @ ~ #m %p 1 0 02084071
dog n 7 5 @ ~ #m #p %p 7 1 02084071 10114209 10023039 09886220 07676602 03901548 02710044
domestic_dog n 1 4 @ ~ #m %p 1 0 02084071
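To make the comparison above easier to repeat, here is a minimal sketch that pulls the synset offsets out of an index.noun line. It assumes the standard WordNet index layout (lemma, POS, synset count, pointer count, pointer symbols, sense count, tagged-sense count, then the offsets); `parse_index_line` is a hypothetical helper written for illustration, not part of wn or NLTK. The two sample lines are the `dog` entries copied from the grep output above.

```python
def parse_index_line(line):
    """Split one WordNet index.<pos> line into (lemma, pos, offsets)."""
    fields = line.split()
    lemma, pos = fields[0], fields[1]
    synset_cnt = int(fields[2])
    # Assumption: the last `synset_cnt` fields are the zero-padded synset offsets.
    offsets = [int(field) for field in fields[-synset_cnt:]]
    return lemma, pos, offsets


# The `dog` lines from wordnet-3.0/index.noun and wordnet-3.3/index.noun above.
dog_30 = ("dog n 7 5 @ ~ #m #p %p 7 1 02084071 10114209 10023039 "
          "09886220 07676602 03901548 02710044")
dog_33 = ("dog n 7 5 ~ %p #m @ #p 7 0 02088569 10143371 10052157 "
          "09915066 07696464 03910262 02714792")

_, _, offsets_30 = parse_index_line(dog_30)
_, _, offsets_33 = parse_index_line(dog_33)

# Pairing by position only illustrates that the numbers differ; it is NOT
# a reliable 3.0 -> 3.3 mapping, since sense order may change between versions.
for old, new in zip(offsets_30, offsets_33):
    print(old, new, "same" if old == new else "different")
```

Note that the offsets are parsed as ints here, so the leading zero in the file (02084071) is dropped, which matches what `.offset()` returns in the sessions above.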
Yes, at some point I think OMW plans to update to a newer index, but I'm not sure exactly when or for which version. @fcbond should know.
Any idea why the synset indices would have moved?
And what's the behavior of the indexing system? I.e., if a synset gets a new offset, is the old one obsolete? Are there any docs on how these indices/offsets are organized? @jmccrae @fcbond
Also, is there an official mapping from 3.0 -> 3.3?
Maybe backing off to WN 3.0 is sound for now, esp. since the info content data are based on the 3.0 indices.
Not sure I understand enough about this project to comment too much. There is no Princeton WordNet 3.3 (or 3.2 for that matter). The last version is 3.1. There is an English WordNet release that has built on Princeton WordNet, so maybe this is what you are referring to.
At any rate, by design each new version of Princeton WordNet changes all the synset identifier numbers (not a good design).
Thanks @jmccrae for the clarification!!!
Oh, https://github.com/globalwordnet/english-wordnet is not Princeton? I was using the link from that repo, which pointed to (broken now, though) http://server1.nlp.insight-centre.org/enwordnet-update/english-wordnet-3.3.zip , assuming that it was Princeton @_@
English WordNet is not Princeton (it is a fork, as described in the README). The link you are using points to a live (and continuously changing) development version of the English WordNet. The link should be back in a few minutes, as it is auto-generated from commits.
Got it @jmccrae. Thank you again for helping to clarify!
@goodmami Backing off all versions to Princeton 3.0 first so that compatibility with OMW is kept, and we'll try to put in plans to incorporate other wordnets, e.g. English WordNet, in the future.
Resolved by #21; the latest version, wn==0.0.23, should resolve this.
Thank you @goodmami @jmccrae !!
When I try to request a synset in another language (I've tried jpn and spa), I get the following error:
I confirmed that 10641755-n does exist in the corresponding .tab file and that the line is read in, but I wasn't able to figure out why it doesn't make it to the internal dictionaries. I thought it might be a failure to cast the offset to an int in some place, but that didn't seem to be the case. The abuse of __builtins__ made the code a bit hard to follow.
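For anyone debugging the same thing, here is a rough sketch of the kind of check described above: read a .tab line, split the synset ID into offset and POS, and key the dictionary by an int offset so that a zero-padded string and an int cannot silently miss each other. This is not how wn does it internally. The three-column layout (synset ID, `lang:lemma` type, lemma), the helper names `parse_tab_line` and `build_index`, and the lemma values are all assumptions made up for illustration.

```python
def parse_tab_line(line):
    """Return (offset, pos, lemma) for a data line, or None for comments/blanks."""
    if line.startswith("#") or not line.strip():
        return None
    # Assumed layout: "<offset>-<pos>\t<lang>:lemma\t<lemma>"
    synset_id, _relation, lemma = line.rstrip("\n").split("\t")[:3]
    offset_str, pos = synset_id.split("-")
    # Cast to int so "02084071" and 2084071 end up as the same key.
    return int(offset_str), pos, lemma


def build_index(lines):
    """Map (offset, pos) -> list of lemmas, skipping comment lines."""
    index = {}
    for line in lines:
        parsed = parse_tab_line(line)
        if parsed is None:
            continue
        offset, pos, lemma = parsed
        index.setdefault((offset, pos), []).append(lemma)
    return index


# Placeholder lines; the lemma strings are invented, not real OMW data.
sample_lines = [
    "# hypothetical header line\n",
    "10641755-n\tjpn:lemma\tplaceholder_lemma\n",
    "02084071-n\tjpn:lemma\tanother_placeholder\n",
]
index = build_index(sample_lines)

print((10641755, "n") in index)    # int key built from the tab file
print(("02084071", "n") in index)  # a string key would miss the entry
```

If a lookup like the second one shows up anywhere between reading the file and filling the internal dictionaries, the entry is read in but never found, which would look a lot like the symptom described here.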