Closed rudaoshi closed 1 year ago
Hi,
The different versions of the Princeton WordNet use sensekeys to link senses: they are meant to be stable across versions, although there have been occasional surprises (sometimes differences in capitalization have caused issues). OEWN and OMW use the ILI keys to link synsets:
Francis Bond, Piek Vossen, John McCrae, and Christiane Fellbaum (2016) CILI: the Collaborative Interlingual Index. In Proceedings of the 8th Global WordNet Conference (GWC2016), Bucharest. pp 50–57 https://aclanthology.org/2016.gwc-1.9/
On Tue, 18 Oct 2022 at 15:15, 孙明明 wrote:
There have been many English wordnets, and I wonder whether there is any mapping between the ids of synsets in these wordnets, for example, oewn/ewn <-> omw.
If there is, please tell me how to get the mapping.
Thank you ~
-- Francis Bond https://fcbond.github.io/
@rudaoshi, to add to what @fcbond said, in Wn you can use the ili member of a synset to see equivalent synsets across versions or even across lexicons for another language:
>>> import wn
>>> oewn = wn.Wordnet('oewn')
>>> wn30 = wn.Wordnet('omw-en')
>>> oewn.synsets('penumbra')[0].ili
ILI('i110430')
>>> wn30.synsets('penumbra')[0].ili
ILI('i110430')
>>> wn30.synsets(ili='i110430')[0].lemmas()
['penumbra']
>>> wnja = wn.Wordnet('omw-ja')
>>> wnja.synsets(ili='i110430')[0].lemmas()
['半影']
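The lookups above can be turned into an explicit id-to-id table by joining two lexicons on their ILI. The following is a minimal sketch, not part of Wn itself: the helper names synset_ili_pairs and map_by_ili are hypothetical, the lexicons are assumed to be downloaded already, and the .ili value is assumed to be either an object with an .id attribute (as in the session above) or a plain string.

```python
from collections import defaultdict


def synset_ili_pairs(lexicon_spec):
    """Yield (synset_id, ili_id) for every synset of a lexicon that has an ILI.

    Hypothetical helper; assumes the lexicon is already in the local Wn database.
    """
    import wn  # imported lazily so map_by_ili() can be used without Wn

    for synset in wn.Wordnet(lexicon_spec).synsets():
        ili = synset.ili
        # Assumption: .ili is an object with an .id attribute, or a string.
        ili_id = getattr(ili, "id", ili)
        if ili_id is not None:
            yield synset.id, ili_id


def map_by_ili(pairs_a, pairs_b):
    """Map each synset id in A to the synset ids in B that share its ILI."""
    b_by_ili = defaultdict(list)
    for synset_id, ili_id in pairs_b:
        b_by_ili[ili_id].append(synset_id)
    return {
        synset_id: list(b_by_ili[ili_id])
        for synset_id, ili_id in pairs_a
        if ili_id in b_by_ili
    }


# Intended use (requires Wn and the downloaded lexicons):
# mapping = map_by_ili(synset_ili_pairs('omw-en'), synset_ili_pairs('oewn'))
```

Synsets without an ILI, or whose ILI does not occur in the other lexicon, are simply left out of the result.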
For the omw-en lexicons (which are directly converted from the Princeton WordNet with very few changes), the sensekeys are available as the identifier metadata of senses, but these are not available for other lexicons:
>>> wn30.senses('penumbra')[0].metadata()
{'identifier': 'penumbra%1:26:00::'}
>>> oewn.senses('penumbra')[0].metadata()
{}
>>> wnja.senses('半影')[0].metadata()
{}
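A sensekey such as the one shown for omw-en packs several fields into one string: lemma%ss_type:lex_filenum:lex_id:head_word:head_id. The sketch below splits a key into those fields, which can help when comparing senses between Princeton-derived lexicons. The names SenseKey and parse_sensekey are made up for illustration and are not part of Wn.

```python
from typing import NamedTuple

# ss_type codes used in Princeton WordNet sense keys
SS_TYPES = {1: "n", 2: "v", 3: "a", 4: "r", 5: "s"}


class SenseKey(NamedTuple):
    lemma: str
    pos: str
    lex_filenum: int
    lex_id: int
    head_word: str  # only filled for adjective satellites
    head_id: str    # only filled for adjective satellites


def parse_sensekey(key: str) -> SenseKey:
    """Split a sense key into its fields (hypothetical helper)."""
    # Split on the last '%' so a '%' inside the lemma would not break parsing.
    lemma, _, lex_sense = key.rpartition("%")
    ss_type, lex_filenum, lex_id, head_word, head_id = lex_sense.split(":")
    return SenseKey(
        lemma, SS_TYPES[int(ss_type)], int(lex_filenum), int(lex_id),
        head_word, head_id,
    )


print(parse_sensekey("penumbra%1:26:00::"))
```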
Thanks @goodmami and @fcbond. I did not understand this correctly before, but I think I am starting to get a more accurate picture of the implicit "mapping" in Wn. It seems that Wn does no mapping by itself but loads resources that were previously mapped to ILI. This mapping was done by external projects: OMW mapped the multilingual wordnets using the ili-map-pwn30.tab file from CILI-1.0, while OEWN used the corresponding pwn31 mapping. Joining these mappings gives an intersection of 117583 identifiers, while the number actually recovered from OEWN 2021 is only 117441.
import wn

def ili_loss(wnstring1, wnstring2):
    # WN 1:
    wn1 = wn.Wordnet(wnstring1)
    v1 = wn1.lexicons()
    i1 = wn1.ilis()
    n1 = len(i1)
    print(f"{v1}: {n1} synsets")
    # WN 2:
    wn2 = wn.Wordnet(wnstring2)
    v2 = wn2.lexicons()
    i2 = wn2.ilis()
    n2 = len(i2)
    print(f"{v2}: {n2} synsets")
    # Intersection:
    ii = set(i1).intersection(i2)
    ni = len(ii)
    print(f"Intersection: {ni} synsets")
    # Loss: ILIs of WN 1 that are not found in WN 2
    loss = n1 - ni
    pct = 100 * loss/n1
    print(f"Loss: {loss} synsets ({round(pct,2)})%")

ili_loss('omw-en', 'oewn')
[<Lexicon omw-en:1.4 [en]>]: 117659 synsets
[<Lexicon oewn:2021 [en]>]: 120039 synsets
Intersection: 117441 synsets
Loss: 218 synsets (0.19)%
ili_loss('omw-ja', 'oewn')
[<Lexicon omw-ja:1.4 [ja]>]: 57184 synsets
[<Lexicon oewn:2021 [en]>]: 120039 synsets
Intersection: 57076 synsets
Loss: 108 synsets (0.19)%
ili_loss('omw-arb', 'oewn')
[<Lexicon omw-arb:1.4 [arb]>]: 9916 synsets
[<Lexicon oewn:2021 [en]>]: 120039 synsets
Intersection: 9887 synsets
Loss: 29 synsets (0.29)%
I suppose that a part (though not all) of this difference can be attributed to #179.
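To see which identifiers account for the loss, rather than only how many there are, the counts can be replaced by a set difference. This is a rough sketch under the same assumptions as ili_loss: missing_ilis and lexicon_ilis are hypothetical names, and the ILI values are assumed to be either objects with an .id attribute or plain strings.

```python
def missing_ilis(ilis_a, ilis_b):
    """Return the sorted ILI ids that occur in A but not in B."""
    def ids(ilis):
        # Accept ILI objects (with .id) or plain strings; drop missing ids.
        return {getattr(ili, "id", ili) for ili in ilis} - {None}

    return sorted(ids(ilis_a) - ids(ilis_b))


def lexicon_ilis(lexicon_spec):
    """Collect the ILIs of a lexicon, using the same call as ili_loss."""
    import wn  # imported lazily so missing_ilis() can be used without Wn

    return wn.Wordnet(lexicon_spec).ilis()


# Intended use (requires Wn and the downloaded lexicons):
# for ili_id in missing_ilis(lexicon_ilis('omw-en'), lexicon_ilis('oewn')):
#     print(ili_id)
```

The resulting ids can then be looked up in the source lexicon with synsets(ili=...) to inspect the affected synsets.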
It seems like the original question has been answered.