goodmami / wn

A modern, interlingual wordnet interface for Python
https://wn.readthedocs.io/
MIT License

Is there any mapping between different English wordnet? #176

Closed rudaoshi closed 1 year ago

rudaoshi commented 1 year ago

There have been many English wordnets, and I wonder whether there is any mapping between the synset ids of these wordnets, for example, oewn/ewn <-> omw.

If there is, please tell me how to get the mapping.

Thank you ~

fcbond commented 1 year ago

Hi,

the different versions of the Princeton wordnet use sensekeys to link senses: they are meant to be stable between versions, although there have been occasional surprises (sometimes different capitalization has caused issues). OEWN and OMW use the ILI keys to link synsets:

Francis Bond, Piek Vossen, John McCrae, and Christiane Fellbaum (2016) CILI: the Collaborative Interlingual Index. In Proceedings of the 8th Global WordNet Conference (GWC2016), Bucharest. pp 50–57 https://aclanthology.org/2016.gwc-1.9/


-- Francis Bond https://fcbond.github.io/

goodmami commented 1 year ago

@rudaoshi, to add to what @fcbond said, in Wn you can use the ili member of a synset to see equivalent synsets across versions or even across lexicons for another language:

>>> import wn
>>> oewn = wn.Wordnet('oewn')
>>> wn30 = wn.Wordnet('omw-en')
>>> oewn.synsets('penumbra')[0].ili
ILI('i110430')
>>> wn30.synsets('penumbra')[0].ili
ILI('i110430')
>>> wn30.synsets(ili='i110430')[0].lemmas()
['penumbra']
>>> wnja = wn.Wordnet('omw-ja')
>>> wnja.synsets(ili='i110430')[0].lemmas()
['半影']

For the omw-en lexicons (which are directly converted from the Princeton WordNet with very few changes), the sensekeys are available as the identifier metadata of senses, but these are not available for other lexicons:

>>> wn30.senses('penumbra')[0].metadata()
{'identifier': 'penumbra%1:26:00::'}
>>> oewn.senses('penumbra')[0].metadata()
{}
>>> wnja.senses('半影')[0].metadata()
{}
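
The ILI lookup shown above can be turned into an explicit id-to-id mapping between two lexicons by joining their synsets on the ILI. The following is only a minimal sketch, not part of Wn: `ili_key` and `map_synsets` are hypothetical helper names, and the code assumes that each synset has an `id` attribute and an `ili` attribute that is either `None`, a plain string, or an object with an `id` attribute (which of these applies may depend on the Wn version).

```python
def ili_key(synset):
    """Return the synset's ILI identifier as a string, or None if it has none."""
    ili = getattr(synset, "ili", None)
    if ili is None:
        return None
    # Assumption: `ili` is either a plain string or an object exposing
    # the identifier as `.id`; a string has no `.id`, so it is returned as-is.
    return getattr(ili, "id", ili)


def map_synsets(source_synsets, target_synsets):
    """Map each source synset id to the target synset ids sharing its ILI."""
    # Index the target synsets by ILI, skipping synsets without one.
    by_ili = {}
    for ss in target_synsets:
        key = ili_key(ss)
        if key is not None:
            by_ili.setdefault(key, []).append(ss.id)

    # Look up each source synset; synsets with no shared ILI are left out.
    mapping = {}
    for ss in source_synsets:
        key = ili_key(ss)
        if key in by_ili:
            mapping[ss.id] = by_ili[key]
    return mapping
```

With the lexicons from the session above, a call such as `map_synsets(wn30.synsets(), oewn.synsets())` should give a dictionary from omw-en synset ids to lists of oewn synset ids. Synsets that have no ILI, or whose ILI is missing from the other lexicon, simply do not appear in the result.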

ekaf commented 1 year ago

Thanks @goodmami and @fcbond. I did not understand this correctly before, but I think I am now getting a more accurate picture of the implicit "mapping" in Wn. It seems that Wn does no mapping by itself, but loads resources that were previously mapped to ILI. This mapping was done by external projects: OMW mapped the multilingual wordnets using the ili-map-pwn30.tab file from CILI-1.0, while OEWN used the corresponding pwn31 mapping. Joining these mappings gives an intersection of 117583 identifiers, while the recall in OEWN 2021 is only 117441.

import wn

def ili_loss(wnstring1, wnstring2):
    # WN 1: load the first lexicon and collect its ILIs
    wn1 = wn.Wordnet(wnstring1)
    v1 = wn1.lexicons()
    i1 = wn1.ilis()
    n1 = len(i1)
    print(f"{v1}: {n1} synsets")
    # WN 2: load the second lexicon and collect its ILIs
    wn2 = wn.Wordnet(wnstring2)
    v2 = wn2.lexicons()
    i2 = wn2.ilis()
    n2 = len(i2)
    print(f"{v2}: {n2} synsets")
    # Intersection: ILIs present in both lexicons
    ii = set(i1).intersection(i2)
    ni = len(ii)
    print(f"Intersection: {ni} synsets")
    # Loss: ILIs of the first lexicon that are missing from the second
    loss = n1 - ni
    pct = 100 * loss / n1
    print(f"Loss: {loss} synsets ({round(pct, 2)})%")

ili_loss('omw-en', 'oewn')

[<Lexicon omw-en:1.4 [en]>]: 117659 synsets
[<Lexicon oewn:2021 [en]>]: 120039 synsets
Intersection: 117441 synsets
Loss: 218 synsets (0.19)%

ili_loss('omw-ja', 'oewn')

[<Lexicon omw-ja:1.4 [ja]>]: 57184 synsets
[<Lexicon oewn:2021 [en]>]: 120039 synsets
Intersection: 57076 synsets
Loss: 108 synsets (0.19)%

ili_loss('omw-arb', 'oewn')

[<Lexicon omw-arb:1.4 [arb]>]: 9916 synsets
[<Lexicon oewn:2021 [en]>]: 120039 synsets
Intersection: 9887 synsets
Loss: 29 synsets (0.29)%

I suppose that a part (though not all) of this difference can be attributed to #179.
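
For reference, the join of the two CILI mapping files mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: `read_ili_map` and `join_on_ili` are hypothetical helper names, and the files are assumed to hold one tab-separated pair per line, with the ILI identifier in the first column and the wordnet synset offset in the second.

```python
def read_ili_map(lines):
    """Parse 'ILI<TAB>synset-offset' lines into a dict keyed by ILI."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip lines that do not hold a pair
        # Assumed column order: ILI identifier first, synset offset second.
        mapping[fields[0]] = fields[1]
    return mapping


def join_on_ili(map_a, map_b):
    """Return {ili: (offset_a, offset_b)} for the ILIs present in both maps."""
    shared = map_a.keys() & map_b.keys()
    return {ili: (map_a[ili], map_b[ili]) for ili in shared}
```

Passing an open file handle for ili-map-pwn30.tab and one for the corresponding pwn31 file to `read_ili_map`, and then taking `len(join_on_ili(a, b))`, should give the size of the intersection if the assumed file layout holds.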

goodmami commented 1 year ago

It seems like the original question has been answered.