Addition to NLTK migration guide w.r.t. offsets

BramVanroy commented 1 year ago

Is your feature request related to a problem? Please describe.

Hello

I have access to WordNet synset offset IDs that I retrieve from an API (key: wnSynsetOffset). They look like this: wn:00981304a. It is relatively straightforward to get the corresponding synsets through NLTK:

from nltk.corpus import wordnet as nltk_wn

offset = "wn:00981304a"
offset_id = int(offset.split(":")[-1][:-1])
pos = offset[-1]
syns = nltk_wn.synset_from_pos_and_offset(pos, offset_id)

However, it is not clear to me how I can convert this approach to wn. I like the API of wn more and I would like to make use of the translate feature specifically, so that is why I want to make the transition.

Describe the solution you'd like

Perhaps a description in the documentation? I think that this section is relevant, but it is not clear to me how to apply it to a use case, so a real-world example would be helpful.

Describe alternatives you've considered

I have tried the following manipulations but none of them work (yielding empty synset lists):


wn.synsets("wn:00981304a")
wn.synsets("00981304a")
wn.synsets("981304a")
wn.synsets("981304", pos="a")

fcbond commented 1 year ago

Hi,

if you have a wordnet derived from PWN 3.0 with the same offsets, then it can be done as follows:

>>> import wn
>>> ewn = wn.Wordnet('omw-en:1.4')
>>> ewn.synset('omw-en-00981304-s')
Synset('omw-en-00981304-s')

Many people (including omw 1.0) treat all satellite adjectives (pos 's') as adjectives (pos 'a'). wn does not, so if you look up something with pos 'a' and it doesn't work, then it is worth also looking up 's'. So something like the following should get you what you want.

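def offset2synset(wordnet, offset):
    # offset looks like 'wn:00981304a': drop the 'wn:' prefix, split off the pos
    number, pos = offset[3:-1], offset[-1]
    try:
        return wordnet.synset(f'omw-en-{number}-{pos}')
    except wn.Error:
        if pos != 'a':
            return None
    # pos 'a' may really be a satellite adjective, so retry with 's'
    try:
        return wordnet.synset(f'omw-en-{number}-s')
    except wn.Error:
        return None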

>>> print(offset2synset(ewn, 'wn:00981304a'))
Synset('omw-en-00981304-s')
>>> print(offset2synset(ewn, 'wn:02001858v'))
Synset('omw-en-02001858-v')

goodmami commented 1 year ago

@BramVanroy thanks for the good questions (here and on the https://github.com/goodmami/penman project, too :wave:). I agree that the documentation could be improved in this area, possibly in the NLTK migration guide.

And thanks, @fcbond, for the good description and solution.

The basic problem is that synset offsets (which are specific to each wordnet version) are not an inherent part of the WN-LMF formatted lexicons that Wn uses. For some lexicons (mainly the omw- ones), however, the WordNet 3.0 offsets are conventionally used in the synset identifiers, so you just need to reformat the identifier appropriately, as @fcbond demonstrated.
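As a rough illustration of that reformatting step, here is a minimal sketch that only builds the candidate identifier strings and does not touch Wn at all. The function name offset_to_synset_ids and the regular expression are hypothetical, and the default "omw-en" prefix assumes a lexicon that follows the convention described above.

```python
import re
from typing import List

# Hypothetical pattern for strings such as "wn:00981304a": an optional "wn:"
# prefix, up to eight digits, and a one-letter part of speech.
_OFFSET_RE = re.compile(r"^(?:wn:)?(\d{1,8})([nvars])$")


def offset_to_synset_ids(offset: str, prefix: str = "omw-en") -> List[str]:
    """Return candidate synset IDs for an offset string, most likely first."""
    match = _OFFSET_RE.match(offset)
    if match is None:
        raise ValueError(f"unrecognized offset string: {offset!r}")
    digits, pos = match.groups()
    # Zero-pad the offset to eight digits, as in 'omw-en-00981304-s'.
    candidates = [f"{prefix}-{int(digits):08d}-{pos}"]
    if pos == "a":
        # Sources that fold satellite adjectives into 'a' may really mean 's'.
        candidates.append(f"{prefix}-{int(digits):08d}-s")
    return candidates
```

Each returned string could then be passed to the synset() method of a wn.Wordnet object, trying the candidates in order.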

Note that I also have an unmerged nltk branch that tries to implement NLTK's API as a shim on top of Wn, and its of2ss() function is implemented using the same wn.util.synset_id_formatter() function you linked to above:

https://github.com/goodmami/wn/blob/5092e62784e3279295385436efffa5b5a5ab0346/wn/nltk_api.py#L329-L342

@fcbond said:

Many people (including omw 1.0) treat all satellite adjectives (pos 's') as adjectives (pos 'a'). wn does not

This is not entirely true. Wn does conflate s and a in the wn.ic, wn.morphy, wn.similarity, and wn.taxonomy modules, but it's true that it does not do so in the standard synset-lookup functions.
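Because the standard lookup does not conflate the two, a caller has to try both identifiers itself. The following is only a sketch of that fallback pattern: first_found is a hypothetical helper, and a plain dict stands in for the lexicon so that the example runs without any wordnet data. With Wn, the lookup argument would presumably be the synset method of a wn.Wordnet object, and not_found would be wn.Error.

```python
from typing import Callable, Iterable, Optional, Type, TypeVar

T = TypeVar("T")


def first_found(
    lookup: Callable[[str], T],
    synset_ids: Iterable[str],
    not_found: Type[Exception] = KeyError,
) -> Optional[T]:
    """Return the first successful lookup, or None if every ID is missing."""
    for synset_id in synset_ids:
        try:
            return lookup(synset_id)
        except not_found:
            continue
    return None


# Stand-in for a lexicon: a dict whose __getitem__ raises KeyError on a miss.
fake_lexicon = {"omw-en-00981304-s": "Synset('omw-en-00981304-s')"}
result = first_found(
    fake_lexicon.__getitem__,
    ["omw-en-00981304-a", "omw-en-00981304-s"],
)
print(result)  # Synset('omw-en-00981304-s')
```

Passing the exception type explicitly keeps the helper from silently swallowing unrelated errors, which a bare except would do.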

BramVanroy commented 1 year ago

Hello @fcbond and @goodmami

First, thanks for the help! I settled for this:

import logging
from typing import Optional

import wn


def offset2omw_synset(wnet: wn.Wordnet, offset: str) -> Optional[wn.Synset]:
    # Strip the "wn:" prefix and left-pad to 8 digits plus the pos letter.
    offset = offset.replace("wn:", "").zfill(9)
    wnid = f"omw-en-{offset[:-1]}-{offset[-1]}"
    wnid_s = None

    try:
        return wnet.synset(wnid)
    except wn.Error:
        if wnid[-1] == "a":
            # Satellite adjectives are stored with pos "s", so retry with it.
            wnid_s = f"{wnid[:-2]}-s"
            try:
                return wnet.synset(wnid_s)
            except wn.Error:
                pass

    logging.warning(f"Could not find offset {offset} ({wnid}{' or ' + wnid_s if wnid_s else ''}) in {wnet.lexicons()}")
    return None

I looked at the NLTK branch @goodmami and while I think that would be very useful, I just needed a quick function that I could easily plug into my code (without having to install from GitHub). But I think it'd be a useful API to have - although I can imagine it is a lot of work!

And thank you for your work. It seems a coincidence that you are providing exactly the tools that I need for my work. I am very thankful and motivated that you created these libraries - and that they work so well and are well-documented! I've also peeked at the internals/API and documentation to inspire my own work, so a big thank you!

goodmami commented 1 year ago

Thanks for the kind words, @BramVanroy! And I'm glad you were able to find a solution. I'm going to keep the issue open because, as the issue title states, I think this sort of information would be useful in the documentation, so the issue should be closed when that happens.

goodmami / wn

Addition to NLTK migration guide w.r.t. offsets #183