goodmami / wn

A modern, interlingual wordnet interface for Python
https://wn.readthedocs.io/
MIT License

Most common word sense #111

Closed williamjr closed 3 years ago

williamjr commented 3 years ago

Hello,

Thank you for this fantastic project. I'm curious whether (and how) wn exposes information about the frequency of a sense. My use case is that I'd like to look up a potentially polysemous word and return its most common sense, as a baseline approach to word sense disambiguation. Apologies if I've missed this in the documentation.

Thank you!

goodmami commented 3 years ago

Hi, thanks for the kind words.

Good and bad news.

Sense frequencies are in fact modeled; you only missed them in the documentation because I forgot to add the relevant method to it. Thanks for bringing attention to this. I've now pushed a fix for the docs in 44d01d4b1e56c36ef45c816d8c8733cde5f69104, so you can now see the API documentation for Sense.counts() (you may need to refresh the page if the old one is cached).

The bad news is that this depends on the lexicon actually containing the count information, which none of the wordnets indexed by Wn do. I believe these counts are used internally by the Open Multilingual Wordnet and are not currently exported in the files that Wn uses.
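
For illustration, here's a minimal sketch of what calling the method looks like (using the indexed 'pwn:3.0' lexicon; the empty list reflects the missing count data, not an error):

>>> import wn
>>> wn.download('pwn:3.0')
>>> dog = wn.Wordnet('pwn:3.0').senses('dog', pos='n')[0]
>>> dog.counts()  # would be a list of counts if the lexicon provided them
[]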

Does this help?

williamjr commented 3 years ago

Thank you! It helps to know that this is accounted for in the wn API.

That said, if none of its indexed lexicons actually supply the data that the counts() method would expose, then it seems I am no better off when it comes to finding the most frequent sense...

Any chance you have a lead on how/where to source sense frequencies for many languages? I would consider writing a wrapper if these were accessible.

goodmami commented 3 years ago

Getting sense counts requires corpora annotated with disambiguated senses. Such corpora exist for very few languages, and the counts don't transfer well across languages, since the sense annotations are tied to a specific word in the original language and not just to a concept.

The NLTK's wordnet module has the Lemma.count() method, which reads the WNDB's cntlist.rev file that comes with the Princeton WordNet. One could create a wrapper for this file, but the sense keys used by cntlist.rev are not (yet?) the primary IDs for senses in the WN-LMF files that Wn reads; instead, the sense keys are tucked into the senses' metadata, so retrieving a sense by sense key is not straightforward.
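
For comparison, here's roughly what that looks like in the NLTK (a quick sketch; it assumes you've already fetched the wordnet corpus with nltk.download('wordnet'), and the number is the tag count read from cntlist.rev):

>>> from nltk.corpus import wordnet as nltk_wn
>>> canine = nltk_wn.synsets('dog', pos='n')[0]  # dog.n.01, the canine sense
>>> canine.lemmas()[0].count()
42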

Also, the WNDB documentation page contains this disclaimer:

Princeton no longer maintains or releases the Semantic Concordance files. The cntlist file used to order the senses in WordNet 3.0 was generated from the Semantic Concordance files at the point that they were last updated in 2001. In general, the order of senses presented usually reflects what the user would expect, however, sense ordering is now less reliable than in prior releases and should not be construed as an accurate indicator of frequency of use.

This caveat generalizes to any sense counts: they are only representative of the data used to produce them. They become dated, they are incomplete (not all senses have counts), and they reflect only the domains of that data.

There is also the NTU Multilingual Corpus, which has sense annotations for multiple languages, but I don't see any way to download the data. You'd have to ask Francis (his email address is on the NTUMC website) for instructions.

This all may sound a bit discouraging, but if you have the cntlist.rev file and the Princeton WordNet 3.0 loaded in Wn, you can wrap it as follows, provided you only care about a slightly dated snapshot of English. Since the counts are keyed by sense keys, and the sense keys live in the senses' metadata (as is currently the case), you'll need a mapping from sense keys to sense IDs. First, make sure you have the PWN 3.0 wordnet:

>>> import wn
>>> wn.download('pwn:3.0')  # fetch the lexicon and add it to Wn's database
>>> pwn = wn.Wordnet('pwn:3.0')

Here's an example of the data we'll use:

>>> dog = pwn.senses('dog', pos='n')[0]
>>> dog.metadata()
{'identifier': 'dog%1:05:00::'}
>>> dog.id
'pwn-dog-n-hs-02084071'

Create the mapping like this:

>>> sense_key_map = {
...     val: s.id  # map sense key -> Wn sense ID
...     for s in pwn.senses()
...     for key, val in s.metadata().items()
...     if key == 'identifier'
... }
>>> pwn.sense(sense_key_map['dog%1:05:00::'])
Sense('pwn-dog-n-hs-02084071')

Then you can read the cntlist.rev file like this:

>>> import os
>>> counts = {}
>>> # each line of cntlist.rev is: sense_key sense_number tag_count
>>> with open(os.path.expanduser('~/nltk_data/corpora/wordnet/cntlist.rev')) as f:
...     for line in f:
...         key, _, cnt = line.rstrip().split()
...         counts[key] = int(cnt)
... 
>>> counts['dog%1:05:00::']
42
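
Putting it all together, a hypothetical most-frequent-sense helper might look like the following (most_common_sense() is not part of Wn's API, just a sketch over the pwn and counts objects built above); given the counts, it should pick out the canine sense of "dog":

>>> def most_common_sense(word, pos=None):
...     """Return the sense of *word* with the highest cntlist.rev count."""
...     senses = pwn.senses(word, pos=pos)
...     if not senses:
...         return None
...     # senses without a count entry fall back to 0
...     return max(senses,
...                key=lambda s: counts.get(s.metadata().get('identifier'), 0))
... 
>>> most_common_sense('dog', pos='n')
Sense('pwn-dog-n-hs-02084071')
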
goodmami commented 3 years ago

I'm going to close this issue. Feel free to reopen or just comment if you have follow-up questions.