goodmami / wn

A modern, interlingual wordnet interface for Python
https://wn.readthedocs.io/
MIT License

Add information content (IC) #40

Closed goodmami closed 3 years ago

goodmami commented 3 years ago

Three of the similarity measures require information content (IC) to work. The IC files shipped with NLTK's wordnet data are keyed by synset offset, so those offsets will need to be mapped to whatever identifiers this module uses.
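For context, the three IC-based measures are usually taken to be Resnik, Jiang-Conrath, and Lin. The sketch below shows their textbook formulas, given IC values for two synsets and their lowest common subsumer. The function names are illustrative only and are not this module's API.

```python
import math


def information_content(freq: float, total: float) -> float:
    """IC(c) = -log p(c), where p(c) is estimated as freq / total."""
    return -math.log(freq / total)


def resnik(ic_lcs: float) -> float:
    """Resnik similarity: the IC of the lowest common subsumer (LCS)."""
    return ic_lcs


def jiang_conrath(ic_a: float, ic_b: float, ic_lcs: float) -> float:
    """Jiang-Conrath similarity: inverse of the IC-based distance."""
    distance = ic_a + ic_b - 2 * ic_lcs
    # Identical concepts have zero distance; treat that as infinite similarity.
    return math.inf if distance == 0 else 1 / distance


def lin(ic_a: float, ic_b: float, ic_lcs: float) -> float:
    """Lin similarity: shared IC relative to the IC of both concepts."""
    return 2 * ic_lcs / (ic_a + ic_b)
```

All three need an IC value per synset, which is why the offset-keyed files have to be mapped before any of them can be used.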

fcbond commented 3 years ago

May I suggest ILI numbers?

goodmami commented 3 years ago

I'm not yet sure how IC works, but I'm happy to use that if it's sufficient. With ILIs instead of synset identifiers, I can imagine that two wordnets for the same language (PWN vs. English Wordnet, or perhaps two for Italian or Chinese) could share the same IC for a given corpus.

arademaker commented 3 years ago

What is IC here? May I suggest considering the glosstag version that I am completing: https://github.com/own-pt/glosstag

I would be happy to discuss the best format for releasing it. The most up-to-date branch is AR.

goodmami commented 3 years ago

An IC file has data like this:

6484n 87

Here 6484n is synset offset 6484 with part of speech n in PWN 3.0, and 87 is the value associated with that synset, computed from corpus occurrences of words that belong to it. We can map the synsets to ILIs, which makes the values more portable across English wordnet versions and usable for other languages, but note that these values came from English corpora. That matters for two reasons: we would expect a different distribution of ILIs across corpora in different languages, and the values themselves would be computed differently, because they depend on how many senses each word has (each occurrence increments the value by 1/n, where n is the number of senses of the word). Strictly speaking, the numbers can change across English wordnet versions too, but we wouldn't expect such a drastic change.
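As a rough illustration of the format and the 1/n weighting described above, here is a minimal Python sketch. The helper names and the toy sense inventory are made up for this example, and it only covers what is described in this comment.

```python
from collections import defaultdict


def parse_ic_line(line):
    """Split a line such as '6484n 87' into (offset, pos, value)."""
    fields = line.split()
    key, value = fields[0], fields[1]  # any further fields are ignored here
    return int(key[:-1]), key[-1], float(value)


def count_occurrences(words, senses):
    """Credit each synset of an observed word with 1/n of a count,
    where n is the number of senses (synsets) of that word."""
    counts = defaultdict(float)
    for word in words:
        synsets = senses.get(word, [])
        for synset in synsets:
            counts[synset] += 1 / len(synsets)
    return dict(counts)


# Toy data: 'bank' has two senses, 'river' has one.
toy_senses = {"bank": ["s1", "s2"], "river": ["s3"]}
toy_counts = count_occurrences(["bank", "bank", "river"], toy_senses)
```

Because the weight of each occurrence depends on the sense inventory, the same corpus yields different values when a wordnet splits or merges senses.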

Once this data is mapped and distributed, the next questions are how to work with it in Wn:

goodmami commented 3 years ago

@arademaker Would your glosstag data serve as the source data to generate information content files? Or if you envision some other use, perhaps open another issue as that data seems different from the standard information content files.

goodmami commented 3 years ago

@fcbond In the implementation to be released, I use synset IDs for the internal mapping, with an ID-mapping function that derives those synset IDs from the old offset+pos encoding. ILIs wouldn't work well because not all synsets have ILIs, whereas every synset can receive information content weights. I also think we should discourage (or at least warn against) reusing information content between different wordnets.
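A sketch of what such an ID-mapping function might look like follows. The `{prefix}-{offset:08d}-{pos}` pattern and the `example` prefix are assumptions for illustration; each lexicon defines its own ID scheme, and this is not necessarily the released implementation.

```python
def make_synset_id_mapper(prefix, fmt="{prefix}-{offset:08d}-{pos}"):
    """Return a function that turns (offset, pos) into a synset ID string.

    The default pattern is an assumed convention, not a guaranteed one.
    """
    def get_synset_id(offset, pos):
        return fmt.format(prefix=prefix, offset=offset, pos=pos)
    return get_synset_id


def remap_weights(offset_weights, get_synset_id):
    """Re-key {(offset, pos): weight} by synset ID."""
    return {
        get_synset_id(offset, pos): weight
        for (offset, pos), weight in offset_weights.items()
    }


to_id = make_synset_id_mapper("example")      # hypothetical lexicon prefix
weights = remap_weights({(6484, "n"): 87.0}, to_id)
```

Keeping the mapping as a separate function means the same offset-keyed file can be loaded against any wordnet whose IDs can be derived from offset and part of speech.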

fcbond commented 3 years ago

I agree, that makes sense. In fact, even for different versions of the same wordnet, the number of senses may change, which will affect the calculations, ...


-- Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University