nltk / nltk_data

NLTK Data

Fix WordNet 3.0 gloss inconsistencies #160

Open genericallyterrible opened 2 years ago

genericallyterrible commented 2 years ago

@fcbond, @stevenbird There are several consistency issues in the gloss portions of WordNet 3.0 that make parsing difficult (https://github.com/nltk/nltk/issues/2527#issuecomment-917009092). Would it be possible for us to fix these manually without breaking word associations, as happened with the problems currently facing the update to WordNet 3.1 (https://github.com/nltk/nltk_data/issues/18)?

fcbond commented 2 years ago

G'day,

We have tried to fix these in the new English wordnet, and there is a good Python interface:

https://github.com/globalwordnet/english-wordnet
https://pypi.org/project/wn/

I think it makes more sense to move to this than to try to backport fixes to 3.0.

-- Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University

goodmami commented 2 years ago

> Would it be possible for us to manually fix these issues without breaking word associations [...]

Replying specifically to this: It is incredibly difficult to alter WNDB data without breaking things, because the synset IDs are byte offsets into the file, so any modified gloss has to have exactly the same number of bytes as before. Secondly, we're not allowed to change the Princeton WordNet data and still call it by that name (it would have to be called the "NLTK Wordnet of English" or something).
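To make the byte-offset constraint concrete, here is a minimal sketch on toy data. The two records below are invented and only mimic the one WNDB property that matters here, namely that each line begins with its own 8-digit byte offset; `index_by_offset` and `stale_ids` are hypothetical helper names, not part of the NLTK or Wn.

```python
def index_by_offset(data):
    """Map the byte offset at which each line starts to that line."""
    index, pos = {}, 0
    for line in data.splitlines(keepends=True):
        index[pos] = line
        pos += len(line)
    return index


def stale_ids(data):
    """Return IDs (the leading 8-digit field) that no longer match the
    real byte offset of the line they appear on."""
    return [int(line[:8]) for pos, line in index_by_offset(data).items()
            if int(line[:8]) != pos]


# Two toy records (NOT real WordNet data); the second one's ID is
# simply the length of the first record.
first = b"00000000 n dog | a domestic animal\n"
second_offset = len(first)
second = b"%08d n cat | a small feline\n" % second_offset
original = first + second

# Lengthening the first gloss by 4 bytes shifts the second record,
# so its embedded ID no longer points at it.
edited = original.replace(b"a domestic animal", b"a domesticated animal")

# A same-length edit leaves every offset valid.
same_length = original.replace(b"a domestic animal", b"a DOMESTIC animal")

print(stale_ids(original))     # no stale IDs
print(stale_ids(edited))       # the second record's ID is now stale
print(stale_ids(same_length))  # no stale IDs
```

In a real WNDB file the same offsets also appear in the index files and in the pointers between synsets, so a single length-changing edit would invalidate references well beyond the edited line.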

> the problems currently facing the update to WordNet 3.1?

That issue was closed 2 years ago, which suggests to me that there are no plans to add WordNet 3.1 to the NLTK. There was an attempt at adding next-generation wordnet support to, or alongside, the NLTK (see https://github.com/nltk/wordnet), and it included WordNet 3.1 data as an option. Development stalled, however, so I took over the effort (and package name on PyPI) with an entirely new module, which Francis has linked above.

stevenbird commented 2 years ago

@goodmami, thanks for the update. This sounds like a more sustainable option. How easily could a user of the NLTK wordnet package port their code to use your package? Does it include the similarity metrics?

fcbond commented 2 years ago

Hi,

In general, I think it is quite easy to port the code. The documentation has some notes on migrating from the current interface: https://wn.readthedocs.io/en/latest/guides/nltk-migration.html

It does have the similarity metrics: https://wn.readthedocs.io/en/latest/api/wn.similarity.html

@goodmami did a lot of work :-).

goodmami commented 2 years ago

Thanks, @fcbond!

@stevenbird, Wn has the similarity metrics, information content (it even reads the wordnet_ic files from nltk_data), Morphy, etc. Some absent features that may be desired are lookup by sense key (e.g., eat%2:34:02::, for which there is a workaround) and lookup by the NLTK's shorthand synset identifiers (e.g., feed.v.06). If you wish to discuss a plan for deprecating the NLTK's wordnet module in favor of Wn, we should open separate issues to track the necessary changes to the code, data, documentation, and book.
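For readers unfamiliar with the two identifier styles mentioned above, the sketch below only takes them apart; it does not perform any lookup in Wn or the NLTK. It assumes the Princeton sense-key layout `lemma%ss_type:lex_filenum:lex_id:head_word:head_id` and the NLTK's `lemma.pos.number` naming. `SenseKey`, `parse_sense_key`, and `parse_synset_name` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple

# Princeton ss_type codes: noun, verb, adjective, adverb, adjective satellite.
SS_TYPES = {1: "n", 2: "v", 3: "a", 4: "r", 5: "s"}


class SenseKey(NamedTuple):
    lemma: str
    pos: str
    lex_filenum: int
    lex_id: int
    head_word: str  # only filled for adjective satellites
    head_id: str    # only filled for adjective satellites


def parse_sense_key(key):
    """Split a sense key such as 'eat%2:34:02::' into its fields."""
    lemma, _, lex_sense = key.partition("%")
    ss_type, lex_filenum, lex_id, head_word, head_id = lex_sense.split(":")
    return SenseKey(lemma, SS_TYPES[int(ss_type)],
                    int(lex_filenum), int(lex_id), head_word, head_id)


def parse_synset_name(name):
    """Split an NLTK-style name such as 'feed.v.06' into (lemma, pos, number).

    Splitting from the right keeps lemmas that themselves contain dots intact.
    """
    lemma, pos, number = name.rsplit(".", 2)
    return lemma, pos, int(number)


print(parse_sense_key("eat%2:34:02::"))
print(parse_synset_name("feed.v.06"))
```

Neither identifier is a WNDB byte offset, so both would need a mapping step before they could be resolved against a lexicon.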

Back to the current issue: in the modern WN-LMF format for wordnets, Definition and Example elements are structurally separate, having been split from WNDB's combined "gloss" line in the format-conversion process. That process, however, may not account for the inconsistencies noted by @genericallyterrible, who did a nice and thorough analysis in nltk/nltk#2527. So as to not let that effort go to waste, it might be good to compare it with the WNDB-to-LMF converter. The relevant code is here.
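As a rough illustration of why inconsistent glosses matter for that split, here is a naive splitter that assumes the conventional gloss shape: a definition followed by double-quoted examples separated by semicolons. This is a sketch under that assumption, not the logic of the actual WNDB-to-LMF converter; `split_gloss` is a hypothetical name and the second input is a made-up malformed gloss.

```python
import re

# One double-quoted example with no embedded double quotes.
_QUOTED = re.compile(r'"([^"]*)"')


def split_gloss(gloss):
    """Split a WNDB-style gloss into (definition, [examples]).

    Everything before the first complete quoted span is treated as the
    definition; every quoted span from there on is treated as an example.
    """
    match = _QUOTED.search(gloss)
    if match is None:
        return gloss.strip().rstrip(";").strip(), []
    definition = gloss[:match.start()].strip().rstrip(";").strip()
    examples = [m.group(1).strip()
                for m in _QUOTED.finditer(gloss, match.start())]
    return definition, examples


# Well-formed input: the split works as intended.
print(split_gloss('take in solid food; "She was eating a banana"; '
                  '"What did you eat for dinner last night?"'))

# Hypothetical malformed input: the closing quote is missing, so the
# example is silently left inside the definition.
print(split_gloss('a definition; "an example without a closing quote'))
```

A converter built on similar assumptions would put such entries into the Definition element unnoticed, which is the kind of case the analysis in nltk/nltk#2527 could help surface.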