nltk / nltk_data

NLTK Data

Update Extended OMW #183

Open ekaf opened 2 years ago

ekaf commented 2 years ago

This PR updates the "extended_omw" package with additional wordnets from the "wns" folder in the recent OMW 1.4 source release (retrieved from https://github.com/omwn/omw-data/archive/refs/tags/v1.4.zip).

In particular, this PR corrects a large number of errors in the Tosk Albanian ('als'), Standard Arabic ('arb') and Castilian ('spa') Wiktionary wordnets in the 'wikt' folder.

Initially added, then retracted following https://github.com/nltk/nltk_data/pull/183#issuecomment-1042811861 : Persian ("fas") and an alternative Chinese wordnet ("qcn"), which are included in NLTK's "omw" package but were left out of omw-1.4 because of quality concerns (cf. discussions at https://github.com/nltk/nltk_data/pull/171).

Everything in this PR was just copied verbatim from the upstream source release. As a consequence, all folders now include LICENSE and citation.bib files, so that the standard citation() and license() functions return appropriate information about the languages covered in extended_omw.

Sample use, assuming https://github.com/nltk/nltk/pull/2946:

import nltk
from nltk.corpus import wordnet as wn
print(f"Loaded Wordnet v. {wn.get_version()} with {len(wn.langs())} languages from OMW-1.4")

Loaded Wordnet v. 3.0 with 32 languages from OMW-1.4

wn.add_exomw()
print(f"Loaded {len(wn.langs())} languages in total with Extended OMW")

Loaded 1192 languages in total with Extended OMW

ss = wn.synset('example.n.01')
print(ss.lemma_names(lang="cmn"))

['事例', '例', '例子', '例证']

print(ss.lemma_names(lang="cmn_wikt"))

['例子', '例', '榜样', '例证']

Retracted:

print(ss.lemma_names(lang="qcn"))

['例子', '比方']

fcbond commented 2 years ago

Hi,

please do not redistribute the Persian and Chinese data, because of the quality issues. We asked you not to in #171, and you agreed not to, so I am surprised to see them here.

On Thu, Feb 17, 2022 at 9:37 PM Eric Kafe @.***> wrote:

This PR updates the "extended_omw" package with additional wordnets from the "wns" folder in the recent OMW 1.4 source release (retrieved from https://github.com/omwn/omw-data/archive/refs/tags/v1.4.zip).

In particular, this PR adds Persian ("fas") and an alternative Chinese wordnet ("qcn") which are included in NLTK's "omw" package, but were left out of omw-1.4 because of quality concerns (cf. discussions at #171 https://github.com/nltk/nltk_data/pull/171).

Everything in this PR was just copied verbatim from the upstream source release. As a consequence, all folders now include LICENSE and citation.bib files, so that the standard citation() and license() functions return appropriate information about the languages covered in extended_omw.

Sample use, assuming nltk/nltk#2946 https://github.com/nltk/nltk/pull/2946:

import nltk
from nltk.corpus import wordnet as wn
print(f"Loaded Wordnet v. {wn.get_version()} with {len(wn.langs())} languages from OMW-1.4")

Loaded Wordnet v. 3.0 with 32 languages from OMW-1.4

wn.add_exomw()
print(f"Loaded {len(wn.langs())} languages in total with Extended OMW")

Loaded 1194 languages in total with Extended OMW

ss = wn.synset('example.n.01')
print(ss.lemma_names(lang="cmn"))

['事例', '例', '例子', '例证']

print(ss.lemma_names(lang="cmn_wikt"))

['例子', '例', '榜样', '例证']

print(ss.lemma_names(lang="qcn"))

['例子', '比方']


-- Francis Bond https://fcbond.github.io/ Division of Linguistics and Multilingual Studies Nanyang Technological University

ekaf commented 2 years ago

@fcbond: Of course I will remove these two languages if you insist. My understanding was that these files were OK to include in an omw-extra package, since you still distribute them in the OMW-data source release. Also, since they are still included in NLTK's old "omw" package, there is a concern that users might miss them when upgrading to newer NLTK versions, and that old code would break. Lastly, the "fas" and "qcn" wordnets are supported by scientific papers (cf. their "citation.bib"), so the quality issues might not be much worse than in some other languages (e.g. French, which also has big quality issues). There may still be a concern about the vague licensing terms of the Persian wordnet, but maybe this could be resolved by asking the authors?

ekaf commented 2 years ago

According to https://github.com/nltk/nltk_data/pull/171#issuecomment-984387013, "Native speakers of Farsi and Mandarin have pointed out that these two resources have some quality issues".

It would be interesting to hear more about the severity of the alleged issues.

And wouldn't the same argument apply to all wordnets? In particular, many quality issues have been reported about Princeton Wordnet. Issues are also often raised in OEWN. Discussing the issues openly is a way to eventually solve them...

ekaf commented 2 years ago

Two languages ('fas' and 'qcn') were retracted, since @fcbond clearly does not allow their redistribution, cf. https://github.com/nltk/nltk_data/pull/183#issuecomment-1042811861.

The big wordnetwiktionaryalignments-2013-02-19.tsv file is not included, since there is no handler for it.

So the proposed update now consists of adding citation.bib files to the wikt and cldr folders, plus 3 updated Wiktionary wordnets, with the following numbers of lemmas:

 2567 wn-wikt-als.tab (Tosk Albanian)
 9337 wn-wikt-arb.tab (Standard Arabic)
25311 wn-wikt-spa.tab (Castilian)
37215 total
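Counts like these could be reproduced with a small script. The following is only a minimal sketch: it assumes that each lemma occupies one tab-separated row with at least three fields, that lines starting with '#' are headers, and that the reported figures are row counts. The function name count_lemmas and the sample rows are made up for illustration and are not taken from the package.

```python
import io


def count_lemmas(lines):
    """Count data rows: skip '#' header lines and rows with fewer than 3 fields."""
    total = 0
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            total += 1
    return total


# Made-up sample content (assumed layout: synset id, label, lemma).
sample_tab = io.StringIO(
    "# hypothetical header line\n"
    "00000001-n\tlemma\tword_a\n"
    "\n"
    "00000002-n\tlemma\tword_b\n"
)

sample_count = count_lemmas(sample_tab)
print(sample_count)
```

For a real file, the same function could be called on an open file handle, e.g. open("wn-wikt-als.tab", encoding="utf-8").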

stevenbird commented 2 years ago

@ekaf: sorry for the delay. I don't want to blow away the existing zipfile with a new one, but to replace individual files. Would you please help me out with a list of the required files? Is it:

wikt/wn-wikt-als.tab
wikt/wn-wikt-arb.tab
wikt/wn-wikt-spa.tab
wikt/citation.bib
cldr/citation.bib

I'm confused because you say: "3 new wiktionary wordnets", but those 3 files already exist. Also, I see a new top-level citation.bib file apart from wikt/citation.bib and cldr/citation.bib... what should happen to that?
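Replacing individual files rather than the whole zipfile could be done along these lines. This is a rough sketch using only Python's standard zipfile module, which cannot overwrite a member in place, so the archive is copied and the listed members are swapped during the copy. The helper name replace_members and the member paths in the demo (including the "extended_omw/" prefix) are assumptions for illustration, not the actual layout of the package.

```python
import io
import zipfile


def replace_members(src_zip, dst_zip, replacements):
    """Copy src_zip to dst_zip, substituting new bytes for the named members.

    replacements maps member name -> bytes; names missing from the
    source archive are appended as new members.
    """
    with zipfile.ZipFile(src_zip) as src, \
            zipfile.ZipFile(dst_zip, "w", zipfile.ZIP_DEFLATED) as dst:
        seen = set()
        for info in src.infolist():
            data = replacements.get(info.filename)
            if data is None:
                data = src.read(info.filename)  # unchanged member
            dst.writestr(info, data)
            seen.add(info.filename)
        for name, data in replacements.items():
            if name not in seen:
                dst.writestr(name, data)  # newly added member


# In-memory demo with hypothetical member paths and contents.
old_archive = io.BytesIO()
with zipfile.ZipFile(old_archive, "w") as z:
    z.writestr("extended_omw/wikt/wn-wikt-als.tab", b"old\n")
    z.writestr("extended_omw/README", b"keep\n")

new_archive = io.BytesIO()
replace_members(
    old_archive,
    new_archive,
    {
        "extended_omw/wikt/wn-wikt-als.tab": b"new\n",
        "extended_omw/wikt/citation.bib": b"placeholder\n",
    },
)

with zipfile.ZipFile(new_archive) as z:
    new_names = z.namelist()
    new_als = z.read("extended_omw/wikt/wn-wikt-als.tab")
    kept_readme = z.read("extended_omw/README")
```

On disk, src_zip and dst_zip would be file paths; writing to a separate destination keeps the existing zipfile untouched until the result has been checked.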

ekaf commented 2 years ago

@stevenbird, yes, your list is accurate. The top-level citation.bib refers to the whole OMW project and should be added as well. It is true that the old package already contains files for the 3 "new" wordnets, but the old ones have huge issues: almost all the als lemmas are misplaced into the spa file, while the arb lemmas are erroneously marked lemma:arn, which is the identifier for the Mapudungun (Mapuche) language. Everything in the new package is copied verbatim from the newer OMW-1.4 source, which brings one additional change: many of the wordnet filenames in the wikt folder used to contain an asterisk '*', which is now replaced by an 'X'. Avoiding '*' in filenames may not always be crucial, but it is safer.
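The two kinds of problems described here (rows labelled with the wrong language code, and '*' in filenames) could be checked with something like the sketch below. It rests on assumptions: that the second tab-separated field of each row is a label combining "lemma" with a language code separated by ':' (in either order), and that the language a file should contain is known from its name. The helper names, sample rows and sample filename are hypothetical.

```python
def label_language(label):
    """Return the language code from a 'lemma:xxx' or 'xxx:lemma' label, else None."""
    parts = label.split(":")
    if "lemma" not in parts:
        return None
    codes = [p for p in parts if p != "lemma"]
    return codes[0] if codes else None


def mismatched_rows(lines, expected_lang):
    """Collect rows whose label names a language other than expected_lang."""
    bad = []
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        lang = label_language(fields[1])
        if lang is not None and lang != expected_lang:
            bad.append(fields)
    return bad


def safe_filename(name):
    """Replace '*' with 'X', mirroring the renaming described above."""
    return name.replace("*", "X")


# Made-up rows for a file that is supposed to contain only 'arb' lemmas.
sample_rows = [
    "# hypothetical header\n",
    "00000001-n\tlemma:arb\tword_a\n",
    "00000002-n\tlemma:arn\tword_b\n",
]

bad_rows = mismatched_rows(sample_rows, "arb")
renamed = safe_filename("wn-wikt-a*b.tab")
print(bad_rows, renamed)
```

Accepting both label orders keeps the check usable even if the real files put the language code before "lemma" rather than after it.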

stevenbird commented 3 months ago

Hi @ekaf, @fcbond, sorry for the long delay on this! Can you please suggest the simplest way for me to get the current files? Perhaps a full drop-in replacement for NLTK's extended_omw.zip (minus anything @fcbond doesn't want included)?

ekaf commented 3 months ago

Hi @stevenbird, thanks for your interest :) Yes, this package is a drop-in update of @ExplorerFreda's original package. I think it is ok, except that there is now a newer webpage URL (https://omwn.org) to include in _extendedomw.xml. The topmost README file might also benefit from some editing.