nltk / nltk_data

Upgrade omw to 1.4 #171

Closed ekaf closed 2 years ago

ekaf commented 2 years ago

This new version of the omw package contains the multilingual wordnets from the wns folder in the new OMW 1.4 release, retrieved from https://github.com/omwn/omw-data/archive/refs/tags/v1.4.zip

To take advantage of the new features, this package needs upgraded OMW-related functions (https://github.com/nltk/nltk/pull/2899) in NLTK's wordnet.py module.
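
Once merged, users would fetch the updated data the usual way (assuming the package keeps its current omw id in the nltk_data index):

import nltk

# Downloads corpora/omw.zip and unpacks it under ~/nltk_data/corpora/omw
nltk.download('omw')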

goodmami commented 2 years ago

It's probably not the best idea to just package the wns/ subdirectory from OMW wholesale.

The wordnets included in OMW 1.4 that aren't currently in the NLTK are: isl, iwn, slk, and rom.

It might be better to hand-select the subdirectories of wns/ that should be included in the NLTK.

ekaf commented 2 years ago

Thanks @goodmami! I completely agree about removing unnecessary subdirectories from the omw package. However, these directories are part of the recent release 1.4 of omw-data (https://github.com/omwn/omw-data), so they were deemed good-to-go by the upstream data provider. Maybe your concerns could be more effectively addressed there.

Looking at the specific details, the four languages mentioned as not included in this package (isl, iwn, slk and rom) are indeed there, as shown in the NLTK output below. It appears that "rom" is called "ron" in OMW-data, and if this is a problem, it should be addressed upstream.

Assuming the updated OMW functions from the wordnet.py PR nltk/nltk#2899 (https://github.com/nltk/nltk/pull/2899):

import nltk
from nltk.corpus import wordnet31 as wn
print(f"Wordnet v. {wn.get_version()}\n")

Wordnet v. 3.1

print(wn.langs())

dict_keys(['eng', 'als', 'arb', 'bul', 'cmn', 'qcn', 'dan', 'ell', 'eng_eng', 'fas', 'fin', 'fra', 'heb', 'hrv', 'isl', 'ita', 'ita_iwn', 'jpn', 'cat', 'eus', 'glg', 'spa', 'ind', 'zsm', 'nld', 'nno', 'nob', 'pol', 'por', 'ron', 'lit', 'slk', 'slv', 'swe', 'tha'])

ss = wn.synset('dog.n.01')
for lg in ["isl","ita_iwn","ron","slk"]:
    print(f"{lg} lemmas:{ss.lemmas(lang=lg)}")

isl lemmas:[Lemma('dog.n.01.rakki'), Lemma('dog.n.01.seppi'), Lemma('dog.n.01.hundur'), Lemma('dog.n.01.hvutti')]
ita_iwn lemmas:[Lemma('dog.n.01.cane')]
slk lemmas:[Lemma('dog.n.01.pes'), Lemma('dog.n.01.pesdomáci(Canis_familiaris)')]
ron lemmas:[Lemma('dog.n.01.câine')]

I agree that the licensing terms of the Farsi Wordnet could be clearer:

print(wn.license(lang='fas'))

Mortaza Montazery <mortaza.gh@gmail.com>, 9 Jan, to Francis:

This version of Persian WordNet is the result of my research on automatic construction of WordNet. It's free and you can use it everywhere you want.

But here again, upstream has found these terms ok. Otherwise, you could consider having this language removed upstream.

Concerning cwn/, it contains the same open-source license as PWN, so where is the problem?

The cldr/ and wikt/ subdirectories do not contain any wn-data-<lang>.tab file, so they are just ignored by the wordnet module. However, the wordnets extracted from Wiktionary in particular seem very interesting, and would be fine to include as experimental data for the more curious users.

In conclusion, it is indeed possible to manually remove some subdirectories, but these do no harm, as they are correctly handled (or ignored) by the wordnet.py PR nltk/nltk#2899 (https://github.com/nltk/nltk/pull/2899).

I propose to eventually remove "eng", "en30", "en31", "pwn31", (and maybe "fas"), but to keep "cwn", "cldr" and "wikt".

fcbond commented 2 years ago

Hi,

I think @goodmami and I are equally upstream in this case.

On Thu, Dec 2, 2021 at 3:18 PM Eric Kafe wrote:

Thanks @goodmami https://github.com/goodmami! I completely agree about removing unnecessary subdirectories from the omw package. However, these directories are part of the recent release 1.4 of omw-data ( https://github.com/omwn/omw-data), so they were deemed good-to-go by the upstream data provider. Maybe your concerns could be more effectively addressed there.

The tab files in question are kept in the repository in case people want to access them for historical purposes, but they are not recommended for use, and are not released as LMF: https://github.com/omwn/omw-data/releases/tag/v1.4

I agree that the licensing terms of the Farsi Wordnet could be clearer [...]

Concerning cwn/, it contains the same open-source license as PWN, so where is the problem?

Native speakers of Farsi and Mandarin have pointed out that these two resources have some quality issues, in addition to the licensing issues for Farsi, which is why we have removed them from the recommended release. Please do not export them to NLTK.

The cldr/ and wikt/ subdirectories do not contain any wn-data-<lang>.tab file, so they are just ignored by the wordnet module. However, the wordnets extracted from Wiktionary in particular seem very interesting, and would be fine to include as experimental data for the more curious users.

I am fine with that.


I propose to eventually remove "eng", "en30", "en31", "pwn31", (and maybe "fas"), but to keep "cwn", "cldr" and "wikt".

pwn31 has licensing issues so should be removed, and eng, en30 and en31 are pointless to include. 'fas' and 'cwn' should be removed because of the quality issues.

Yours,

--
Francis Bond <https://fcbond.github.io/>
Division of Linguistics and Multilingual Studies
Nanyang Technological University

goodmami commented 2 years ago

However, these directories are part of the recent release 1.4 of omw-data (https://github.com/omwn/omw-data), so they were deemed good-to-go by the upstream data provider.

When Francis and I created the release on GitHub, GitHub automatically included the full repository source as .zip and .tar.gz files attached to the release (I'm not sure how to prevent these files from being created). Please don't take those files to be the contents of the release. Instead look at the wordnets specifically listed here: https://github.com/omwn/omw-data/releases/tag/v1.4.

Looking at the specific details, the four languages mentioned as not included in this package (isl, iwn, slk and rom) are indeed there, as shown in the NLTK output below. It appears that "rom" is called "ron" in OMW-data, and if this is a problem, it should be addressed upstream.

Sorry, my mistake regarding rom; ron is correct. Regarding the presence of these languages, perhaps you are using your development version of nltk_data to inspect this, because I don't see them:

>>> from nltk.corpus import wordnet as wn
>>> wn.langs()
['eng', 'als', 'arb', 'bul', 'cat', 'cmn', 'dan', 'ell', 'eus', 'fas', 'fin', 'fra', 'glg', 'heb', 'hrv', 'ind', 'ita', 'jpn', 'nld', 'nno', 'nob', 'pol', 'por', 'qcn', 'slv', 'spa', 'swe', 'tha', 'zsm']
>>> wn.synsets('plantă', lang='ron')
Traceback (most recent call last):
...
nltk.corpus.reader.wordnet.WordNetError: Language is not supported.

I also downloaded the omw.zip file from this repository and I don't see the four languages in the archive.

The cldr/ and wikt/ subdirectories do not contain any wn-data-<lang>.tab file, so they are just ignored by the wordnet module. However, the wordnets extracted from Wiktionary in particular seem very interesting, and would be fine to include as experimental data for the more curious users.

I am fine with that.

Ok, but if they are just ignored, then what's the benefit? It will increase the disk/bandwidth requirements for omw.zip and users won't be able to see the data unless they dig around their nltk_data folder. Why not just direct those curious users to the omw-data repository? However, I do see .tab files under cldr/ and wikt/, so I'm a bit confused by your statement that there are none.

ekaf commented 2 years ago

@goodmami, there are tab files in cldr and wikt, but wordnet.py ignores them because they are named with a different pattern than wn-data-*.tab. The Wiktionary data seems interesting, @fcbond is "fine with that", and 26 MB is not an overwhelming amount nowadays. But I agree that removing these files would yield a leaner package. So, if you prefer, I could remove cldr and wikt from the proposed nltk_data package, in addition to the following:

pwn31 has licensing issues so should be removed, and eng, en30 and en31 are pointless to include. 'fas' and 'cwn' should be removed because of the quality issues.

goodmami commented 2 years ago

there are tab files in cldr and wikt, but wordnet.py ignores them because they are named with a different pattern than wn-data-*.tab

Ok thanks, I see the issue now.

So, if you prefer, I could remove cldr and wikt from the proposed nltk_data package, in addition to the following: ...

The extra bits wouldn't bother me as much if they were usable by the NLTK. I think the filename issue is easily resolved with some changes to the NLTK code, but otherwise I would prefer to exclude the data if they are simply dead weight in the package. A second issue is that the Wiktionary and CLDR files are not in fact part of the OMW 1.4 release (they were not packaged as WN-LMF files). It might make sense to put them in a separate package, e.g., omw-extra. But I'll leave the issue of their inclusion up to you and @fcbond.
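
To make the mismatch concrete, here is a minimal sketch of the kind of glob check involved (the file names are made up for illustration, and the real logic in wordnet.py is organised differently):

from fnmatch import fnmatch

# wordnet.py only picks up files matching wn-data-*.tab,
# so differently-prefixed files are silently skipped.
candidates = [
    "dan/wn-data-dan.tab",   # matches: loaded as language 'dan'
    "wikt/wn-wikt-dan.tab",  # no match: ignored
    "cldr/wn-cldr-dan.tab",  # no match: ignored
]
for path in candidates:
    filename = path.split("/")[-1]
    status = "loaded" if fnmatch(filename, "wn-data-*.tab") else "ignored"
    print(f"{path}: {status}")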

ekaf commented 2 years ago

Thanks @goodmami and @fcbond! I have updated the package according to all your suggestions.

Here are new statistics, using OEWN 2021 and the updated wordnet.py (https://github.com/nltk/nltk/pull/2899):

import nltk
from nltk.corpus import wordnet2021 as wn
print(f"Wordnet v. {wn.get_version()}\n")

Wordnet v. 2021

for lg in sorted(wn.langs()):
    print(f"{lg}: {len(list(wn.words(lang=lg)))} words")

als: 5988 words
arb: 17785 words
bul: 6720 words
cat: 46531 words
cmn: 61533 words
dan: 4468 words
ell: 18225 words
eng: 151770 words
eus: 26240 words
fin: 129839 words
fra: 55351 words
glg: 23124 words
heb: 5325 words
hrv: 29008 words
ind: 36954 words
isl: 11504 words
ita: 41855 words
ita_iwn: 19221 words
jpn: 91964 words
lit: 11395 words
nld: 43077 words
nno: 3387 words
nob: 4186 words
pol: 45387 words
por: 54071 words
ron: 49987 words
slk: 29150 words
slv: 40230 words
spa: 36681 words
swe: 5824 words
tha: 82504 words
zsm: 33932 words

ekaf commented 2 years ago

This allows interesting comparisons with the Wn module. Having two different modules that accomplish the same tasks could make verification and debugging easier.
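
For instance, a rough cross-check of the word counts above against the Wn module might look like this (a sketch: omw:1.4 is the OMW collection name in Wn's index, and the exact lexicon identifiers can be checked with wn.lexicons()):

# pip install wn
import wn

wn.download('omw:1.4')  # fetch the OMW 1.4 collection into Wn's local database

# Word count per lexicon, comparable to the NLTK figures above
for lex in wn.lexicons():
    n = len(wn.words(lexicon=f'{lex.id}:{lex.version}'))
    print(f'{lex.id}: {n} words')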

goodmami commented 2 years ago

Thanks, @ekaf, that looks better. A couple of other things I noticed, in comparison to the current package distributed by the NLTK: the per-language directories contain extra files beyond the .tab data, and multi-language projects such as mcr are no longer split into one directory per language.

ekaf commented 2 years ago

@goodmami it sounds like you are looking at the old omw package instead of the new one proposed in this PR.

goodmami commented 2 years ago

I was comparing the zip file in your fork (https://raw.githubusercontent.com/ekaf/nltk_data/omw14/packages/corpora/omw.zip) to what I currently have in ~/nltk_data/corpora/omw/. For example (omw14 below is from the omw.zip in your fork):

$ ls omw14/arb/
arb2tab.py  awnCopyright.pdf  citation.bib  LICENSE  README  wn-data-arb.tab  wn-nodia-arb.tab
$ ls nltk_data/corpora/omw/arb/
citation.bib  LICENSE  README  wn-data-arb.tab

Please note that the wns/ directory in the omw-data repository is not a ready-to-go dataset; rather, it contains the source files used in creating the OMW 1.4 release (which for us are the XML files), as well as the historical artifacts of the creation of those tab files. My suggestion is to not include anything but citation.bib, LICENSE, README, and wn-data-<lang>.tab, as the other files are irrelevant for NLTK users and were not included in the initial and current NLTK package.
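
Pruning down to just those four file types is simple enough; a sketch in Python (the omw directory name is hypothetical, adjust to wherever the tree is unpacked):

from pathlib import Path

KEEP = {"citation.bib", "LICENSE", "README"}

for path in Path("omw").glob("*/*"):
    keep = path.name in KEEP or (
        path.name.startswith("wn-data-") and path.suffix == ".tab"
    )
    if path.is_file() and not keep:
        print(f"removing {path}")
        path.unlink()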

I'm ambivalent about splitting out the directories with multiple .tab files, like mcr. They were split previously, but it's not a big deal to remain together if they can still be loaded in the NLTK. If they are split, the metadata (LICENSE, README, citation.bib) should be copied to each subdirectory, as is done currently:

$ ls omw14/mcr/
citation.bib  LICENSE  mcr2tab.py  wn-data-cat.tab  wn-data-eus.tab  wn-data-glg.tab  wn-data-spa.tab
$ ls nltk_data/corpora/omw/cat/
citation.bib  LICENSE  wn-data-cat.tab

ekaf commented 2 years ago

@goodmami, I don't understand why you mention splitting up the mcr folder. That was done previously, but it is not split in the new omw14 package, since the package is identical to the wns folder in the original 1.4 release of OMW-data (https://github.com/omwn/omw-data/archive/refs/tags/v1.4.zip), except that the agreed problematic subdirectories have been removed.

As you write,"wns" is the source directory, so it is just what we need for an open-source project. The intent of this PR is not to duplicate the LMF release, but to give users of NLTK's current omw package access to the updated OMW data.

Concerning the two supplementary files in the 'arb' folder, they really are no big deal, and might interest some users. Otherwise, why did you include them in your OMW 1.4 package? Maybe you could consider releasing an upstream package that better conforms with your wishes.

fcbond commented 2 years ago

G'day,

The scripts for making the wordnets are there for people who want to make more (or extract more), not the average NLTK user. I don't think they should be packaged up as data. People who want to see them should be downloading the whole omw-data repository.

Here is my (out-of-date) script for packaging the NLTK release, which just copies the files I consider relevant and does a little cleanup of the data. It does not expect two wordnets for the same language, so it loses an Italian wordnet, ...

I stopped maintaining it, as I think my time is better spent getting the new version of OMW working, but I agree it is good to keep the current version as up-to-date as possible, so thanks!

The URL for the project should change to https://omwn.org/, as compling will be retired soon. I plan to start making the new site useful from next week.

#! /bin/bash

RELEASE=release/nltk
OMW="$RELEASE"/omw

echo "Copying files to omw.zip"
rm -rf "$RELEASE"

mkdir -p "$OMW"
cp wns/citation.bib "$OMW"/. 
#cp www/LICENSE "$OMW"/.
cat wns/eng/citation.bib >> "$OMW"/citation.bib

for f in `ls wns/*/wn-data*.tab`
do
    drsn=`dirname $f`
    proj=`basename $drsn`
    file=`basename $f .tab`
    ### EXCLUDING WIKT, CLDR and English Data
    if [[ $proj != wikt && $proj != cldr && $proj != eng ]]
    then
        lang=${file:8}    # strip the 'wn-data-' prefix to get the language code
        #echo $proj $lang $file
        ### copy stuff
        echo adding ${lang} from ${proj} to zip
        echo making dir
        mkdir -p "$OMW"/${lang}
        echo copy data
        ### copy the header (first line) of the tab file
        head -1 ${f} > "$OMW"/${lang}/${file}.tab
        ### copy lemma triples, sorted with duplicates removed
        grep -P "^[^\t]+\t[^\t]*lemma\t[^\t]+$" ${f} | sort -u \
            >> "$OMW"/${lang}/${file}.tab
        cp wns/$proj/LICENSE "$OMW"/${lang}/.
        cp wns/$proj/citation.bib "$OMW"/${lang}/.
        cat wns/$proj/citation.bib >> "$OMW"/citation.bib
        ### upstream projects use either README or Readme
        cp wns/$proj/README "$OMW"/${lang}/README
        cp wns/$proj/Readme "$OMW"/${lang}/README
    fi
done

wns=`ls "$OMW"/*/wn-data*.tab | cut -d/  -f 3`
numwns=`ls "$OMW"/*/wn-data*.tab | cut -d/  -f 3 | wc -l`
today=`date --rfc-3339=date`
cat <<EOT >> "$OMW"/README
This data is formatted by the Open Multilingual Wordnet Project
to be used by NLTK.

Please cite us if you find the aggregation useful (see citation.bib)
and email us if you have any suggestions.

Francis Bond (bond@ieee.org)
${today}

EOT
echo "${numwns} languages covered (and we assume you have English):" >> "$OMW"/README
echo  "${wns} "  >> "$OMW"/README
echo  >> "$OMW"/README

pushd "$RELEASE"
zip -r9 omw.zip omw/*
#rm -rf omw
popd

echo "omw.zip created in $RELEASE"

goodmami commented 2 years ago

...since this is just identical with the 'wns' folder in the original 1.4 release of OMW-data (https://github.com/omwn/omw-data/archive/refs/tags/v1.4.zip)...

It seems there is still some confusion about what the 1.4 release is. As I mentioned above, the .zip file you're looking at is not the release. It is simply an archive, automatically created by GitHub, of the state of the repository when we created the release. I would remove that file from the release page if I could. Since I cannot, I have now updated the text on the release page to make this clearer.

Maybe you could consider releasing an upstream package that better conforms with your wishes.

We have. It contains the WN-LMF files :wink: https://github.com/omwn/omw-data/releases/download/v1.4/omw-1.4.tar.xz

Again, thanks for your efforts to update the OMW data in the NLTK. Hopefully it's now clearer which files we consider user-facing and which are just meant for OMW maintainers. It sounds like Francis and I should update the omw-data project README, as well.

ekaf commented 2 years ago

Thanks for your packaging script @fcbond: it was very useful as a starting point. However, it doesn't handle multiple wordnets for the same language. For this it is necessary to differentiate between language and project names, so the new directory structure is more future-proof. Also, some wordnets now include definition and example triples. So I rebuilt omw.zip using an adapted script.
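
For reference, each data line in these tab files is a triple whose second column looks like <lang>:<type>; a minimal counter sketch under that assumption (the file path is hypothetical):

from collections import Counter

def triple_counts(tabfile):
    # Count triple types (lemma, def, exe) in an OMW tab file
    counts = Counter()
    with open(tabfile, encoding="utf-8") as f:
        for line in f:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and the header line
            offset_pos, typ, rest = line.rstrip("\n").split("\t", 2)
            counts[typ.split(":")[1]] += 1  # 'ron:lemma' -> 'lemma'
    return counts

print(triple_counts("omw/ron/wn-data-ron.tab"))  # hypothetical path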

Thanks for your patience @goodmami!

Edited: corrected a bug in the packaging script that duplicated the first line in each wordnet. Here is the new script: Omw2nltk.sh.txt

fcbond commented 2 years ago

Great,

glad my script helped @ekaf

stevenbird commented 2 years ago

@ekaf, @fcbond, @goodmami – it looks like you have consensus now... am I good to merge this?

ekaf commented 2 years ago

@stevenbird, I also think it looks ok now

fcbond commented 2 years ago

Yes I think it's ready to go.


goodmami commented 2 years ago

I inspected the latest omw.zip from @ekaf's fork, and it looks good to me!

Thanks, all!

stevenbird commented 2 years ago

Thanks @ekaf, @fcbond, @goodmami!