nltk / nltk_data


Add multilingual wordnet #9

Closed stevenbird closed 8 years ago

stevenbird commented 10 years ago

@francisbond is contributing the Open Multilingual Wordnet to NLTK (http://www.casta-net.jp/~kuribayashi/multi/).

We need to settle on a short name to use: multiwordnet?

fcbond commented 10 years ago

There is an Italian project called 'MultiWordNet' so I would like to avoid just 'multiwordnet'. How about omw?

stevenbird commented 10 years ago

OK. We're often writing "from nltk.corpus import wordnet as wn", and so wn has gained some currency as an abbreviation for WordNet.

We could have omwn. But in a world where openness is the unmarked case, we could have mwn.

Do either of these appeal or would you still prefer omw?

fcbond commented 10 years ago

G'day,


I also like to think of openness as the default, but 'mwn' is still a bit close to MultiWordNet. I guess omwn is OK, although I have a slight preference for 'omw'. 'wngrid' is another possibility: it is the name chosen by the Global WordNet Association, and we are now the current implementation.

Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University

stevenbird commented 10 years ago

OK, omw it is then, thanks.

stevenbird commented 10 years ago

The list of languages in the supplied omw corpus is as follows. I think fre is spurious (a copy of fra) and we seem to be missing ind even though it is mentioned in the documentation.

als arb cmn dan eng fas fin fra fre heb ita jpn mcr msa nor pol por tha

@fcbond would you please advise.

fcbond commented 10 years ago

The current list is as follows:

langs = ("eng", "ind", "zsm", "jpn", "tha", "cmn", "qcn", "fas", "arb", "heb", "ita", "por", "nob", "nno", "dan", "swe", "fra", "fin", "ell", "glg", "cat", "spa", "eus", "als", "pol", "slv")

We use qcn for traditional Chinese (and the slightly differently designed NTU, Taiwan Chinese Wordnet).

We will try to upload a new omw.zip sometime today.

from collections import defaultdict as dd  # missing import; `unicode` is Python 2 (use str on Python 3)

# Indexing convention: t[thing][lang] = label, i.e. the name of language
# `thing` as written in language `lang`.
t = dd(lambda: dd(unicode))

t['eng']['eng'] = 'English'; t['eng']['ind'] = 'Inggeris'; t['eng']['zsm'] = 'Inggeris'
t['ind']['eng'] = 'Indonesian'; t['ind']['ind'] = 'Bahasa Indonesia'; t['ind']['zsm'] = 'Bahasa Indonesia'
t['zsm']['eng'] = 'Malaysian'; t['zsm']['ind'] = 'Bahasa Malaysia'; t['zsm']['zsm'] = 'Bahasa Malaysia'
t['msa']['eng'] = 'Malay'

t['swe']['eng'] = 'Swedish'; t['ell']['eng'] = 'Greek'
t['cmn']['eng'] = 'Chinese (simplified)'; t['qcn']['eng'] = 'Chinese (traditional)'
t['eng']['cmn'] = u'英语'
t['cmn']['cmn'] = u'汉语'; t['cmn']['qcn'] = u'汉语'
t['qcn']['cmn'] = u'漢語'; t['qcn']['qcn'] = u'漢語'
t['jpn']['cmn'] = u'日语'; t['jpn']['qcn'] = u'日语'

t['als']['eng'] = 'Albanian'; t['arb']['eng'] = 'Arabic'; t['cat']['eng'] = 'Catalan'
t['dan']['eng'] = 'Danish'; t['eus']['eng'] = 'Basque'; t['fas']['eng'] = 'Farsi'
t['fin']['eng'] = 'Finnish'; t['fra']['eng'] = 'French'; t['glg']['eng'] = 'Galician'
t['heb']['eng'] = 'Hebrew'; t['ita']['eng'] = 'Italian'; t['jpn']['eng'] = 'Japanese'
t['mkd']['eng'] = 'Macedonian'; t['nno']['eng'] = 'Nynorsk'; t['nob']['eng'] = u'Bokmål'
t['pol']['eng'] = 'Polish'; t['por']['eng'] = 'Portuguese'; t['slv']['eng'] = 'Slovene'
t['spa']['eng'] = 'Spanish'; t['tha']['eng'] = 'Thai'
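
A quick way to check which of the announced languages a local install actually exposes; a minimal sketch, assuming the wordnet and omw packages have already been downloaded:

from nltk.corpus import wordnet as wn
expected = set(langs)  # the tuple of language codes above
print(sorted(expected - set(wn.langs())))  # announced but not installed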

franquattri commented 9 years ago

Hi, I've got the same problem that somebody posted on Quora some months ago: "I can call from nltk.corpus import sinica_treebank, but when I call from nltk.corpus import omw the result is: cannot import name omw / No module named omw."

I checked the downloader and omw is installed. I am using Python 2.7. Other modules work fine. Any clues? Thanks in advance.

franquattri commented 9 years ago

One just needed to read the NLTK cookbook more carefully. You don't need to import an 'omw' module; you can access it directly by simply importing wordnet (wn). More under: http://www.nltk.org/howto/wordnet.html
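
A minimal sketch of that access pattern (the lang parameter selects an OMW language; 'ita' here is just an example):

from nltk.corpus import wordnet as wn        # OMW is served through the regular wordnet reader
wn.synsets('cane', lang='ita')               # look up an Italian word through the English synsets
wn.synset('dog.n.01').lemma_names('ita')     # Italian lemma names for an English synset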

alvations commented 9 years ago

A user reported missing spanish lemmas from OMW: http://stackoverflow.com/questions/26474731/missing-spanish-wordnet-from-nltk/26494099#26494099

DarrenCook commented 9 years ago

@franquattri It would be useful if the howto showed full installation instructions. On Ubuntu 14.04, with the data URL fixed (http://askubuntu.com/a/527408/93794), I have wordnet and omw installed (I see them under ~/nltk_data/corpora), but when I follow through http://www.nltk.org/howto/wordnet.html a lot of the examples fail, in particular wn.langs() fails with "AttributeError: 'WordNetCorpusReader' object has no attribute 'langs'". Is that manual for a specific version?

franquattri commented 9 years ago

Hi Darren, the manual has been updated for NLTK 3.0, but it should work fine with previous NLTK versions too. I'm working with Windows, Python 2.7 and IPython (which I also suggest for Unicode matters). Both attempts work for me:

from nltk.corpus import wordnet as wn
wn.langs()

and

from nltk.corpus import wordnet as wn
sorted(wn.langs())  # as shown here: http://www.nltk.org/howto/wordnet.html

Can you be more specific about the examples that fail?

alvations commented 9 years ago

@DarrenCook, there are discrepancies between the API, the documentation and the nltk_data, but I'm sure the OMW team will fix it and the documentation will follow shortly.

Please note that Catalan seems to be missing from wn.langs() although it's in the MCR.

>>> import nltk
>>> nltk.__version__
'3.0.0'
>>> nltk.download('omw')
[nltk_data] Downloading package omw to /home/alvas/nltk_data...
[nltk_data]   Package omw is already up-to-date!
True

>>> from nltk.corpus import wordnet as wn
>>> wn.langs()
[u'als', u'arb', u'cmn', u'dan', u'eng', u'fas', u'fin', u'fra', u'fre', u'heb', u'ita', u'jpn', u'cat', u'eus', u'glg', u'spa', u'ind', u'zsm', u'nno', u'nob', u'pol', u'por', u'tha']
>>> exit()
alvas@ubi:~$ cd ~/nltk_data/corpora/omw/
alvas@ubi:~/nltk_data/corpora/omw$ ls
als  cmn  eng  fin  fre  ita  mcr  nor  por     tha
arb  dan  fas  fra  heb  jpn  msa  pol  README

alvas@ubi:~/nltk_data/corpora/omw$ cd mcr/
alvas@ubi:~/nltk_data/corpora/omw/mcr$ ls
LICENSE     wn-data-cat.tab  wn-data-glg.tab  wn-data-spa.tab.gz
mcr2tab.py  wn-data-eus.tab  wn-data-spa.tab
DarrenCook commented 9 years ago

nltk.__version__
'2.0b9'

Is that too old?

(apt-get install python-nltk tells me "python-nltk is already the newest version.")

Working through the examples, the first one that fails is "print(wn.synset('dog.n.01').definition())", which says "TypeError: 'str' object is not callable". The three commands before that worked fine.

alvations commented 9 years ago

Using pip install -U nltk would update to 3.0.0. apt-get is still holding the older version.

With regards to accessing synsets from the wordnet API in NLTK, i think the major change would be https://github.com/nltk/nltk/commit/ba8ab7e23ea2b8d61029484098fd62d5986acd9c
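
Concretely, the failure above is 3.x-style code running on a 2.x install: in NLTK 2.x these accessors were properties, so adding parentheses calls the string they return. A sketch of the difference:

from nltk.corpus import wordnet as wn
wn.synset('dog.n.01').definition    # NLTK 2.x: a property holding a plain string
wn.synset('dog.n.01').definition()  # NLTK 3.x: a method; on 2.x this calls a str, hence the TypeError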

Possibly you'll find errors from nltk.download() too, if you're using the apt-get branch of NLTK, see http://askubuntu.com/questions/527388/python-nltk-on-ubuntu-12-04-lts-nltk-downloadbrown-results-in-html-error-40

See also the Change Log (https://github.com/nltk/nltk/blob/develop/ChangeLog) and the NLTK 3.0 API changes (https://github.com/nltk/nltk/wiki/Porting-your-code-to-NLTK-3.0).

franquattri commented 9 years ago

@DarrenCook are you sure you have installed NLTK correctly? You can take a look here: http://www.nltk.org/install.html

To find out which NLTK version you have:

import nltk
nltk.__version__

To update NLTK / modules (for Windows) > Command Prompt:

python -m pip install --upgrade SomePackage

Are you using the WN version that comes with NLTK (WN 3.0) or the newest release (i.e. have you imported it into NLTK)? There might be some issues for that reason as well.

DarrenCook commented 9 years ago

Thanks @alvations and Francesca for your help. These two commands got everything working:

sudo apt-get install python-pip
sudo pip install -U nltk

@franquattri I think I may have downloaded the latest wordnet, while having the 2.0b9 of nltk installed, so maybe that was the issue.

franquattri commented 9 years ago

Hi, does anybody know of multilingual FrameNets (apart from the English FrameNet) that can be searched with NLTK?

bryant1410 commented 8 years ago

This is already done, isn't it?

stevenbird commented 8 years ago

Thanks @bryant1410. Yes, this is resolved.

nicoleljc1227 commented 7 years ago

I downloaded cow from http://globalwordnet.org/wordnets-in-the-world/ to process Chinese. How can I use cow in Python? For example, after from nltk.corpus import wordnet as wn, how can I use cow?

fcbond commented 7 years ago

cow is already included in omw (the Open Multilingual Wordnet), so if you download that from the normal download interface, you can access cow with lang='cmn'. E.g. for Japanese:

>>> wn.synsets('dog')[0].lemmas(lang='jpn')
[Lemma('dog.n.01.イヌ'), Lemma('dog.n.01.ドッグ'), Lemma('dog.n.01.洋犬'), Lemma('dog.n.01.犬')]
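
The same pattern with lang='cmn' reaches the cow data; a minimal sketch (output omitted, since which lemmas come back depends on the installed data):

>>> wn.synsets('dog')[0].lemma_names(lang='cmn')  # Chinese lemmas for the same synset
>>> wn.synsets(u'狗', lang='cmn')                  # or look a Chinese word up directly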


-- Francis Bond http://www3.ntu.edu.sg/home/fcbond/ Division of Linguistics and Multilingual Studies Nanyang Technological University

tvrbanec commented 4 years ago

Can we use wn.synsets('dog')[0].lemmas(lang='jpn') with more than one language at once, i.e. wn.synsets('dog')[0].lemmas(lang='jpn, ita')?
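
For reference, lang expects a single ISO 639-3 code, so querying several languages means one call per language; a minimal sketch:

from nltk.corpus import wordnet as wn
syn = wn.synsets('dog')[0]
by_lang = {lang: syn.lemmas(lang=lang) for lang in ('jpn', 'ita')}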