nltk / nltk_data

NLTK Data

error when trying to import panlex_swadesh #117

Open lingdoc opened 6 years ago

lingdoc commented 6 years ago

When I try to import the Panlex Swadesh word lists like this:

>>> from nltk.corpus import panlex_swadesh

I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name panlex_swadesh

I can access the data files in my nltk_data folder, and the corpus downloader says they exist and are up to date, but I can't figure out how to read them using nltk in Python. If the access method is different from other corpora, or has somehow changed, this should probably be documented somewhere.

alvations commented 6 years ago

TL;DR

To access the panlex_swadesh word lists:

from nltk.corpus import swadesh110, swadesh207

for lang in swadesh110.fileids():
    for concept in swadesh110.words(lang):
        lemmas = concept.split('\t')

The usage is similar for swadesh207.
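As a rough illustration of what the split above produces, here is a minimal sketch that runs without NLTK or the downloaded data. The sample lines are made up and stand in for what swadesh110.words(lang) would return; split_lemmas is a hypothetical helper name, not part of NLTK.

```python
def split_lemmas(concept_line):
    """Split one tab-separated concept line into a list of lemmas.

    Empty fields (e.g. from a trailing tab) are dropped.
    """
    return [lemma.strip() for lemma in concept_line.split("\t") if lemma.strip()]


# Made-up sample lines; in real use these would come from
# swadesh110.words(lang) for one fileid.
sample_lines = ["I", "big\tlarge", "small\tlittle\t"]

concepts = [split_lemmas(line) for line in sample_lines]
print(concepts)
```

Each inner list then holds the alternative words for one concept.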


@stevenbird maybe it would be good to have a better PanLex Swadesh list API, given that the fileids are currently file paths rather than language codes or names. It wouldn't be hard for us to add a dictionary of language codes and access the lists with something like:

from nltk.corpus import swadesh110, swadesh207

for lang_code in swadesh110.languages():  # Returns a list of language codes.
    swadesh110.lang_name(lang_code)  # Returns the language name.
    for words in swadesh110.entry(lang_code):  # Returns a list of concepts.
        print(words)  # A list of words for the specific concept.
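Until something like that exists, the idea can be approximated with a thin wrapper around any reader that exposes fileids() and words(). This is only a sketch: SwadeshByLanguage, languages() and entry() are hypothetical names, FakeReader is a stand-in so the example runs without NLTK data, and treating the file name (minus directory and extension) as the language code is an assumption about how the files are named.

```python
import posixpath


class SwadeshByLanguage:
    """Hypothetical wrapper that indexes a word-list reader by language code."""

    def __init__(self, reader):
        self._reader = reader
        # Assumption: the file name without directory and extension
        # can serve as the language code.
        self._fileid_by_code = {
            posixpath.splitext(posixpath.basename(fileid))[0]: fileid
            for fileid in reader.fileids()
        }

    def languages(self):
        """Return the sorted list of language codes."""
        return sorted(self._fileid_by_code)

    def entry(self, lang_code):
        """Return a list of concepts, each a list of words."""
        fileid = self._fileid_by_code[lang_code]
        return [line.split("\t") for line in self._reader.words(fileid)]


class FakeReader:
    """Stand-in for swadesh110 with made-up data."""

    _data = {
        "swadesh110/eng-000.txt": ["I", "big\tlarge"],
        "swadesh110/deu-000.txt": ["ich", "du"],
    }

    def fileids(self):
        return list(self._data)

    def words(self, fileid):
        return self._data[fileid]


by_lang = SwadeshByLanguage(FakeReader())
print(by_lang.languages())
print(by_lang.entry("eng-000"))
```

In real use, FakeReader() would be replaced by swadesh110 or swadesh207.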

@lingdoc because there are many Swadesh lists, and each is basically a list of common words, I think it was by design that the different Swadesh lists have different names.

From https://github.com/nltk/nltk/blob/develop/nltk/corpus/__init__.py#L199:

swadesh = LazyCorpusLoader(
    'swadesh', SwadeshCorpusReader, r'(?!README|\.).*', encoding='utf8')
swadesh110 = LazyCorpusLoader(
    'panlex_swadesh', SwadeshCorpusReader, r'swadesh110/.*\.txt', encoding='utf8')
swadesh207 = LazyCorpusLoader(
    'panlex_swadesh', SwadeshCorpusReader, r'swadesh207/.*\.txt', encoding='utf8')
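Note that 'panlex_swadesh' here is only the data directory name; the importable names are swadesh110 and swadesh207, which is why importing panlex_swadesh fails. The regular expressions decide which files each loader picks up. A small sketch of that selection using only the re module follows; the file paths are made up, and matching the whole relative path with re.fullmatch is an assumption about how the reader applies the pattern.

```python
import re

# Patterns copied from the loader definitions above.
SWADESH = r"(?!README|\.).*"
SWADESH110 = r"swadesh110/.*\.txt"
SWADESH207 = r"swadesh207/.*\.txt"


def select_fileids(pattern, paths):
    """Keep the paths that the pattern matches in full."""
    return [path for path in paths if re.fullmatch(pattern, path)]


# Made-up directory listings for illustration only.
swadesh_dir = ["README", ".DS_Store", "en", "de"]
panlex_dir = ["README", "swadesh110/eng-000.txt", "swadesh207/eng-000.txt"]

print(select_fileids(SWADESH, swadesh_dir))
print(select_fileids(SWADESH110, panlex_dir))
print(select_fileids(SWADESH207, panlex_dir))
```

The negative lookahead in the first pattern skips README and hidden files, while the other two each select one subdirectory of .txt files.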

SwadeshCorpusReader is a subclass of WordListCorpusReader, so it has the .words() and .entries() methods; see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordlist.py#L31

lingdoc commented 6 years ago

aha - thanks! now that you point this out it makes sense, but it's not clear from the documentation. I spent an hour or so googling, and never came across this line in "wordlist.py".