Open lingdoc opened 6 years ago
TL;DR
To access the panlex_swadesh:
from nltk.corpus import swadesh110, swadesh207
for lang in swadesh110.fileids():
    for concept in swadesh110.words(lang):
        lemmas = concept.split('\t')
The usage is similar for swadesh207.
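To make the split step concrete without needing the corpus installed, here is a minimal stdlib-only sketch. It assumes each item returned by swadesh110.words(lang) is one concept whose lemmas are tab-separated, as in the loop above. The sample strings and the split_lemmas helper are made up for illustration and are not part of nltk.

```python
# Stand-in for what swadesh110.words(lang) returns: one string per concept,
# with alternative lemmas separated by tabs. These sample strings are invented.
sample_concepts = ["I\tme", "you", "big\tlarge"]


def split_lemmas(concept):
    """Split one tab-separated concept string into a list of lemmas."""
    # Drop empty fields so a doubled tab does not produce an empty lemma.
    return [lemma for lemma in concept.split("\t") if lemma]


lemmas_per_concept = [split_lemmas(c) for c in sample_concepts]
for lemmas in lemmas_per_concept:
    print(lemmas)
```

With the real corpus, sample_concepts would be replaced by swadesh110.words(lang) for one of the fileids.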
@stevenbird maybe it would be good to have a better panlex swadesh list API, given that the fileids are currently file paths rather than actual language codes/names. It wouldn't be hard for us to add a dictionary of language codes and access the lists with something like:
from nltk.corpus import swadesh110, swadesh207
for lang_code in swadesh110.languages():  # Returns a list of language codes.
    swadesh110.lang_name(lang_code)  # Returns the language name.
    for words in swadesh110.entry(lang_code):  # Returns a list of concepts.
        print(words)  # A list of words for the specific concept.
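Until something like that exists, the dictionary can be approximated on the user side. This is only a sketch: it assumes the fileids look like 'swadesh110/<code>.txt' (which is what the loader regex suggests) and that the file name stem is usable as a language code. The example fileids and both helper names are hypothetical.

```python
import posixpath


def code_from_fileid(fileid):
    """Turn a fileid such as 'swadesh110/eng-000.txt' into 'eng-000'."""
    filename = posixpath.basename(fileid)
    code, _ext = posixpath.splitext(filename)
    return code


def index_by_code(fileids):
    """Map each derived language code to the fileid it came from."""
    return {code_from_fileid(f): f for f in fileids}


# Hypothetical fileids; the real ones would come from swadesh110.fileids().
example_fileids = ["swadesh110/eng-000.txt", "swadesh110/deu-000.txt"]
code_to_fileid = index_by_code(example_fileids)
print(code_to_fileid["eng-000"])
```

With that mapping, swadesh110.words(code_to_fileid[code]) should fetch the list for one language without having to spell out the path.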
@lingdoc because there are many Swadesh lists and they are all basically lists of common words, I think it was by design that the multiple Swadesh lists have different names.
From https://github.com/nltk/nltk/blob/develop/nltk/corpus/__init__.py#L199:
swadesh = LazyCorpusLoader(
    'swadesh', SwadeshCorpusReader, r'(?!README|\.).*', encoding='utf8')
swadesh110 = LazyCorpusLoader(
    'panlex_swadesh', SwadeshCorpusReader, r'swadesh110/.*\.txt', encoding='utf8')
swadesh207 = LazyCorpusLoader(
    'panlex_swadesh', SwadeshCorpusReader, r'swadesh207/.*\.txt', encoding='utf8')
The SwadeshCorpusReader is a subclass of the WordListCorpusReader, so it has the .words() function and the .entries() function, from https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordlist.py#L31
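To see why the fileids come out as paths rather than language codes, here is a rough sketch of how the two panlex_swadesh patterns from the loader definitions quoted above would pick files. It assumes the pattern has to match the whole path relative to the corpus root; the example paths and the select_fileids helper are invented for illustration and this is not how nltk itself is implemented.

```python
import re

# Patterns copied from the swadesh110 / swadesh207 loader definitions.
PATTERNS = {
    "swadesh110": r"swadesh110/.*\.txt",
    "swadesh207": r"swadesh207/.*\.txt",
}


def select_fileids(paths, pattern):
    """Keep only the paths that the pattern matches in full."""
    return [p for p in paths if re.fullmatch(pattern, p)]


# Hypothetical paths relative to the panlex_swadesh root directory.
example_paths = [
    "README",
    "swadesh110/eng-000.txt",
    "swadesh207/eng-000.txt",
    "swadesh207/deu-000.txt",
]

selected = {
    name: select_fileids(example_paths, pattern)
    for name, pattern in PATTERNS.items()
}
print(selected)
```

Each loader keeps the directory prefix in what it selects, which is why swadesh110.fileids() gives back paths.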
aha - thanks! now that you point this out it makes sense, but it's not clear from the documentation. I spent an hour or so googling, and never came across this line in "wordlist.py".
When I try to import the Panlex Swadesh word lists like this:
I get the following error:
I can access the data files in my nltk_data folder, and the corpus downloader says they exist and are up to date, but I can't figure out how to read them using nltk in Python. If the access method is different from other corpora, or has somehow changed, this should probably be documented somewhere.