Corpora should have the same interface

turicas commented 12 years ago

I'm writing a function that receives a corpus object but it doesn't know which corpus is it. In this function I need to get all the words in the corpus; when I try with machado or stopwords corpora it works, but not with shakespeare. The problem is that these objects does not share the same interface - and I think it's a bug. I think all corpora methods should have the same interface regardless of how the corpus is retrieved, so we can reuse code without worrying about compatibility. The problem can be illustrated with the code below (nltk.__version__ = '2.0.2'):

>>> from nltk.corpus import shakespeare, machado, stopwords
>>> temp = machado.words() #works
>>> temp2 = stopwords.words() #works
>>> temp3 = shakespeare.words() #doesn't work!
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/alvaro/.virtualenvs/bug-nltk/local/lib/python2.7/site-packages/nltk/corpus/reader/xmldocs.py", line 63, in words
    elt = self.xml(fileid)
  File "/home/alvaro/.virtualenvs/bug-nltk/local/lib/python2.7/site-packages/nltk/corpus/reader/xmldocs.py", line 44, in xml
    raise TypeError('Expected a single file identifier string')
TypeError: Expected a single file identifier string

When reading the help of each method, I found the inconsistencies on parameters:

machado.words(self, fileids=None, categories=None) (from nltk.corpus.reader.plaintext.PortugueseCategorizedPlaintextCorpusReader)
stopwords.words(self, fileids=None) (from nltk.corpus.reader.wordlist.WordListCorpusReader)
shakespeare.words(self, fileid=None) (from nltk.corpus.reader.xmldocs.XMLCorpusReader) -- even with fileid=None, it'll raise an exception if I don't pass it.

turicas commented 12 years ago

I did some tests about corpora encoding in this gist (and reported results on #299). These corpora below have some interface problem (probably with methods words or raw):

cmudict
conll2007
ieer
ipipan
nombank
ppattach
propbank
qc
senseval
switchboard
timit
toolbox
udhr
wordnet
wordnet_ic
ycoe

fcbond commented 9 years ago

You can get an iterator with all words from wordnet with wordnet.all_lemma_names() (for English). This could be use to provide wordnet.words(). Note that you can also get the lemmas for other languages with, e.g. wordnet.all_lemma_names(lang='jpn') for Japanese.

I am not sure what an appropriate value would be for wordnet.raw(), maybe " ".join(wordnet.words())?

stevenbird commented 9 years ago

I agree that text corpora (e.g. shakespeare) and wordlist corpora (e.g. cmudict) should implement words(). Others, like ppattach and toolbox, are text databases and I don't think it makes much sense for them.

stevenbird commented 9 years ago

@turicas – that's a nice analysis of our interface consistency.

turicas commented 9 years ago

@stevenbird, :) I think the idea behind it is to keep the same interface to reuse code, so if there are no cases where toolbox would be used for an analysis you've made for shakespeare then they may not be objects of the same class and don't need to share the same interface. In this case I prefer completely different class names so the difference will be explicit (Corpora and TextDatabase, for example). And for some composite-objects inheritance should be avoided in favor of mix-ins.

nltk / nltk

Corpora should have the same interface #296