turicas closed this issue 5 years ago
I ran some tests on corpora encoding in this gist (and reported the results on #299). The corpora below have some interface problem (probably with the words() or raw() methods):
cmudict
conll2007
ieer
ipipan
nombank
ppattach
propbank
qc
senseval
switchboard
timit
toolbox
udhr
wordnet
wordnet_ic
ycoe
You can get an iterator over all words from wordnet with wordnet.all_lemma_names() (for English). This could be used to provide wordnet.words(). Note that you can also get the lemmas for other languages with, e.g., wordnet.all_lemma_names(lang='jpn') for Japanese.
I am not sure what an appropriate value would be for wordnet.raw(), maybe " ".join(wordnet.words())?
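A minimal sketch of that suggestion, assuming a reader object that exposes all_lemma_names(lang=...) the way wordnet does. The WordNetWordsAdapter class is a hypothetical name used only for illustration; it is not part of NLTK, and joining lemma names with spaces is just one possible definition of raw().

```python
class WordNetWordsAdapter:
    """Hypothetical wrapper adding words() and raw() to a WordNet-like reader."""

    def __init__(self, reader):
        # Assumption: reader exposes all_lemma_names(lang=...),
        # as nltk.corpus.wordnet does.
        self._reader = reader

    def words(self, lang="eng"):
        # Materialize the iterator so the result behaves like a word list.
        return list(self._reader.all_lemma_names(lang=lang))

    def raw(self, lang="eng"):
        # One possible definition of raw(): lemma names joined by spaces.
        return " ".join(self.words(lang=lang))
```

With NLTK and the WordNet data installed, this would presumably be used as WordNetWordsAdapter(wordnet).words(), or .words(lang='jpn') for Japanese.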
I agree that text corpora (e.g. shakespeare) and wordlist corpora (e.g. cmudict) should implement words(). Others, like ppattach and toolbox, are text databases and I don't think it makes much sense for them.
@turicas – that's a nice analysis of our interface consistency.
@stevenbird, :) I think the idea behind it is to keep the same interface so that code can be reused. If there are no cases where toolbox would be used for an analysis you've written for shakespeare, then they need not be objects of the same class and don't need to share the same interface. In that case I prefer completely different class names so the difference is explicit (Corpora and TextDatabase, for example).
And for some composite objects, inheritance should be avoided in favor of mix-ins.
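The split between the two kinds of object, with a mix-in supplying words(), could look roughly like the sketch below. Corpora and TextDatabase are the names proposed above; WordsMixin, TextCorpus, and records() are hypothetical names introduced only for this illustration, and none of this reflects NLTK's actual class hierarchy.

```python
class WordsMixin:
    """Hypothetical mix-in: provides words() on top of an existing raw()."""

    def words(self):
        # Naive whitespace tokenization, purely for illustration.
        return self.raw().split()


class Corpora:
    """Hypothetical base class for text-like corpora."""

    def __init__(self, text):
        self._text = text

    def raw(self):
        return self._text


class TextCorpus(WordsMixin, Corpora):
    """A text corpus picks up words() from the mix-in."""


class TextDatabase:
    """Hypothetical record-oriented resource; deliberately has no words()."""

    def __init__(self, records):
        self._records = list(records)

    def records(self):
        # Return a copy so callers cannot mutate internal state.
        return list(self._records)
```

Because TextDatabase shares no base class with Corpora, code written against words() fails early and visibly when handed a text database, which is the explicit difference argued for above.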
I'm writing a function that receives a corpus object without knowing which corpus it is. In this function I need to get all the words in the corpus; when I try with the machado or stopwords corpora it works, but not with shakespeare. The problem is that these objects do not share the same interface, and I think that's a bug. I think all corpus methods should have the same interface regardless of how the corpus is retrieved, so we can reuse code without worrying about compatibility. The problem can be illustrated with the code below (nltk.__version__ = '2.0.2'):
When reading the help of each method, I found inconsistencies in the parameters:
machado.words(self, fileids=None, categories=None) (from nltk.corpus.reader.plaintext.PortugueseCategorizedPlaintextCorpusReader)
stopwords.words(self, fileids=None) (from nltk.corpus.reader.wordlist.WordListCorpusReader)
shakespeare.words(self, fileid=None) (from nltk.corpus.reader.xmldocs.XMLCorpusReader) -- even with fileid=None, it'll raise an exception if I don't pass it.
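Until the signatures are unified, one possible workaround is to visit the files one at a time and pass each file id positionally, so the parameter name (fileids vs. fileid) stops mattering. This is only a sketch: all_words is a hypothetical helper, it is written for Python 3 rather than the Python 2 of NLTK 2.0.2, and it assumes every reader provides fileids() and accepts a single file id as the first argument of words().

```python
def all_words(corpus):
    """Yield every word in a corpus by visiting its files one at a time."""
    for fileid in corpus.fileids():
        # Positional argument: works whether the reader names the
        # parameter `fileids` or `fileid`.
        yield from corpus.words(fileid)
```

Under those assumptions, list(all_words(shakespeare)) and list(all_words(stopwords)) would go through the same code path.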