Matthew Jockers makes a convincing argument for removing proper nouns, and indeed all words other than nouns, from a corpus before running a topic model. In our data set this is relatively simple: you do not need a POS tagger (part-of-speech tagger), because the tags are already there. The part-of-speech tag is always the last element of a lemmatized word, for instance Assurbanipal[1]RN (where RN means royal name), bītu[house]N (N = noun), or akālu[eat]V (V = verb). To keep only nouns in a Document Term Matrix, select those column names that end in ]N; nothing more is needed. A proper-noun tag such as RN ends in ]RN rather than ]N, so the same selection drops it as well.
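The selection by column name can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the actual data set: the helper `noun_columns`, the toy matrix `dtm`, and its counts are all made up for the example, and the matrix is represented as a plain dictionary of columns so that the snippet runs without extra libraries.

```python
def noun_columns(columns):
    """Return the column names whose part-of-speech tag is N (noun)."""
    # The tag follows the closing bracket, so a noun lemma ends in "]N".
    # A tag such as RN ends in "]RN" and therefore does not match.
    return [name for name in columns if name.endswith("]N")]


# Toy document-term matrix (hypothetical counts): column name -> count per document.
dtm = {
    "Assurbanipal[1]RN": [2, 0],
    "bītu[house]N": [1, 3],
    "akālu[eat]V": [0, 1],
}

# Keep only the noun columns.
nouns_only = {name: dtm[name] for name in noun_columns(dtm)}
print(list(nouns_only))
```

If the Document Term Matrix is a pandas DataFrame, the same idea should carry over by passing `dtm.columns` to the helper and indexing the DataFrame with the resulting list of names.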