niekveldhuis / Digital-Assyriology

Tools and Examples for Computational Text Analysis for Assyriologists.

topic modeling and pos-tags #12

Closed niekveldhuis closed 8 years ago

niekveldhuis commented 8 years ago

Matthew Jockers makes a convincing argument for keeping only nouns in a corpus before running a topic model, that is, for removing Proper Nouns as well as all words that are not nouns. In our data set this is relatively simple: you do not need a POS tagger (part-of-speech tagger), because those tags are already there. The part-of-speech tag is always the last element of a lemmatized word, for instance Assurbanipal[1]RN (RN = royal name), bītu[house]N (N = noun), or akālu[eat]V (V = verb). To select only nouns from a Document Term Matrix, select the column names that end in ]N, and that is all. Because the test includes the closing bracket, a Proper Noun tag such as ]RN does not match, so Proper Nouns are dropped together with verbs and other parts of speech.
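
The selection described above can be sketched in a few lines of plain Python. This is only an illustration: the toy matrix, its counts, and the names `dtm`, `is_noun`, and `dtm_nouns` are made up for the example and are not taken from the repository.

```python
# Toy Document Term Matrix (hypothetical counts): each key is a lemmatized
# word used as a column name, each value holds the counts per document.
dtm = {
    "Assurbanipal[1]RN": [2, 0],   # RN = royal name (a Proper Noun)
    "bītu[house]N": [1, 3],        # N = noun
    "akālu[eat]V": [0, 1],         # V = verb
}


def is_noun(column_name):
    """Return True if the lemma's part-of-speech tag is exactly N.

    The closing bracket is part of the test, so tags that merely end
    in the letter N (such as ]RN) do not match.
    """
    return column_name.endswith("]N")


# Keep only the columns whose name ends in "]N".
dtm_nouns = {name: counts for name, counts in dtm.items() if is_noun(name)}

print(list(dtm_nouns))
```

If the matrix is held in a pandas DataFrame instead (an assumption about the setup), the same test can be written as `dtm.loc[:, dtm.columns.str.endswith("]N")]`.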