Second, by showing only the ten or so highest-weight words for each topic, such presentations neglect most of the words that contribute to the topics’ roles in representing the corpus documents. For example, in the 200-topic model that we constructed from 665 non-fiction English-language books read by Charles Darwin between 1837 and 1860 (Murdock, Allen, and DeDeo 2017), typically 500–600 words are required to account for 50% of the probability mass of any given topic. Looking only at the first ten or twenty words may therefore provide little understanding of why a topic has been assigned a high weight for a given document.
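To make the 50% probability-mass figure concrete, the following is a minimal sketch of how such a count could be computed for one topic. It is not the procedure used to build the Darwin model: the function name `words_needed_for_mass` is hypothetical, and the toy topic is a synthetic Zipf-like distribution over an assumed 5,000-word vocabulary, chosen only to illustrate how little mass the top ten words may carry.

```python
def words_needed_for_mass(topic_word_probs, mass=0.5):
    """Count how many of the highest-probability words are needed
    for their cumulative probability to reach `mass`."""
    if not 0.0 < mass <= 1.0:
        raise ValueError("mass must be in (0, 1]")
    cumulative = 0.0
    # Walk the words from most to least probable, accumulating mass.
    for count, p in enumerate(sorted(topic_word_probs, reverse=True), start=1):
        cumulative += p
        if cumulative >= mass:
            return count
    # Reached only if the probabilities sum to less than `mass`.
    return len(topic_word_probs)


# Synthetic topic (an assumption for illustration, not real model output):
# word probabilities proportional to 1/rank over a 5,000-word vocabulary.
vocab_size = 5000
weights = [1.0 / rank for rank in range(1, vocab_size + 1)]
total = sum(weights)
toy_topic = [w / total for w in weights]

top10_mass = sum(sorted(toy_topic, reverse=True)[:10])
n_half = words_needed_for_mass(toy_topic, 0.5)
print(f"mass of top 10 words: {top10_mass:.3f}; words needed for 50%: {n_half}")
```

Applied to each row of a fitted topic-word matrix, a count of this kind gives a rough indication of how much of a topic a top-ten word list leaves out.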
Following Allen & Murdock (2020):