MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
I think there's an issue with assigning initial topics. On line 248, the loop bound is the size of the topics array, but it should be the length of the token sequence. Because the topics array has a minimum capacity (currently 2), at least 2 topics are always added, even if the document has fewer than 2 tokens.
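To make the report concrete, here is a minimal sketch of the pattern being described. It is not the actual MALLET source: the class name, method names, and the `MIN_CAPACITY` constant are hypothetical stand-ins, and the only details taken from the report are the minimum capacity of 2 and the choice of loop bound.

```java
import java.util.Random;

public class InitialTopicsSketch {
    // Assumption: mirrors the minimum capacity of 2 mentioned in the report.
    static final int MIN_CAPACITY = 2;

    // Hypothetical allocation: the backing array is never smaller than MIN_CAPACITY.
    static int[] allocateTopics(int numTokens) {
        return new int[Math.max(MIN_CAPACITY, numTokens)];
    }

    // Reported pattern: the loop is bounded by the array size,
    // so padding slots also receive a topic.
    static int countAssignedBuggy(int numTokens, int numTopics, Random random) {
        int[] topics = allocateTopics(numTokens);
        int assigned = 0;
        for (int position = 0; position < topics.length; position++) {
            topics[position] = random.nextInt(numTopics);
            assigned++;
        }
        return assigned;
    }

    // Suggested pattern: the loop is bounded by the token sequence length.
    static int countAssignedFixed(int numTokens, int numTopics, Random random) {
        int[] topics = allocateTopics(numTokens);
        int assigned = 0;
        for (int position = 0; position < numTokens; position++) {
            topics[position] = random.nextInt(numTopics);
            assigned++;
        }
        return assigned;
    }

    public static void main(String[] args) {
        Random random = new Random(0);
        // An empty document still gets 2 topic assignments with the array-size bound.
        System.out.println(countAssignedBuggy(0, 10, random)); // prints 2
        System.out.println(countAssignedFixed(0, 10, random)); // prints 0
    }
}
```

For documents with 2 or more tokens the two versions agree in this sketch, which may be why the extra assignments are easy to miss.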
Thank you for spotting this! I fixed this and a few other instances in the topic model. I'm not sure the extra topics were ever being used for anything except reports, but I'm glad this is fixed.