Open DoraShao opened 8 years ago
When you import documents, make sure to specify the --token-regex option, using the same pattern you specified in the Java code; that option controls the same argument you changed. Also make sure you are using the most recent code, since the default pattern should support non-Latin characters.
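As a sketch of what this suggestion looks like end to end (the directory and file names here are illustrative, and this assumes the standard `bin/mallet` launcher):

```shell
# Import with a Unicode-aware token regex (the same pattern used in the
# Java pipe), so CJK characters are kept at import time.
bin/mallet import-dir \
    --input zh_docs/ \
    --output zh_docs.mallet \
    --keep-sequence \
    --token-regex '\p{L}[\p{L}\p{P}]+\p{L}'

# Then train as before; the topic keys should now include Chinese tokens.
bin/mallet train-topics \
    --input zh_docs.mallet \
    --output-topic-keys topic_keys.txt
```

The key point is that tokenization happens during import, so changing the regex only at training time has no effect.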
Hi, I am trying to use topic modeling on some segmented Chinese documents. When I ran the topic-modeling example code from the Mallet developer guide page, I found that many Chinese words were ignored/escaped: when I deleted from a document all the Chinese words the program had designated as keywords, the program would not produce any new keywords in the next run, even though the rest of the document still had plenty of content. I replaced the regex with a different one in this line:
pipeList.add( new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")) );
and it resolved the issue (the program kept producing new keywords after the old ones were deleted). However, when I ran topic modeling from the Windows command line (I used `train-topics` with the `--output-topic-keys` option), I found that all the Chinese characters in the input files were ignored; only the English words and abbreviations were selected as keywords. I also tried French and Japanese documents: the same thing happened with the Japanese ones, while the French ones were fine. Note that I am not doing polylingual topic modeling; it's just that the Chinese documents I am working with occasionally contain English abbreviations.
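To see why the regex change matters, here is a small self-contained sketch using only `java.util.regex` (no Mallet dependency). The `\w+` pattern below is a stand-in for a Latin-biased tokenizer, not Mallet's actual internal default; Java's `\w` is ASCII-only unless the `UNICODE_CHARACTER_CLASS` flag is set, while `\p{L}` matches letters in any script:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenRegexDemo {
    // The Unicode-aware pattern from the Java pipe above. Note it requires
    // at least three characters per token, so two-character Chinese words
    // (common after segmentation) are still dropped by this pattern.
    static final Pattern UNICODE_TOKENS =
            Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}");

    // ASCII-biased stand-in: \w defaults to [a-zA-Z0-9_] in Java.
    static final Pattern ASCII_TOKENS = Pattern.compile("\\w+");

    static List<String> tokenize(Pattern pattern, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A segmented Chinese sentence containing an English abbreviation.
        String text = "自然语言 NLP 词汇表";
        System.out.println(tokenize(ASCII_TOKENS, text));   // only NLP
        System.out.println(tokenize(UNICODE_TOKENS, text)); // Chinese kept
    }
}
```

Running this shows the ASCII-biased pattern keeping only `NLP`, while the Unicode-aware pattern also keeps the Chinese tokens, which matches the behavior described in the question.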