Non-English characters in input ignored when running topic model from the command line

Hi, I am trying to use topic modeling on some segmented Chinese documents. When I ran the topic modeling example code from the Mallet developer guide page, I found that a lot of Chinese words were ignored/escaped. For example, when I deleted from the documents all the Chinese words the program had reported as keywords, the next run produced no new keywords, even though the remaining text still had plenty of content. I replaced the regex with a different one in this line: pipeList.add( new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")) ); and that resolved the issue (the program now keeps producing new keywords after the old ones are deleted).
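For illustration, here is a minimal sketch of why a pattern of this shape can skip short segmented words. It uses only java.util.regex, not Mallet itself. The pattern quoted above needs a letter, then at least one letter or punctuation mark, then another letter, so every match is at least three characters long, and one- or two-character Chinese words cannot match. The class name TokenRegexDemo, the helper tokenize, and the alternative pattern [\p{L}\p{M}]+ are assumptions made for this example and are not necessarily the replacement used here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it does not depend on Mallet.
public class TokenRegexDemo {
    // The pattern quoted in the post: letter, one or more letters or
    // punctuation marks, letter. Every match has at least 3 characters.
    static final Pattern QUOTED = Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}");

    // Assumed alternative: any run of Unicode letters and combining marks,
    // so one- and two-character words are kept as well.
    static final Pattern LETTER_RUNS = Pattern.compile("[\\p{L}\\p{M}]+");

    // Collect every non-overlapping match of the pattern, left to right.
    static List<String> tokenize(String text, Pattern pattern) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Space-segmented sample: two 2-character words ("we", "use"),
        // the abbreviation LDA, and one 4-character word ("topic model").
        // Unicode escapes keep the source file ASCII-only.
        String text = "\u6211\u4EEC \u4F7F\u7528 LDA \u4E3B\u9898\u6A21\u578B";

        System.out.println(tokenize(text, QUOTED).size());      // tokens kept by the quoted pattern
        System.out.println(tokenize(text, LETTER_RUNS).size()); // tokens kept by the assumed alternative
    }
}
```

With this sample, the quoted pattern keeps only LDA and the four-character word, while the letter-run pattern keeps all four words. That would be consistent with English abbreviations surviving while most short Chinese words disappear, though it may not be the only cause.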

However, when I ran topic modeling from the Windows command line (I used train-topics with the --output-topic-keys option), I found that all the Chinese characters in the input files were ignored; only the English words and abbreviations were selected as keywords. I also tried French and Japanese documents. The same thing happened with the Japanese documents, while the French ones were fine.

Note that I am not doing polylingual document modeling; it's just that the Chinese documents I am working with occasionally contain English abbreviations.

mimno / Mallet #79