mimno / Mallet

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
https://mimno.github.io/Mallet/

Non-English characters in input ignored when running topic model from the command line #79

Open DoraShao opened 8 years ago

DoraShao commented 8 years ago

Hi, I am trying to run topic modeling on some segmented Chinese documents. When I ran the topic modeling example code from the MALLET developer guide page, I found that many Chinese words were ignored or skipped. I confirmed this by deleting from the documents every Chinese word the program had reported as a keyword: although plenty of content remained, the program produced no new keywords on the next run. I replaced the regex in the following line with a different one: pipeList.add( new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")) ); and that resolved the issue (the program now keeps producing new keywords after the old ones are deleted).

However, when I ran topic modeling from the Windows command line (I used train-topics with the --output-topic-keys option), I found that all the Chinese characters in the input files were ignored; only the English words and abbreviations were selected as keywords. I also tried French and Japanese documents. The same thing happened with the Japanese documents, while the French ones were fine.

Note that I am not doing polylingual document modeling; it's just that the Chinese documents I am working with occasionally contain English abbreviations.

mimno commented 8 years ago

When you import the documents, make sure to specify the --token-regex option, using the same pattern you specified in the Java code. This option sets the same argument that you changed. Also make sure you are using the most recent code, since the default pattern should support non-Latin characters.
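
To illustrate why the token pattern matters for segmented Chinese text, here is a minimal sketch that uses only java.util.regex and does not depend on MALLET. It compares the pattern from the developer guide line quoted above with an illustrative alternative, [\p{L}\p{M}]+ (any run of letters and combining marks). The class name TokenRegexDemo, the tokenize helper, the alternative pattern, and the sample sentence are all assumptions made for this example; the alternative is not necessarily the pattern used in the original fix. The sample text is five space-separated tokens: four two-character Chinese words and the abbreviation LDA, written with Unicode escapes so the source compiles regardless of file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenRegexDemo {
    // Pattern from the developer guide line quoted in this thread:
    // a letter, then one or more letters or punctuation marks, then a letter.
    // Every match is therefore at least three characters long.
    static final Pattern GUIDE_PATTERN =
            Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}");

    // Illustrative alternative (an assumption, not taken from the thread):
    // any run of one or more letters or combining marks.
    static final Pattern LETTER_RUN_PATTERN =
            Pattern.compile("[\\p{L}\\p{M}]+");

    // Collect every non-overlapping match of the pattern, in order.
    static List<String> tokenize(Pattern pattern, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical segmented sentence: four two-character Chinese words
        // plus the Latin abbreviation "LDA", separated by spaces.
        String text = "\u4e3b\u9898 \u6a21\u578b \u4f7f\u7528 LDA \u7b97\u6cd5";

        // Two-character words are too short for the guide pattern,
        // so only the abbreviation is matched here.
        System.out.println("guide pattern tokens:      "
                + tokenize(GUIDE_PATTERN, text).size());

        // The letter-run pattern matches all five tokens.
        System.out.println("letter-run pattern tokens: "
                + tokenize(LETTER_RUN_PATTERN, text).size());
    }
}
```

With the guide pattern only LDA survives, which resembles the symptom described above, although a different default pattern or an encoding problem on the command line could produce the same result. Note that the doubled backslashes are Java string escaping: the value passed to --token-regex is the regex itself, with single backslashes.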