Only English words in the topics

jamesknox / topic-modeling-tool

Automatically exported from code.google.com/p/topic-modeling-tool


Only English words in the topics #3

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. Use Russian-language texts containing several English words, in UTF-8 *.txt format, as input.
2. The resulting topics contain only English words and no Russian ones.

What version of the product are you using? On what operating system?
I used the latest version from the site on 32-bit Windows XP.

I've attached the text files in the archive.

Original issue reported on code.google.com by Semenoff...@gmail.com on 25 Nov 2011 at 4:59

Attachments:

TXT.zip

GoogleCodeExporter commented 8 years ago

Same problem with French, where words end up split on accented characters. I've 
tried switching the file to UTF-8, but it makes no difference. I'm running this 
on Mac OS X.

Original comment by patrick....@gmail.com on 17 Aug 2012 at 4:33

GoogleCodeExporter commented 8 years ago

I have the same problem in Persian.

Original comment by afshinra...@gmail.com on 16 Dec 2012 at 3:21

GoogleCodeExporter commented 8 years ago

Hello All,

I tried to run the code in Eclipse to understand how it works, so that it can be 
extended to hLDA. But I got an error saying package.cs.mallet.gui is not 
found. Kindly advise on how to import the project into an IDE (Eclipse or 
NetBeans) and run it successfully.

Original comment by abiodunm...@gmail.com on 7 Apr 2013 at 4:40

GoogleCodeExporter commented 8 years ago

I have the same problem in Greek: only the English words appear in the topics. 
With the command-line MALLET installation I can pass --token-regex "[\p{L}\p{M}]+" 
and then read UTF-8 Greek. Is there a tokenization option here?

Original comment by gmik...@gmail.com on 19 Jan 2014 at 1:15
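
For readers following the thread, here is a minimal Java sketch of why the token regex matters. It is not the tool's actual code: the class TokenRegexDemo, its tokenize helper, and the ASCII-only pattern "[A-Za-z]+" are hypothetical and used only for contrast. The second pattern is the one quoted in the comment above. Greek text is written with Unicode escapes so the example does not depend on the source file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of MALLET or topic-modeling-tool.
class TokenRegexDemo {
    // Collect every substring of text that matches tokenRegex, in order.
    static List<String> tokenize(String text, String tokenRegex) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(tokenRegex).matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Two Greek words with the English word "topic" between them.
        String text = "\u03B8\u03AD\u03BC\u03B1 topic "
                + "\u03BC\u03BF\u03BD\u03C4\u03AD\u03BB\u03BF";

        // Assumed ASCII-only pattern: only the English word survives.
        System.out.println(tokenize(text, "[A-Za-z]+"));

        // Unicode letters plus combining marks: all three words survive.
        System.out.println(tokenize(text, "[\\p{L}\\p{M}]+"));
    }
}
```

With an ASCII-only pattern the Greek words produce no tokens at all, which would match the symptom reported in this issue. Including \p{M} also keeps words intact when accents are stored as separate combining characters.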

GoogleCodeExporter commented 8 years ago

Hi everyone,

I have recompiled a .jar file with a hardcoded --token-regex 
"[\\p{L}\\p{P}]*\\p{L}" option, which works well with French.

It's available at https://github.com/ulbstic/topic-modeling-tool-FR .

Have a great day,

Simon

Original comment by simon.he...@gmail.com on 29 May 2015 at 12:32
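
As a rough illustration of the regex quoted in the comment above, the following Java sketch applies it to a short French phrase. The doubled backslashes in the comment are Java string-literal escaping for the pattern [\p{L}\p{P}]*\p{L}. The class FrenchTokenDemo, its tokenize helper, and the ASCII-only contrast pattern are hypothetical and are not taken from the recompiled tool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not the code shipped in the recompiled .jar.
class FrenchTokenDemo {
    // Collect every substring of text that matches tokenRegex, in order.
    static List<String> tokenize(String text, String tokenRegex) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(tokenRegex).matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "l'ecole d'ete." with e-acute written as \u00E9.
        String text = "l'\u00E9cole d'\u00E9t\u00E9.";

        // Assumed ASCII-only pattern: words break at each accented letter.
        System.out.println(tokenize(text, "[A-Za-z]+"));

        // Letters and punctuation, ending in a letter: elided forms stay
        // whole and the trailing full stop is dropped.
        System.out.println(tokenize(text, "[\\p{L}\\p{P}]*\\p{L}"));
    }
}
```

One trade-off of this pattern is that punctuation at the start of a word, such as an opening quotation mark, is kept as part of the token, because only the final character is required to be a letter.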