Open GoogleCodeExporter opened 8 years ago
Same problem with French, which ends up split on accented characters. I've
tried switching the file to UTF-8, but it makes no difference. I'm running this
on Mac OS X.
Original comment by patrick....@gmail.com
on 17 Aug 2012 at 4:33
I have the same problem in Persian.
Original comment by afshinra...@gmail.com
on 16 Dec 2012 at 3:21
Hello all,
I tried to run the code in Eclipse to understand how it works, so that it can
be extended to hLDA. However, I got an error saying package.cs.mallet.gui is
not found. Could you explain how to import the project into an IDE such as
Eclipse or NetBeans and run it successfully?
Original comment by abiodunm...@gmail.com
on 7 Apr 2013 at 4:40
I have the same problem in Greek: only the English words appear in the topics.
With the command-line MALLET installation I can pass --token-regex
"[\p{L}\p{M}]+" and then read UTF-8 Greek. Is there a tokenization option here?
Original comment by gmik...@gmail.com
on 19 Jan 2014 at 1:15
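To illustrate why the --token-regex value quoted above helps, here is a minimal Java sketch comparing it with an ASCII-only pattern. The class name, the sample text, and the ASCII pattern are assumptions made for the illustration; in particular, the ASCII pattern is not claimed to be the default used by this tool or by MALLET. The sketch also assumes tokens are taken as successive regex matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of MALLET or the topic modeling tool.
class TokenRegexDemo {
    // Pattern quoted in the comment above: runs of Unicode letters and combining marks.
    static final Pattern UNICODE_TOKENS = Pattern.compile("[\\p{L}\\p{M}]+");
    // ASCII-only pattern, used purely as a contrast (an assumption, not the tool's default).
    static final Pattern ASCII_TOKENS = Pattern.compile("[A-Za-z]+");

    // Collect every successive match of the pattern as one token.
    static List<String> tokenize(Pattern pattern, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "café θέμα topic", written with Unicode escapes so the source
        // compiles the same way regardless of the file encoding.
        String text = "caf\u00e9 \u03b8\u03ad\u03bc\u03b1 topic";
        // The ASCII pattern cuts "café" short and drops the Greek word.
        System.out.println(tokenize(ASCII_TOKENS, text));
        // The Unicode pattern keeps all three words whole.
        System.out.println(tokenize(UNICODE_TOKENS, text));
    }
}
```

With the ASCII pattern the tokens are "caf" and "topic"; with the Unicode pattern the accented French word and the Greek word survive intact, which matches the symptoms reported in this thread.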
Hi everyone,
I have recompiled a .jar file with a hardcoded --token-regex
"[\\p{L}\\p{P}]*\\p{L}" option, which works well with French.
It's available at https://github.com/ulbstic/topic-modeling-tool-FR .
Have a great day,
Simon
Original comment by simon.he...@gmail.com
on 29 May 2015 at 12:32
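As a rough illustration of what the hardcoded pattern above matches, the following Java sketch applies it to a short French phrase. The class name and sample text are hypothetical, and the sketch assumes tokens are taken as successive regex matches; it does not reproduce the recompiled .jar itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not taken from the recompiled .jar.
class FrenchTokenRegexDemo {
    // Pattern from the comment above: any run of letters and punctuation
    // that ends in a letter, so trailing punctuation is excluded.
    static final Pattern FRENCH_TOKENS = Pattern.compile("[\\p{L}\\p{P}]*\\p{L}");

    // Collect every successive match of the pattern as one token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = FRENCH_TOKENS.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "l'école, aujourd'hui: peut-être." written with Unicode escapes
        // so the source does not depend on the file encoding.
        String text = "l'\u00e9cole, aujourd'hui: peut-\u00eatre.";
        // Apostrophes and hyphens inside a word are kept; the trailing
        // comma, colon and full stop are left out.
        System.out.println(tokenize(text));
    }
}
```

One consequence of this pattern is that leading punctuation stays attached to the token: "(oui)" yields "(oui". Whether that matters depends on the corpus.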
Original issue reported on code.google.com by
Semenoff...@gmail.com
on 25 Nov 2011 at 4:59