rileypq / topic-modeling-tool

Automatically exported from code.google.com/p/topic-modeling-tool

Error with character encoding for UTF-8 files #7

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Run TMT on UTF-8 texts that contain words with accented characters, 
such as "é" or "à" (for example, texts in French).
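
To make the step above easier to repeat, here is a minimal sketch that writes a tiny UTF-8 test corpus to a temporary directory. The file names and sample sentences are hypothetical and were not part of the original report; any French text with accented letters should do. The resulting directory can then be selected as the input directory in TMT.

```python
import tempfile
from pathlib import Path

# Hypothetical sample documents (not from the original report).
# Escapes are used so the accents survive copy and paste:
# \u00e9 = é, \u00ea = ê, \u00e0 = à
SAMPLES = {
    "doc1.txt": "Le jardin priv\u00e9 \u00e9tait pr\u00eat.",
    "doc2.txt": "Elle \u00e9tait d\u00e9j\u00e0 l\u00e0.",
}


def write_utf8_corpus(directory):
    """Write each sample as a UTF-8 encoded text file and return the paths."""
    paths = []
    for name, text in SAMPLES.items():
        path = Path(directory) / name
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


corpus_dir = tempfile.mkdtemp(prefix="tmt_utf8_")
corpus_paths = write_utf8_corpus(corpus_dir)
print(corpus_dir)
```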

What is the expected output? What do you see instead?
- I would expect the topic words to include words with accented letters. 
Instead, such words are cut off at the accented character: "privé" becomes 
"priv", "était" becomes "tait", and "prêt" becomes "pr" (without the final 
"t").
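
The truncation pattern described above is what one would see if tokens were matched with an ASCII-only letter pattern. This is only a guess at the cause and is not confirmed by anything in this report. The sketch below reproduces the symptom in Python under that assumption; both regular expressions and the minimum token length are hypothetical and are not taken from the TMT or Mallet source.

```python
import re
import unicodedata

# Hypothetical patterns for illustration only.
ASCII_TOKEN = re.compile(r"[A-Za-z]+")      # ASCII letters only
UNICODE_TOKEN = re.compile(r"[^\W\d_]+")    # any Unicode letter


def tokenize(text, pattern, min_len=2):
    """Return all pattern matches that are at least min_len characters long."""
    # Normalize to composed form so "é" is a single code point.
    text = unicodedata.normalize("NFC", text)
    return [tok for tok in pattern.findall(text) if len(tok) >= min_len]


# \u00e9 = é, \u00ea = ê
sample = "priv\u00e9 \u00e9tait pr\u00eat"

ascii_tokens = tokenize(sample, ASCII_TOKEN)
unicode_tokens = tokenize(sample, UNICODE_TOKEN)

print(ascii_tokens)    # words split at the accented characters
print(unicode_tokens)  # words kept whole
```

With the ASCII-only pattern, "prêt" splits into "pr" and "t", and the single-letter "t" is dropped by the assumed minimum length, which would match the missing final "t" noted above.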

What version of the product are you using? On what operating system?
- I'm using the latest version of TMT on Ubuntu 13.10. 
- Note that the same texts are processed correctly when I run Mallet directly.

Original issue reported on code.google.com by C.Schoech@gmail.com on 9 Dec 2013 at 4:53