Problems with UTF-8 support for Windows

shawnanctil commented 7 years ago

I may have missed something, but I have a series of files that are encoded as UTF-8, yet when I run the tool I get all sorts of ASCII characters in my topics (e.g. "â"). I'm wondering if there's a stage in the processing where files are converted to ASCII and then not re-encoded? I could be way off base with this question.

That said, I have gone through my files, ensured they are UTF-8, and done a find and replace for "â" in all the files. If you've come across this issue before I would love to know how you resolved it. At this point I'm thinking of creating an elaborate stop list that excludes common ASCII characters.

senderle commented 7 years ago

Encodings have been a nightmare since I started using the tool, and this is one of the main reasons I started modifying it. In recent tests, I haven't been seeing what you describe in the tool itself -- for example, a recent dataset I saw had some Devanagari text that became its own topic, and it rendered nicely -- but Excel often botches the CSV import, so that those weird ASCII characters pop up. Are you viewing the output in Excel by any chance?

Also, Java's default behavior is to expect files with a text encoding based on your locale. That default is hard to override, but not impossible. At some point, I'm going to do the annoying work of making it overridable. In the meanwhile, I'm assuming that people will have their locale configured as necessary. You might double-check that your locale is set to UTF-8.
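A minimal sketch of how a locale-based default can produce a stray "â", assuming the source files contain curly apostrophes (U+2019) and are being decoded with a Windows code page. That is one plausible reading of the symptom, not a confirmed diagnosis. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingCheck {
    // Encode text as UTF-8, then decode those bytes with a different charset,
    // which is what happens when a UTF-8 file is read with the wrong default.
    static String misdecode(String text, Charset wrong) {
        return new String(text.getBytes(StandardCharsets.UTF_8), wrong);
    }

    public static void main(String[] args) {
        // The charset Java falls back to when none is given explicitly.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // U+2019 is the bytes E2 80 99 in UTF-8. In windows-1252 those
        // three bytes decode as 'â', '€' and '™', so a tokenizer that keeps
        // only letters would be left with a lone "â".
        String garbled = misdecode("don\u2019t", Charset.forName("windows-1252"));
        System.out.println(garbled);
    }
}
```

If the default charset printed here is not UTF-8, any code path that opens files without naming a charset will decode UTF-8 input this way.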

Of course it might still be the tool itself. If you could attach some sample data I could test it out and see what happens on my machine? At least then we'd know whether it's system-dependent or a problem in the deeper logic of the software.

shawnanctil commented 7 years ago

Thanks for the response. I am viewing things in Notepad++, and the ASCII characters are showing up in the tool itself.

As far as I can tell I'm running on UTF-8. Also, all of my files are UTF-8. (For this second step I manually saved the files as UTF-8.) I'm sending you an email.

senderle commented 7 years ago

Thanks! I'll continue the conversation here to ensure it's all documented for future visitors. One thing that I noticed is that the stoplist file and the metadata file are both still ASCII formatted (according to the unix file command). I'll run the data with them as they are now, and then convert them and see what changes.
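For context, a file that contains only ASCII bytes is also valid UTF-8, so a report of "ASCII" only means that the file holds no non-ASCII characters. The sketch below shows one way to check both properties in Java; the class and method names are hypothetical and are not part of the tool.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // True if every byte is in the 7-bit ASCII range (0x00 to 0x7F).
    static boolean isPureAscii(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) {          // bytes >= 0x80 are negative in Java
                return false;
            }
        }
        return true;
    }

    // True if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Reading a file with `java.nio.file.Files.readAllBytes` and passing the result to both methods would show whether it is plain ASCII, UTF-8 with non-ASCII content, or something else.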

I forgot to ask before -- Windows? If so, and this turns out to be a locale problem, these instructions might help. But I'm not very familiar with Windows!

senderle commented 7 years ago

Ah, also, I notice that the stoplist isn't in the right format. I'll need to document that. It should have one word per line, rather than a single comma-separated list. That's probably the problem there.
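A rough sketch of converting a comma-separated stoplist into the one-word-per-line layout described above. The class name and the use of command-line arguments for the input and output paths are assumptions made for illustration; both files are read and written explicitly as UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class StoplistFormat {
    // Split on commas, trim surrounding whitespace, and drop empty entries.
    static List<String> toLines(String commaSeparated) {
        return Arrays.stream(commaSeparated.split(","))
                .map(String::trim)
                .filter(word -> !word.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        Path input = Paths.get(args[0]);   // comma-separated stoplist
        Path output = Paths.get(args[1]);  // one word per line
        String raw = new String(Files.readAllBytes(input), StandardCharsets.UTF_8);
        Files.write(output, toLines(raw), StandardCharsets.UTF_8);
    }
}
```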

senderle commented 7 years ago

Update: Although we've addressed several of the things discussed above, the core issue remains: the tool handles Unicode text correctly on Mac but, at least sometimes, not on Windows. This issue may be replaced by several more precise issues later, once we know what's going on.

senderle commented 7 years ago

After some additional research, it seems this might be related to this problem. To be tested.

shawnanctil commented 7 years ago

Regarding changing the region on Windows, it's currently set to English (Canada) and this is the output I get in the console:

0 0.23693 signal processing goldstein computer digital control systems filter image called filters theory work people university applications research system analysis adaptive
1 0.12487 nebeker â engineering research university maclennan pedersen goetzberger thoren high work bosch fettweis power physics fung war design important electrical
2 0.1405 switching aspray joel system bell people telephone things systems penzias lof thing lot kind laboratories don put labs company ve
3 0.11502 â kata women engineering 00 yeah swe lucietto laughs johnson kind engineers didn engineer harness anne fletcher school ledo don
4 0.1849 hochheiser radar lab geselowitz program system air group military equipment silver â mccomas goldstein bob force power vester electronics systems
5 0.02955 wilson swent utah mining mine company construction australia coal time long langer bhp san business government board guess didn francisco
6 0.13426 â kind robotics burnett robots robot yeah ve gibbs sort computer stuff lab vision things called stanford rare brian system
7 0.16895 ieee hochheiser society president board vardalas committee â time stern finn teare sell engineering activities year meetings ire director organization
8 2.61317 people time things work lot don good didn thing years back worked ve started make big working put company interesting
9 0.09709 â abbate women computer laughs didn ann janet hardy program hersom computing cooper don ibm programming data worked wasn math

Regarding your last comment from 2 days ago, should I redownload the .jar file? Or try to make the changes myself?

senderle commented 7 years ago

@shawnanctil -- I've had a couple of deadlines and have had to set work on this aside for a couple of days, so there haven't been any updates. I'll let you know when I've made the changes. I think I will be able to get to them soon.

One thing to try in the meantime would be to run MALLET itself. I set the console up to display the MALLET command you'd need to run to duplicate the results; you should be able to just copy and paste it. The output won't have metadata, but it might be a useful debugging exercise, and you might also find that you don't need the metadata output and that this gets you what you need right now.

(Also, obviously if you can figure out how to fix it yourself and submit a PR, great! But I'm not even sure I know how to fix it yet...)

senderle commented 7 years ago

@shawnanctil I think I've fixed the stoplist issue. The fix is in the new jar build (but not the zip build). The Unicode problem is more vexing, and I'm afraid I can't do anything with it until after the weekend. In the meantime, I guess your best bet is to use a Mac? Many apologies -- I know this is probably frustrating. (It certainly is for me -- this is why I don't use Windows anymore!)

senderle commented 7 years ago

This appears to be a problem with MALLET itself as well. Hopefully, if there's a system-wide fix, it will not require us to read and rewrite all the MALLET output files in a different encoding!

senderle commented 7 years ago

@shawnanctil In a conversation with @scottkleinman we figured out that you can set a system-wide Java default encoding with this flag:

-Dfile.encoding=UTF8

The flag can be passed as a command-line argument like so:

java -Dfile.encoding=UTF8 -jar TopicModelingTool.jar

Or it can be set as a Windows environment variable:

JAVA_TOOL_OPTIONS="-Dfile.encoding=UTF8"

If that variable is already being used you'll need to append this with whitespace, following the rules described here.

This is just a quick fix, but @scottkleinman reports getting correct UTF-8 text after making this change. Hope it helps!
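To confirm that the flag or the environment variable actually took effect, a small check like the following can be run with the same settings. This is only a sketch: the class name is hypothetical, and it reports what the JVM sees rather than fixing anything.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultEncodingReport {
    // True if the given charset is UTF-8.
    static boolean isUtf8(Charset charset) {
        return StandardCharsets.UTF_8.equals(charset);
    }

    public static void main(String[] args) {
        Charset active = Charset.defaultCharset();
        // The system property set by -Dfile.encoding=...
        System.out.println("file.encoding property: " + System.getProperty("file.encoding"));
        // The charset Java uses when code does not name one explicitly.
        System.out.println("default charset:        " + active);
        System.out.println(isUtf8(active)
                ? "Default charset is UTF-8."
                : "Default charset is NOT UTF-8; files read without an explicit charset may be garbled.");
    }
}
```

Running it once with and once without `-Dfile.encoding=UTF8` should show whether the setting changes the default on a given machine. On JDK 18 and later the default charset is already UTF-8 unless it is overridden.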

senderle commented 7 years ago

This is as fixed as it's going to get, unless we hear about more problems. Use the native app builds (TopicModelingTool.app and TopicModelingTool.zip) if you want simple, straightforward UTF-8 support.

senderle / topic-modeling-tool

Problems with UTF-8 support for Windows #48