tkasu / imdb-list-analyzer

Analysis tools for movie rating lists exported from the Internet Movie Database (IMDb)

Nordic characters are not displayed correctly #9

Open tkasu opened 5 years ago

tkasu commented 5 years ago

See e.g. "Movies with unexpected ratings, Chart 1".

Is this a new bug caused by the new CSV file?

dresa commented 5 years ago

It's likely that when IMDb decided to change the export format for the ratings, they also changed the default text encoding of the CSV file, which is now ANSI cp1252. Classic encoding issue, that is.

If I remember correctly, we have supported UTF-8 without BOM, and I'd like to keep it that way. I just replaced the new-format example rating files and committed. The filenames are the same, but the encoding has been switched from the default ANSI cp1252 to UTF-8 without BOM.

You should be able to pull my changes and everything should work, meaning that accented characters should appear correctly; they did in my browser.
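For anyone who wants to convert their own export the same way, here is a minimal sketch of the re-encoding step. The function name `reencode-file!` and the file names are made up for illustration and are not part of imdb-list-analyzer; it only assumes the input really is windows-1252 text.

```clojure
;; Sketch only: `reencode-file!` is a hypothetical helper, not project code.
(defn reencode-file!
  "Reads in-path as text encoded in from-enc and writes it to out-path
   encoded in to-enc. The JVM's UTF-8 writer does not emit a BOM."
  [in-path out-path from-enc to-enc]
  (spit out-path
        (slurp in-path :encoding from-enc)
        :encoding to-enc))

(comment
  ;; Hypothetical file names.
  (reencode-file! "ratings_cp1252.csv" "ratings_utf8.csv"
                  "windows-1252" "UTF-8"))
```

This reads the whole file into memory, which should be fine for a ratings export but is not meant for very large files.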

tkasu commented 5 years ago

Merged file changes in 47aa480, example files work now.

It seems that encoding is defined in:

;imdb_data.clj

"Local encoding constant: to interpret special Western characters correctly,
 such as Scandinavian characters, make a best guess for the encoding.
 It could be, for example, 'UTF-8' or 'windows-1252'."
(def local-encoding (.name (Charset/defaultCharset)))

So your opinion is that we are not going to change that? Should we add a note to the README.md and Web GUI?

One alternative could be that the imdb-list-analyzer core supports an optional command-line argument for the encoding that overrides the above setting. It could be passed as an additional argument to the analysis functions, and could therefore also be used from server.clj. In the Web GUI there could be a checkbox that lets users choose the encoding.
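A rough sketch of what I mean by an optional override, assuming the encoding would be the second command-line argument. `read-ratings` and `encoding-from-args` are hypothetical names used only for illustration, not existing functions in the project; `local-encoding` is repeated here so the snippet stands alone.

```clojure
(import 'java.nio.charset.Charset)

;; Same idea as the constant in imdb_data.clj: the platform default.
(def local-encoding (.name (Charset/defaultCharset)))

(defn read-ratings
  "Hypothetical reader: slurps filename using the given encoding,
   falling back to local-encoding when none is given."
  ([filename] (read-ratings filename local-encoding))
  ([filename encoding] (slurp filename :encoding encoding)))

(defn encoding-from-args
  "Returns the second command-line argument if present,
   otherwise the platform default (assumed argument order)."
  [args]
  (or (second args) local-encoding))

(comment
  ;; Hypothetical usage from -main:
  (read-ratings (first args) (encoding-from-args args)))
```

The two-arity function keeps existing callers working, while server.clj could pass whatever encoding the user picked in the GUI.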

dresa commented 5 years ago

That's a very good idea.

I finally managed to enable command-line arguments for setting the encoding of both inbound and outbound data. I ended up making a few other changes as well: updating the example files and their names (resources/rates_A.csv is the example file from now on), some refactoring, etc. I listed the changes in the commit message: https://github.com/dresa/imdb-list-analyzer/commit/96d040d93fca9a4e37e44b1f2ebae671349c96a1. It worked on my machine; let's hope it's OK on other machines, too.

Since IMDb now offers the ratings file in windows-1252 encoding, that should be our default encoding option for incoming data. Having the option to choose an encoding in the GUI (next to the upload buttons) sounds great. I added the function full-one-analysis-results (filename + in-encoding) to make the integration easier for you, I hope.

The supported encodings are the ones the JVM supports.
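If the GUI or the command line is going to accept an encoding name from the user, it might be worth checking it against the JVM first. A minimal sketch using `java.nio.charset.Charset`; the helper name `supported-encoding?` is hypothetical and not something that exists in the project.

```clojure
(import 'java.nio.charset.Charset)

(defn supported-encoding?
  "True if this JVM has a charset with the given name.
   Charset/isSupported throws on nil or syntactically illegal names,
   so those are treated as unsupported here."
  [enc-name]
  (try
    (Charset/isSupported enc-name)
    (catch IllegalArgumentException _ false)))

(comment
  ;; Names of all charsets available on the running JVM,
  ;; e.g. to populate a dropdown in the GUI.
  (keys (Charset/availableCharsets)))
```

The exact set of charsets depends on the JVM installation, so the list is best computed at runtime rather than hard-coded.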