ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Ensure UTF-8 locale is in use, or at least warn if not. #190

Open anjackson opened 6 years ago

anjackson commented 6 years ago

If you run the indexer in a non-UTF-8 locale, the data sent to Solr is not properly encoded (this was noticed when inspecting the text extracted from a test WARC on the Katacoda system).
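The symptom above can be illustrated with a minimal sketch. It is not taken from the indexer code: the class `DecodingDemo` and its `decode` helper are hypothetical names, and ISO-8859-1 is only assumed here as a stand-in for a non-UTF-8 platform default. It shows how UTF-8 bytes decoded with the wrong charset turn into mojibake before they ever reach Solr.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of webarchive-discovery.
public class DecodingDemo {

    // Decodes bytes with an explicitly named charset instead of the platform default.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "café" written with a Unicode escape so the source file encoding does not matter.
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Correct: the bytes are UTF-8, so decode them as UTF-8.
        System.out.println(decode(utf8Bytes, StandardCharsets.UTF_8));

        // Incorrect: simulates a platform default of ISO-8859-1 (an assumption),
        // where the two-byte sequence for "é" becomes two separate characters.
        System.out.println(decode(utf8Bytes, StandardCharsets.ISO_8859_1));

        // What new String(bytes) without a charset would use on this JVM.
        System.out.println("Platform default: " + Charset.defaultCharset());
    }
}
```

Calling `new String(bytes)` with no charset argument behaves like the second call whenever the platform default is not UTF-8, which is why such call sites are the ones worth reviewing.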

Really, the code should enforce UTF-8 encoding, so we should review for any code that creates Strings from bytes without specifying the encoding. However, it may be that the underlying libraries deliberately chose to use the platform encoding (a perfectly reasonable choice), in which case we should at least warn users when they are running in a non-UTF-8 locale and encourage them to use e.g. -Dfile.encoding=UTF-8 (see here), which should do the trick.
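A rough sketch of the startup warning suggested above, assuming it would be called once when the indexer starts. The class `LocaleCheck`, the method `warningFor`, and the message text are all hypothetical; only `Charset.defaultCharset()` and `StandardCharsets.UTF_8` are standard JDK API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of webarchive-discovery.
public class LocaleCheck {

    // Returns a warning message if the given charset is not UTF-8, otherwise null.
    static String warningFor(Charset charset) {
        if (StandardCharsets.UTF_8.equals(charset)) {
            return null;
        }
        return "Default charset is " + charset.name()
                + ", not UTF-8; extracted text may be mis-encoded. "
                + "Consider running with -Dfile.encoding=UTF-8";
    }

    public static void main(String[] args) {
        // Check the charset the JVM actually picked up from the environment.
        String warning = warningFor(Charset.defaultCharset());
        if (warning != null) {
            System.err.println("WARNING: " + warning);
        }
    }
}
```

Taking the charset as a parameter rather than reading the default inside the method keeps the check testable without changing the JVM's locale.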

tokee commented 5 years ago

I have tried reproducing this without luck. A test-WARC or a unit-test would be appreciated.