If you run the indexer in a non-UTF-8 locale, the data sent to Solr is not properly encoded (this was noticed when inspecting the text extracted from a test WARC on the Katacoda system).
Really, the code should enforce UTF-8 encoding, so we should review it for any places that create Strings from bytes without specifying the encoding. However, the underlying libraries may have deliberately chosen to use the platform encoding (a perfectly reasonable choice), in which case we should at least warn users when they are running in a non-UTF-8 locale and encourage them to use e.g. -Dfile.encoding=UTF-8 (see here), which should do the trick.
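To make the two options concrete, here is a minimal sketch in plain Java. It is not taken from the indexer code: the class name EncodingCheck and the methods decodeUtf8 and warningFor are hypothetical, and only standard JDK calls are used. It shows decoding bytes with an explicit charset rather than the platform default, and a start-up check that could produce a warning when the JVM default charset is not UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of the indexer.
final class EncodingCheck {

    // Decode bytes as UTF-8 regardless of the platform locale.
    // new String(bytes) without a charset would use the JVM default instead.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Return a warning for a non-UTF-8 charset, or an empty string for UTF-8.
    static String warningFor(Charset charset) {
        if (StandardCharsets.UTF_8.equals(charset)) {
            return "";
        }
        return "Default charset is " + charset.name()
                + "; text sent to Solr may be mis-encoded. "
                + "Consider running with -Dfile.encoding=UTF-8";
    }

    public static void main(String[] args) {
        // Charset.defaultCharset() reflects the locale / file.encoding in effect.
        String warning = warningFor(Charset.defaultCharset());
        if (!warning.isEmpty()) {
            System.err.println("WARNING: " + warning);
        }
    }
}
```

The same review would also apply to String.getBytes() and to readers or writers constructed without a charset argument, since those fall back to the platform default as well.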