Closed axb21 closed 2 years ago
Hey @axb21 thx for catching that! Makes sense; would you like to create a PR with the proposed changes?
Also, to make it a part of the minor release we would need to make this change non API breaking (by still keeping the old function signature that defaults to ISO).
Hey @axb21 thx for catching that! Makes sense; would you like to create a PR with the proposed changes?
Also, to make it a part of the minor release we would need to make this change non API breaking (by still keeping the old function signature that defaults to ISO).
@pomadchin yes I can do that. It'll probably take me a few days -- busy time with $work right now! I will make the change additive and not break the existing API. I believe geotools uses the ISO encoding by default because of a constraint related to ESRI shapefiles, so it makes sense to leave that as the default in geotrellis too. But in some use cases, you will know that you have UTF-8 (or some other) character encoding, and it's nice to have the flexibility to specify it.
Describe the bug
When reading an ESRI shapefile that contains UTF-8 characters (e.g., foreign language properties), the data (properties) can look garbled. In fact, the reader is implicitly assuming the ISO-8859-1 character encoding via the geotools dependency.
Expected behavior
When reading ESRI shapefiles with UTF-8 characters, all characters should be read properly, possibly after caller sets the correct character set.
Environment
Additional context
The issue arises here: https://github.com/locationtech/geotrellis/blob/8541e5d97cdef71f6cd239cb6bb54a520e192002/shapefile/src/main/scala/geotrellis/shapefile/ShapeFileReader.scala#L52
By default, geotools's data store expects ISO-8859-1 as a string encoding: https://github.com/sigcorporativo-ja/panelGestion/blob/318def8a9eb1168d7bae9ee2c126fd8ff0c42129/src/main/java/org/geotools/data/shapefile/ShapefileDataStore.java#L143
geotools provides a
setCharset(charset: java.nio.charset.Charset)
method on data stores if you have properties other than ISO-8859-1. This way user can set the character set if they choose.I propose an alternate implementation of
readSimpleFeatures
and appropriate modifications to/additions of helper functions:This works fine for me and is what I am using in my project. I regret that I cannot share a sample file, but I am not permitted to do that with the data I have. You can test it well enough by making a layer with properties that have e.g. Chinese or Korean characters.