locationtech / geotrellis

GeoTrellis is a geographic data processing engine for high performance applications.
http://geotrellis.io
Other
1.33k stars 360 forks source link

`ShapefileReader` in geotrellis-shapefile implicitly assumes ISO-8859-1 character encoding and does not read shapefiles with UTF-8 (or other charset) contents correctly #3445

Closed axb21 closed 2 years ago

axb21 commented 2 years ago

Describe the bug

When reading an ESRI shapefile that contains UTF-8 characters (e.g., foreign language properties), the data (properties) can look garbled. In fact, the reader is implicitly assuming the ISO-8859-1 character encoding via the geotools dependency.

Expected behavior

When reading ESRI shapefiles with UTF-8 characters, all characters should be read properly, possibly after caller sets the correct character set.

Environment

Additional context

The issue arises here: https://github.com/locationtech/geotrellis/blob/8541e5d97cdef71f6cd239cb6bb54a520e192002/shapefile/src/main/scala/geotrellis/shapefile/ShapeFileReader.scala#L52

By default, geotools's data store expects ISO-8859-1 as a string encoding: https://github.com/sigcorporativo-ja/panelGestion/blob/318def8a9eb1168d7bae9ee2c126fd8ff0c42129/src/main/java/org/geotools/data/shapefile/ShapefileDataStore.java#L143

geotools provides a setCharset(charset: java.nio.charset.Charset) method on data stores if you have properties other than ISO-8859-1. This way user can set the character set if they choose.

I propose an alternate implementation of readSimpleFeatures and appropriate modifications to/additions of helper functions:

  def readSimpleFeatures(
      url: URL,
      charSet: java.nio.charset.Charset
  ): Seq[SimpleFeature] = {
    // Extract the features as GeoTools 'SimpleFeatures'
    val ds = new ShapefileDataStore(url)
    ds.setCharset(charSet)
    val ftItr: SimpleFeatureIterator = ds.getFeatureSource.getFeatures.features

    try {
      val simpleFeatures = scala.collection.mutable.ListBuffer[SimpleFeature]()
      while (ftItr.hasNext) simpleFeatures += ftItr.next()
      simpleFeatures.toList
    } finally {
      ftItr.close
      ds.dispose
    }
  }

This works fine for me and is what I am using in my project. I regret that I cannot share a sample file, but I am not permitted to do that with the data I have. You can test it well enough by making a layer with properties that have e.g. Chinese or Korean characters.

pomadchin commented 2 years ago

Hey @axb21 thx for catching that! Makes sense; would you like to create a PR with the proposed changes?

Also, to make it a part of the minor release we would need to make this change non API breaking (by still keeping the old function signature that defaults to ISO).

axb21 commented 2 years ago

Hey @axb21 thx for catching that! Makes sense; would you like to create a PR with the proposed changes?

Also, to make it a part of the minor release we would need to make this change non API breaking (by still keeping the old function signature that defaults to ISO).

@pomadchin yes I can do that. It'll probably take me a few days -- busy time with $work right now! I will make the change additive and not break the existing API. I believe geotools uses the ISO encoding by default because of a constraint related to ESRI shapefiles, so it makes sense to leave that as the default in geotrellis too. But in some use cases, you will know that you have UTF-8 (or some other) character encoding, and it's nice to have the flexibility to specify it.