locationtech / geotrellis

GeoTrellis is a geographic data processing engine for high performance applications.
http://geotrellis.io
Other
1.33k stars 362 forks source link

Expose Charset in a ShapeFileReader API #3464

Closed vlulla closed 2 years ago

vlulla commented 2 years ago

Overview

This PR makes charset configurable, but defaults to ISO-8859-1 in order not to change bahavior of the old functions.

Checklist

Closes #3445

pomadchin commented 2 years ago

Let's do the following:

  1. We need a test that reproduces the error
  2. We need a test that makes sure that the reproduced error is fixed
  3. It looks like the change is slighly larger than two methods addition:
def readSimpleFeatures(path: String): Seq[SimpleFeature] = readSimpleFeatures(new URL(s"file://${new File(path).getAbsolutePath}"), DEFAULT_CHARSET)
def readSimpleFeatures(path: String, charSet: Charset): Seq[SimpleFeature] = readSimpleFeatures(new URL(s"file://${new File(path).getAbsolutePath}"), charSet)

def readSimpleFeatures(url: URL): Seq[SimpleFeature] = readSimpleFeatures(url, DEFAULT_CHARSET)
def readSimpleFeatures(url: URL, charSet: Charset): Seq[SimpleFeature] = ???

// but also all user API methods should have overloads

// ... (just an example below, all polygon methods should have the charset support) 
def readMultiPolygonFeatures(path: String): Seq[MultiPolygonFeature[Map[String,Object]]] =
  readMultiPolygonFeatures(path, DEFAULT_CHARSET)
def readMultiPolygonFeatures(path: String, charSet: Charset): Seq[MultiPolygonFeature[Map[String,Object]]] =
  readSimpleFeatures(path, charSet)
    .flatMap { ft => ft.geom[MultiPolygon].map(MultiPolygonFeature(_, ft.attributeMap)) }
// ...

The test may look this way:

class ShapeFileReaderSpec extends AnyFunSpec with Matchers {
  describe("ShapeFileReader") {
    it("should read MultiPolygons feature attributes") {
      val path = "shapefile/data/shapefiles/demographics/demographics.shp"
      val features = ShapeFileReader.readMultiPolygonFeatures(path)
      features.size should be (160)
      for(MultiPolygonFeature(_: MultiPolygon, data: Map[String, Object]) <- features) {
        data.keys.toSeq should be (Seq("LowIncome", "gbcode", "ename", "WorkingAge", "TotalPop", "Employment"))
      }
    }

    // https://github.com/locationtech/geotrellis/issues/3445
    it("should read UTF-8 MultiPolygons feature attributes") {
      val path = "shapefile/data/shapefiles/demographics-utf8/demographics.shp"
      val features = ShapeFileReader.readMultiPolygonFeatures(path, Charset.forName("UTF-8"))
      features.size should be (160)

      features.take(4).map(_.data("ename").asInstanceOf[String]) shouldBe Seq("南关街道", "七里烟香", "谢庄镇", "Cheng Guan Zhen")
    }
  }
}

Note: I pushed a test file that should work with the test above.

vlulla commented 2 years ago

OOPS, i did not see your later comments and tried to fix the typo....sorry!

vlulla commented 2 years ago

I have made all the changes that you had mentioned. Please let me know if I have still missed anything.

vlulla commented 2 years ago

Thank you for the patient guidance! It has been very educational.

axb21 commented 2 years ago

Ah great! I said in the other issue that I'd make a pull request for this but then got distracted by life. Sorry about that, but I'm glad to see @vlulla jumped in and did it!