locationtech / rasterframes

Geospatial Raster support for Spark DataFrames
http://rasterframes.io
Apache License 2.0

Issue reading geotrellis (2.3.3) catalog with rasterframes 0.8.4 with s3 backend when using a custom S3 Producer #444

Open jdenisgiguere opened 4 years ago

jdenisgiguere commented 4 years ago

Current situation

I have a geotrellis catalog using the S3 backend. The catalog and data are stored on a minio server. I'm using Geotrellis v2.3.3.

When I try to access the catalog with rasterframes v0.8.4, I get the following error messages:

scala> catalogUri
res14: java.net.URI = s3a://geoimagery/geotrellis_geoimagery/

scala> spark.read.geotrellisCatalog(catalogUri)
scala.MatchError: List(metadata__geoimagery_2002__0.json) (of class scala.collection.immutable.$colon$colon)
  at geotrellis.spark.io.hadoop.HadoopAttributeStore$$anonfun$layerIds$1.apply(HadoopAttributeStore.scala:148)
  at geotrellis.spark.io.hadoop.HadoopAttributeStore$$anonfun$layerIds$1.apply(HadoopAttributeStore.scala:147)
  at scala.collection.immutable.List.map(List.scala:284)
  at geotrellis.spark.io.hadoop.HadoopAttributeStore.layerIds(HadoopAttributeStore.scala:147)
  at org.locationtech.rasterframes.datasource.geotrellis.GeoTrellisCatalog$GeoTrellisCatalogRelation.layers$lzycompute(GeoTrellisCatalog.scala:76)
  at org.locationtech.rasterframes.datasource.geotrellis.GeoTrellisCatalog$GeoTrellisCatalogRelation.layers(GeoTrellisCatalog.scala:64)
  at org.locationtech.rasterframes.datasource.geotrellis.GeoTrellisCatalog$GeoTrellisCatalogRelation.schema(GeoTrellisCatalog.scala:103)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:403)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.locationtech.rasterframes.datasource.geotrellis.package$DataFrameReaderHasGeotrellisFormat.geotrellisCatalog(package.scala:52)
  ... 63 elided

Expected situation

I would expect to be able to read the catalog with this configuration.

Detailed environment

vpipkt commented 4 years ago

My only guess here concerns the GeoTrellis version: which version was the catalog created with?

Since the error is thrown in the geotrellis.spark.io.hadoop package, that's where I would go looking for changes. It looks like the packages have been reorganized in the GT 3.x series, but I'm not familiar with the backward-compatibility situation for catalogs and layers.

jdenisgiguere commented 4 years ago

Thank you @vpipkt for your quick answer.

We use Geotrellis 2.3.3 which is the version required for rasterframes 0.8.4 according to project/RFDependenciesPlugin.scala.

I would expect to see S3AttributeStore instead of HadoopAttributeStore for a URI with the prefix s3a://.

vpipkt commented 4 years ago

Just a hunch here: maybe geotrellis.spark.io.s3.S3LayerProvider is not on the classpath? Or perhaps META-INF/services/geotrellis.spark.io.AttributeStoreProvider is not listing geotrellis.spark.io.s3.S3LayerProvider?
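
One way to check that hunch is to list what the classpath actually registers under that service file. The following is a minimal sketch using only the standard library; the object and method names are hypothetical, and the service file name is the one mentioned above for the GeoTrellis 2.x package layout.

```scala
import scala.collection.JavaConverters._
import scala.io.Source

object ListAttributeStoreProviders {
  // Service file name as given in the comment above (assumed GeoTrellis 2.x layout).
  val ServiceFile = "META-INF/services/geotrellis.spark.io.AttributeStoreProvider"

  /** Collects every class name registered under `serviceFile` across all jars on the classpath. */
  def registeredProviders(serviceFile: String = ServiceFile): List[String] = {
    val loader = Thread.currentThread().getContextClassLoader
    loader.getResources(serviceFile).asScala.toList.flatMap { url =>
      val src = Source.fromInputStream(url.openStream(), "UTF-8")
      try {
        // Service files allow '#' comments and blank lines; drop both.
        src.getLines().map(_.takeWhile(_ != '#').trim).filter(_.nonEmpty).toList
      } finally src.close()
    }
  }

  def main(args: Array[String]): Unit = {
    val providers = registeredProviders()
    if (providers.isEmpty) println(s"No entries found for $ServiceFile")
    else providers.foreach(println)
  }
}
```

Pasted into the same spark-shell session, calling `ListAttributeStoreProviders.registeredProviders()` should show whether geotrellis.spark.io.s3.S3LayerProvider appears in the list.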

metasim commented 4 years ago

@jdenisgiguere do you happen to have a public version of s3a://geoimagery/geotrellis_geoimagery/ we could use to replicate the issue?

jdenisgiguere commented 4 years ago

I created a git repo with data to reproduce this issue: https://github.com/jdenisgiguere/rasterframes-minio-ZazJXB4U

The repo also contains code to read the Geotrellis Layer with Geotrellis v2.3.3 and a non-working attempt to read the same data with rasterframes 0.8.5. I have an issue with the management of Hadoop versions in the latter.

Thanks in advance for your help.

jdenisgiguere commented 4 years ago

I pushed a new commit to the proof of concept with rasterframes 0.8.5. This is my latest stack trace: https://gist.github.com/jdenisgiguere/fe3d274d1baf2ba2730c920ff8abd128

jdenisgiguere commented 4 years ago

@vpipkt, you gave me a precious hint 3 weeks ago, but I did not have enough background to understand it well. So, with the s3a:// protocol, the data is expected to come from a Hadoop data store, whereas s3:// uses the plain AWS Java SDK.
The Geotrellis documentation explains how to configure an S3Provider to use minio, but I don't know how to do this with rasterframes.
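
To make the scheme distinction concrete, here is a rough illustration of choosing a backend from the URI scheme alone. It is a sketch of the idea, not GeoTrellis's actual provider lookup; `SchemeDispatch`, `Backend` and `backendFor` are hypothetical names, and the set of Hadoop schemes is an assumption.

```scala
import java.net.URI

object SchemeDispatch {
  sealed trait Backend
  case object AwsSdkS3 extends Backend // plain AWS SDK client (the S3AttributeStore side)
  case object HadoopFs extends Backend // Hadoop FileSystem API (the HadoopAttributeStore side)

  /** Hypothetical helper: picks a backend by looking only at the URI scheme. */
  def backendFor(uri: URI): Option[Backend] =
    Option(uri.getScheme).map(_.toLowerCase) match {
      case Some("s3") => Some(AwsSdkS3)
      // Assumed list of schemes served through Hadoop file systems.
      case Some("s3a") | Some("s3n") | Some("hdfs") | Some("file") => Some(HadoopFs)
      case _ => None
    }
}
```

Under this reading, the catalogUri from the original report (s3a://geoimagery/geotrellis_geoimagery/) lands on the Hadoop side, which matches the HadoopAttributeStore frames in the stack trace.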

I could also modify my backend to save data in Geotrellis with a HadoopLayerWriter. Since we cannot use Minio as an s3a storage source with the default Hadoop version bundled with Spark 2.4.4 (Hadoop v2.7), there is more to learn before pyrasterframes can be used this way.
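
For reference, the s3a route would roughly need the settings below. This is a sketch under assumptions: it presumes a Hadoop build newer than the bundled 2.7 (the path-style option is, as far as I know, not available there), and the endpoint and credentials are placeholders. `MinioS3aConf` and its methods are hypothetical names.

```scala
object MinioS3aConf {
  /** Hadoop s3a options for an S3-compatible endpoint such as minio. All values are placeholders. */
  def s3aOptions(endpoint: String, accessKey: String, secretKey: String): Map[String, String] = Map(
    "fs.s3a.endpoint" -> endpoint,
    "fs.s3a.access.key" -> accessKey,
    "fs.s3a.secret.key" -> secretKey,
    // Address buckets as endpoint/bucket rather than bucket.endpoint.
    "fs.s3a.path.style.access" -> "true",
    "fs.s3a.connection.ssl.enabled" -> endpoint.startsWith("https").toString
  )

  /** Prefixes each key with `spark.hadoop.` so that Spark copies it into the Hadoop configuration. */
  def asSparkConf(options: Map[String, String]): Map[String, String] =
    options.map { case (k, v) => s"spark.hadoop.$k" -> v }
}
```

The resulting map could be passed to spark-shell as `--conf key=value` pairs, or applied one entry at a time with `SparkSession.builder().config(key, value)`.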

jdenisgiguere commented 4 years ago

To use Geotrellis S3 backend with Minio, you cannot provide only the Layer URI. You also need to provide the s3Client. https://github.com/locationtech/geotrellis/blob/master/s3/src/main/scala/geotrellis/store/s3/S3AttributeStore.scala#L43

If I understand correctly, we cannot currently provide this parameter when we want to read a geotrellis layer or a geotrellis catalog with rasterframes. https://github.com/locationtech/rasterframes/blob/develop/datasource/src/main/scala/org/locationtech/rasterframes/datasource/geotrellis/GeoTrellisRelation.scala#L62-L68

@vpipkt, if you think this is appropriate, this could be tagged as an enhancement, or closed since it is working as expected.