locationtech / rasterframes

Geospatial Raster support for Spark DataFrames
http://rasterframes.io
Apache License 2.0
243 stars 46 forks

spark.read.geotrellisCatalog do not use spark.hadoop.fs.s3a.* config correctly #457

Open jdenisgiguere opened 4 years ago

jdenisgiguere commented 4 years ago

Current situation

When I try to read a GeoTrellis catalog with an s3a:// URI using spark.read.geotrellisCatalog, I get the following error: https://gist.github.com/jdenisgiguere/61161a1bd9636ec91c3b75cbb6a845b9

A workaround is to first read data from the same bucket with the spark.read.parquet method. After this call, spark.read.geotrellisCatalog is able to read the data.

See: https://github.com/jdenisgiguere/rasterframes-minio-ZazJXB4U/blob/acaa3b1de2372a642223ff9b48abba9d8e208dd5/read-with-rasterframes0.9/src/main/scala/io/anagraph/zazjxb4u/RfBisReader.scala#L69-L72
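The workaround described above can be sketched as follows. This is a minimal, hypothetical example: the bucket name and catalog path are placeholders, and it assumes the RasterFrames 0.9 API from the linked repository.

```scala
import org.apache.spark.sql.SparkSession
import org.locationtech.rasterframes._

// Hypothetical local session; bucket and paths are placeholders.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("geotrellis-catalog-workaround")
  .getOrCreate()
  .withRasterFrames

// Workaround: a throwaway Parquet read against the same s3a:// bucket
// appears to initialize the S3A filesystem with the configured credentials.
spark.read.parquet("s3a://my-bucket/some-parquet-prefix")

// After the Parquet read, the catalog read succeeds.
val catalog = spark.read.geotrellisCatalog("s3a://my-bucket/geotrellis-catalog")
```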

Expected situation

A prior invocation of spark.read.parquet should not be required to read a GeoTrellis HadoopAttributeStore.

vpipkt commented 4 years ago

The second line of the Gist you posted points out that the credentials are missing.

In your posted code I do see the credentials explicitly set, which explains how the parquet read with s3a:// scheme works.

Can you double check that the credentials are set in the initial attempt to read geotrellisCatalog, without the parquet?
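For reference, a sketch of setting S3A credentials through the Spark session config. The endpoint and credential values are placeholders (the linked repository targets MinIO); the fs.s3a.* keys are standard hadoop-aws configuration, forwarded to Hadoop via the spark.hadoop. prefix.

```scala
import org.apache.spark.sql.SparkSession

// Placeholder values; replace with your own endpoint and credentials.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
  .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
  .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
  .config("spark.hadoop.fs.s3a.path.style.access", "true")
  .getOrCreate()
```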

jdenisgiguere commented 4 years ago

@vpipkt, to reproduce the issue, the only thing I do is comment out lines 71 and 72. If they are commented out, I get the error message; if they are not, spark.read.geotrellisCatalog works.

vpipkt commented 4 years ago

@metasim any reason why the geotrellis catalog reader would not honor the config options setting s3 credentials?

metasim commented 4 years ago

@vpipkt @jdenisgiguere I'm surprised it's not getting passed along, because under the covers it's actually pulling from native Spark datasources, which I'd expect to honor that. In other words, how would spark.read.json be any different from spark.read.parquet?

metasim commented 4 years ago

Actually, those calls are just parsing data that's been fetched by other operations. We probably need to look at the GeoTrellis APIs to see whether they pay attention to the Spark properties.

jdenisgiguere commented 4 years ago

This may also be an issue with the use of different Hadoop library versions in the final application build.
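If mismatched Hadoop versions are the cause, one way to rule that out is pinning the Hadoop artifacts in the application build so hadoop-aws and Spark's Hadoop runtime agree. A build.sbt sketch; the version shown is an assumption and should match your Spark distribution's Hadoop version.

```scala
// build.sbt (sketch): keep hadoop-aws in lockstep with the Hadoop
// version bundled with your Spark distribution. "2.8.5" is a placeholder.
val hadoopVersion = "2.8.5"

libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % hadoopVersion

dependencyOverrides += "org.apache.hadoop" % "hadoop-client" % hadoopVersion
```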