locationtech / rasterframes

Geospatial Raster support for Spark DataFrames
http://rasterframes.io
Apache License 2.0

Allow S3 endpoint to be specified for non-commercial use cases #577

Open petersedivec opened 2 years ago

petersedivec commented 2 years ago

We are using pyrasterframes v0.8.4 because we've been unable to get a more recent version running in Databricks. We are attempting to load rasterframes from S3, but have been getting the following error:

at org.locationtech.rasterframes.expressions.generators.RasterSourceToRasterRefs.eval(RasterSourceToRasterRefs.scala:64) ... 25 more
Caused by: java.net.UnknownHostException: s3.amazonaws.com

We just noticed the UnknownHostException in the stack trace and realized that, because we're working in a classified environment, s3.amazonaws.com won't resolve. As a workaround, we are now successfully loading the rasterframes with pyrasterframes by manually copying the file from S3 and then loading it from local DBFS file storage.
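For anyone else stuck on the same problem, the copy-then-load workaround can be sketched roughly as follows. This is a minimal sketch, not the exact code we ran: the helper names split_s3_uri and download_from_custom_s3 are hypothetical, and it assumes boto3 is installed on the cluster and that the private endpoint accepts the credentials boto3 picks up by default.

```python
import os
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key). Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def download_from_custom_s3(uri, local_dir, endpoint_url):
    """Copy one object from an S3-compatible endpoint into local_dir.

    Returns the local file path. endpoint_url is the full URL of the
    private endpoint (an assumption; supply your own).
    """
    import boto3  # imported lazily so split_s3_uri works without boto3

    bucket, key = split_s3_uri(uri)
    local_path = os.path.join(local_dir, os.path.basename(key))
    # endpoint_url overrides the default s3.amazonaws.com host
    s3 = boto3.client("s3", endpoint_url=endpoint_url)
    s3.download_file(bucket, key, local_path)
    return local_path
```

The returned path can then be handed to pyrasterframes like any other local file, which avoids the s3.amazonaws.com lookup entirely.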

We had started playing around with rasterio and were able to load and use it in Databricks. Rasterio allows the S3 endpoint to be specified: with rasterio.Env(AWSSession(session), AWS_S3_ENDPOINT='data.cloudferro.com', AWS_HTTPS='NO') as env
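Spelled out a little more, the rasterio approach looks roughly like the sketch below. The helper names s3_endpoint_options and read_first_band are hypothetical, and AWS_VIRTUAL_HOSTING='FALSE' is an assumption about the endpoint (many S3-compatible stores need path-style URLs); drop it if your endpoint supports virtual-hosted buckets.

```python
def s3_endpoint_options(endpoint, use_https=True):
    """Build GDAL config options pointing S3 reads at a custom endpoint.

    endpoint is host[:port] without a scheme, e.g. 'data.cloudferro.com'.
    """
    return {
        "AWS_S3_ENDPOINT": endpoint,
        "AWS_HTTPS": "YES" if use_https else "NO",
        # Assumption: the endpoint expects path-style bucket addressing.
        "AWS_VIRTUAL_HOSTING": "FALSE",
    }


def read_first_band(uri, endpoint, use_https=True):
    """Open an s3:// raster through the custom endpoint and read band 1."""
    # Lazy imports so the options helper works without these packages.
    import boto3
    import rasterio
    from rasterio.session import AWSSession

    session = boto3.Session()  # picks up credentials from the environment
    options = s3_endpoint_options(endpoint, use_https)
    with rasterio.Env(AWSSession(session), **options):
        with rasterio.open(uri) as src:
            return src.read(1)
```

Something equivalent to the endpoint option here is what we would like to be able to pass through pyrasterframes.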

We attempted to follow the different acceptable URI formats, including what was referenced in https://github.com/locationtech/rasterframes/issues/38

We've been unable to get our S3 endpoint to work in either s3 or https URI format. Please add an option to pass in an S3 endpoint.

pomadchin commented 2 years ago

Hey @petersedivec, yes indeed, this functionality is not exposed through the Py API, but it should not be complicated to add.

The trick is to expose the S3ClientProducer API in an acceptable form.

cc @metasim