locationtech-labs / geopyspark

GeoTrellis for PySpark

Add ability to provide S3 credentials directly to geotiff.get() #656

Closed chasegarner closed 6 years ago

chasegarner commented 6 years ago

resolves #527

chasegarner commented 6 years ago

@jbouffard, let me know if the approach is correct. I faked the context config in the tests but used pyspark and some docstrings as a guide for what APIs to call.

chasegarner commented 6 years ago

Fixed. I think I mentioned that I blew up my test VM, so I'm letting Travis catch my goofs for now. If the extra commits clutter the history, I can squash them together if you guys care.

jbouffard commented 6 years ago

@chasegarner Ah, okay. Yeah, once everything is resolved it'd be nice if some of the commits were squashed.

chasegarner commented 6 years ago

Should be good to go.

jbouffard commented 6 years ago

@chasegarner :+1:

I'm going to test this branch out since our unit tests are bad at picking up S3 issues. If it's all good I'll merge this. Otherwise, I'll let you know what the problems were.

jbouffard commented 6 years ago

@chasegarner Unfortunately, this PR doesn't work at the moment due to how we access data on S3 with the get function. When we read files from S3, we do so via an AmazonS3Client. The client that is used in get uses a DefaultAWSCredentialsProviderChain instance (see here) to gather the credentials needed to access the files on S3. The problem with that is DefaultAWSCredentialsProviderChain doesn't look at the Spark settings when finding credentials, so setting them in the SparkConf does nothing.
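To make that concrete, here is roughly what the default chain does consult; this is a plain AWS SDK sketch, not code from get itself:

    import com.amazonaws.auth.DefaultAWSCredentialsProviderChain

    // The default chain resolves credentials from the standard AWS locations:
    // the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables, the
    // aws.accessKeyId / aws.secretKey system properties, the shared
    // ~/.aws/credentials file, or EC2 instance metadata. Anything set on a
    // SparkConf is invisible to it.
    val chain = new DefaultAWSCredentialsProviderChain()
    val credentials = chain.getCredentials()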

In order to get things working, we somehow need to pass in the credentials to the AmazonS3Client. I think the easiest thing to do would be to create the needed AmazonS3Client instance in the getS3Client value.

Something like this maybe:

    def setValues(
      sc: SparkContext,
      schema: String,
      intMap: Map[String, Int],
      stringMap: Map[String, String],
      partitionBytes: Option[Long]
    ): S3GeoTiffRDD.Options = {
      val crs: Option[CRS] =
        if (stringMap.contains("crs"))
          TileLayer.getCRS(stringMap("crs"))
        else
          None

      // Pull any explicitly supplied credentials from the Hadoop configuration.
      val accessKey = sc.hadoopConfiguration.get(s"fs.$schema.awsAccessKeyId", "")
      val secretKey = sc.hadoopConfiguration.get(s"fs.$schema.awsSecretAccessKey", "")

      val getS3Client: () => S3Client =
        stringMap.get("s3_client") match {
          case Some(client) =>
            if (client == "default" && accessKey.isEmpty)
              default.getS3Client
            else if (client == "default")
              // Build the client from the explicitly supplied credentials.
              () => AmazonS3Client(new BasicAWSCredentials(accessKey, secretKey), S3Client.defaultConfiguration)
            else if (client == "mock")
              () => new MockS3Client()
            else
              throw new Error(s"Could not find the given S3Client, $client")
          case None => default.getS3Client
        }

      S3GeoTiffRDD.Options(
        crs = crs,
        timeTag = stringMap.getOrElse("time_tag", default.timeTag),
        timeFormat = stringMap.getOrElse("time_format", default.timeFormat),
        maxTileSize = intMap.get("max_tile_size"),
        numPartitions = intMap.get("num_partitions"),
        partitionBytes = partitionBytes,
        chunkSize = intMap.get("chunk_size"),
        delimiter = stringMap.get("delimiter"),
        getS3Client = getS3Client
      )
    }
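For that lookup to find anything, the corresponding keys would of course have to be present in the SparkContext's Hadoop configuration beforehand. For example (assuming the s3 schema; the key names are just the ones the sketch above reads):

    // Hypothetical driver-side setup before calling geotiff.get()
    sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "<access key>")
    sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "<secret key>")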

What are your thoughts, @echeipesh?

I'm sorry for not noticing this earlier. I misunderstood how the credentials were passed into the AmazonS3Client. If you'd like, I could make a PR against your branch that adds the required Scala code and updates the Python side if needed.

jpolchlo commented 6 years ago

Ooof. This is starting to get at a significant problem with our general approach to S3 in GeoTrellis. It looks like we lack flexibility in the general API for using S3ValueReader with a non-default provider chain. This will likely require some issues to be created on the GT repo.

If it does get implemented and committed there, then you'll have to accept not using s3://...-formatted URIs and instead add an interface in GPS for creating an S3 source, since the SPI interface doesn't currently allow those credentials to be passed in the URI (nor would you want them to be?).

(Forgive me for not delving very deeply into your PR before sounding off on it, in case I missed some details in your implementation)

echeipesh commented 6 years ago

@jbouffard That's the correct solution here: AmazonS3Client(new BasicAWSCredentials(accessKey, secretKey), ...)

This probably should propagate upstream to GeoTrellis, where we create our own GeoTrellisCredentialProviderChain that either mirrors or composes over DefaultAWSCredentialsProviderChain but preferentially uses the key/value from Spark's HadoopConfiguration if it's set.
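A rough sketch of that idea; the chain class name matches the suggestion above, but the provider and configuration keys are assumptions rather than existing GeoTrellis code:

    import com.amazonaws.AmazonClientException
    import com.amazonaws.auth.{
      AWSCredentials, AWSCredentialsProvider, AWSCredentialsProviderChain,
      BasicAWSCredentials, DefaultAWSCredentialsProviderChain
    }
    import org.apache.hadoop.conf.Configuration

    // Provider that looks only at the keys we would set in the Hadoop configuration.
    class HadoopConfCredentialsProvider(conf: Configuration, schema: String)
        extends AWSCredentialsProvider {
      def getCredentials(): AWSCredentials = {
        val accessKey = conf.get(s"fs.$schema.awsAccessKeyId", "")
        val secretKey = conf.get(s"fs.$schema.awsSecretAccessKey", "")
        if (accessKey.isEmpty || secretKey.isEmpty)
          // Throwing tells an AWSCredentialsProviderChain to move on to the next provider.
          throw new AmazonClientException("No credentials found in the Hadoop configuration")
        else
          new BasicAWSCredentials(accessKey, secretKey)
      }
      def refresh(): Unit = ()
    }

    // Tries the Hadoop configuration first, then falls back to the usual AWS lookup.
    class GeoTrellisCredentialProviderChain(conf: Configuration, schema: String)
      extends AWSCredentialsProviderChain(
        new HadoopConfCredentialsProvider(conf, schema),
        new DefaultAWSCredentialsProviderChain()
      )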

jbouffard commented 6 years ago

@echeipesh I see. That's what I thought. I was looking into implementing our own ProviderChain for this case, but scrapped it since the other way is more straightforward. From what I saw, it looks like making our own GeoTrellisCredentialProviderChain should be pretty easy. That's something we could do for the 2.0 release if we have the time.