chasegarner closed this 6 years ago
@jbouffard, let me know if the approach is correct. I faked the context config in the tests but used pyspark and some docstrings as a guide for what APIs to call.
Fixed. I think I mentioned that I blew up my test VM, so I'm letting Travis catch my goofs for now. If it clutters the commit history, I can squash them together if you guys care.
@chasegarner Ah, okay. Yeah, once everything is resolved it'd be nice if some of the commits were squashed.
Should be good to go.
@chasegarner :+1:
I'm going to test this branch out since our unit tests are bad at picking up S3 issues. If it's all good I'll merge this. Otherwise, I'll let you know what the problems were.
@chasegarner Unfortunately, this PR doesn't work at the moment due to how we access data on S3 with the get function. When we read files from S3, we do so via an AmazonS3Client. The client used in get relies on a DefaultAWSCredentialsProviderChain instance (see here) to gather the credentials needed to access the files on S3. The problem with that is DefaultAWSCredentialsProviderChain doesn't look at the Spark settings when finding credentials, so setting them in the SparkConf does nothing.
In order to get things working, we somehow need to pass the credentials to the AmazonS3Client ourselves. I think the easiest thing to do would be to create the needed AmazonS3Client instance in the getS3Client value.
Something like this maybe:
def setValues(
  sc: SparkContext,
  schema: String,
  intMap: Map[String, Int],
  stringMap: Map[String, String],
  partitionBytes: Option[Long]
): S3GeoTiffRDD.Options = {
  val crs: Option[CRS] =
    if (stringMap.contains("crs"))
      TileLayer.getCRS(stringMap("crs"))
    else
      None

  // Read the credentials from the Hadoop configuration rather than the SparkConf,
  // since DefaultAWSCredentialsProviderChain never sees Spark settings.
  val accessKey = sc.hadoopConfiguration.get(s"fs.$schema.awsAccessKeyId", "")
  val secretKey = sc.hadoopConfiguration.get(s"fs.$schema.awsSecretAccessKey", "")

  val getS3Client: () => S3Client =
    stringMap.get("s3_client") match {
      case Some(client) =>
        if (client == "default" && accessKey.isEmpty)
          default.getS3Client
        else if (client == "default")
          () => AmazonS3Client(new BasicAWSCredentials(accessKey, secretKey), S3Client.defaultConfiguration)
        else if (client == "mock")
          () => new MockS3Client()
        else
          throw new Error(s"Could not find the given S3Client, $client")
      case None => default.getS3Client
    }

  S3GeoTiffRDD.Options(
    crs = crs,
    timeTag = stringMap.getOrElse("time_tag", default.timeTag),
    timeFormat = stringMap.getOrElse("time_format", default.timeFormat),
    maxTileSize = intMap.get("max_tile_size"),
    numPartitions = intMap.get("num_partitions"),
    partitionBytes = partitionBytes,
    chunkSize = intMap.get("chunk_size"),
    delimiter = stringMap.get("delimiter"),
    getS3Client = getS3Client
  )
}
What are your thoughts, @echeipesh?
I'm sorry for not noticing this earlier. I misunderstood how the credentials were passed into the AmazonS3Client. If you'd like, I could make a PR against your branch that adds the required Scala code and updates the Python side if needed.
Ooof. This is starting to get at a significant problem with our general approach to S3 in GeoTrellis. It looks like we lack flexibility in the general API for using S3ValueReader with a non-default provider chain. This will likely require some issues to be created on the GT repo.
If it does get implemented and committed there, then you'll have to accept not using s3://...-formatted URIs, but rather making an interface in GPS for creating an S3 source, since the SPI interface doesn't currently allow for those credentials to be passed in the URI (nor would you want to?).
(Forgive me for not delving very deeply into your PR before sounding off on it, in case I missed some details in your implementation)
@jbouffard That's the correct solution here: AmazonS3Client(new BasicAWSCredentials(accessKey, secretKey), ...).
This probably should propagate upstream to GeoTrellis, where we create our own GeoTrellisCredentialProviderChain that either mirrors or composes over DefaultAWSCredentialsProviderChain but preferentially uses the key/value from Spark's HadoopConfiguration if it's set.
@echeipesh I see. That's what I thought. I was looking into implementing our own ProviderChain for this case, but scrapped it since the other way is more straightforward. From what I saw, it looks like making our own GeoTrellisCredentialProviderChain should be pretty easy. That's something we could do for the 2.0 release if we have the time.
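For reference, here's a rough sketch of what that chain might look like against the v1 AWS SDK. The class names and the fs.&lt;schema&gt;.* key convention follow the snippet earlier in the thread; everything here is an assumption about a possible design, not existing GeoTrellis API:

import com.amazonaws.AmazonClientException
import com.amazonaws.auth._
import org.apache.hadoop.conf.Configuration

// Hypothetical provider that reads credentials from the Hadoop configuration.
// Throwing when the keys are absent is how a provider signals "not available"
// to AWSCredentialsProviderChain, which then moves on to the next provider.
class HadoopConfigCredentialsProvider(conf: Configuration, schema: String = "s3")
    extends AWSCredentialsProvider {
  def getCredentials: AWSCredentials = {
    val accessKey = conf.get(s"fs.$schema.awsAccessKeyId", "")
    val secretKey = conf.get(s"fs.$schema.awsSecretAccessKey", "")
    if (accessKey.isEmpty || secretKey.isEmpty)
      throw new AmazonClientException("No AWS credentials found in the Hadoop configuration")
    new BasicAWSCredentials(accessKey, secretKey)
  }
  def refresh(): Unit = ()
}

// Hypothetical chain: prefer the Spark/Hadoop settings, fall back to the
// usual AWS lookup (env vars, system properties, profiles, instance roles).
class GeoTrellisCredentialProviderChain(conf: Configuration)
    extends AWSCredentialsProviderChain(
      new HadoopConfigCredentialsProvider(conf),
      new DefaultAWSCredentialsProviderChain()
    )

The composition approach keeps the default chain's behavior intact for users who never touch the Hadoop configuration, which seems preferable to mirroring it.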
resolves #527