locationtech / geotrellis

GeoTrellis is a geographic data processing engine for high performance applications.
http://geotrellis.io
Other
1.34k stars 360 forks source link

Default configuration S3Client #2923

Open moradology opened 5 years ago

moradology commented 5 years ago

Classes which are summoned via the ClassLoader mechanism (as in the SPI pattern) must not have parameters in their constructors. A CassandraLayerReader and an S3LayerReader are both instanced through the same method which takes a string and delegates to the appropriate SPI class depending on the pattern found in said string. This is handy, but it elides important details in at least the S3 case: S3Client objects are complicated and the 'Default Credential Provider' chain won't always be sufficient.

Given these two constraints (no parameters and the need to configure in certain cases), we need either

  1. a singleton which registers a default mechanism for creating AWS clients,
  2. SPI machinery to pull the desired class at runtime, or
  3. static configuration which is sufficient to the S3ClientBuilder API
echeipesh commented 5 years ago

Related: https://github.com/locationtech/geotrellis/pull/2911

Implementation of option 1: https://github.com/locationtech/geotrellis/blob/667d67626051d5bf45a7b9eeb187ea8db192aba5/s3/src/main/scala/geotrellis/spark/io/s3/S3ClientProducer.scala

CloudNiner commented 5 years ago

@moradology @echeipesh could one of you clarify the ask here?

As far as I can tell, it appears that (1) is already implemented in master as S3ClientProducer and is used in a number of places: Screen Shot 2019-10-01 at 9 16 05 AM

echeipesh commented 5 years ago

@CloudNiner S3ClientProducer does not use SPI, so the only way to change the default client used through-out is by calling the static S3ClientProducer.set method.

Assertion: changing the default timeout and back-off settings on the client is a very common use case. Being able to provider a client directly to layer/writer/reader covers it partially but is not sufficient. This is actually not much of assertion, we've seen this many times.

Opinion: Having a global mutable setting for current client is not desirable long term. Its not standard and potentially dangerous if in client code it gets set in multiple places or out of order. Both of which are possible as projects grow and code modules get combined.

Ask: Replace the global state with boring SPI

Use cases:

Open question: Should the SPI provide and instance of the client or just the configuration for it ? If we ask only for configuration we can ensure that we re-use a single s3 client across the application. Whereas if we asked for the client this would have to be an additional consideration for the user. I believe the AWS docs state that re-using the client is desirable.