This PR introduces a configurable, dual-layer LRU cache for the S3IndexInput implementation. It also addresses a previously known issue where slicing the index input caused unnecessary memory consumption.
Data is now downloaded to disk in 100MB (configurable) chunks, up to a total of 200GB (configurable). The chunks are kept on disk in an LRU cache, which in turn feeds a much smaller 1GB (configurable) LRU cache on the heap that reads 2MB (configurable) chunks from the disk. All caches have been implemented using Caffeine, the successor to the Guava LoadingCache.
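To make the read path concrete, here is a minimal sketch of a two-tier, read-through LRU chunk cache. It is not the code in this PR: it swaps Caffeine for a plain `LinkedHashMap` in access order so that it runs on the JDK alone, keeps "disk" chunks in memory, and uses hypothetical names (`TwoTierChunkCacheSketch`, `remoteFetch`, `readByte`). It also assumes the disk chunk size is a multiple of the heap chunk size (true for the 100MB and 2MB defaults), so a heap chunk never straddles two disk chunks. With the defaults above, the disk tier holds roughly 2,000 chunks and the heap tier roughly 500.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongFunction;

/** Hypothetical sketch of a two-tier, read-through LRU chunk cache (not the PR's implementation). */
final class TwoTierChunkCacheSketch {

    /** Minimal LRU: a LinkedHashMap in access order that drops the eldest entry past maxEntries. */
    static final class Lru<K, V> extends LinkedHashMap<K, V> {
        private final int maxEntries;

        Lru(int maxEntries) {
            super(16, 0.75f, true); // true = iterate in access order (least recently used first)
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxEntries;
        }
    }

    private final int diskChunkSize;   // e.g. 100MB in the PR; must be a multiple of heapChunkSize
    private final int heapChunkSize;   // e.g. 2MB in the PR
    private final Lru<Long, byte[]> diskTier; // stand-in for chunk files on local disk
    private final Lru<Long, byte[]> heapTier; // small on-heap tier
    private final LongFunction<byte[]> remoteFetch; // stand-in for a ranged S3 read of one disk chunk
    int remoteFetches = 0; // exposed only so the sketch can be observed

    TwoTierChunkCacheSketch(int diskChunkSize, int maxDiskChunks,
                            int heapChunkSize, int maxHeapChunks,
                            LongFunction<byte[]> remoteFetch) {
        if (diskChunkSize % heapChunkSize != 0) {
            throw new IllegalArgumentException("diskChunkSize must be a multiple of heapChunkSize");
        }
        this.diskChunkSize = diskChunkSize;
        this.heapChunkSize = heapChunkSize;
        this.diskTier = new Lru<>(maxDiskChunks);
        this.heapTier = new Lru<>(maxHeapChunks);
        this.remoteFetch = remoteFetch;
    }

    /** Reads one byte at an absolute file position, filling both tiers on a miss. */
    byte readByte(long pos) {
        long heapIdx = pos / heapChunkSize;
        byte[] heapChunk = heapTier.get(heapIdx);
        if (heapChunk == null) {
            heapChunk = loadHeapChunk(heapIdx);
            heapTier.put(heapIdx, heapChunk);
        }
        return heapChunk[(int) (pos % heapChunkSize)];
    }

    /** Copies one heap-sized slice out of the disk chunk that contains it. */
    private byte[] loadHeapChunk(long heapIdx) {
        long start = heapIdx * heapChunkSize;
        long diskIdx = start / diskChunkSize;
        byte[] diskChunk = diskTier.get(diskIdx);
        if (diskChunk == null) {
            remoteFetches++;
            diskChunk = remoteFetch.apply(diskIdx);
            diskTier.put(diskIdx, diskChunk);
        }
        int offset = (int) (start - diskIdx * diskChunkSize);
        // The final chunk of a file may be shorter than a full heap chunk.
        int end = Math.min(offset + heapChunkSize, diskChunk.length);
        return Arrays.copyOfRange(diskChunk, offset, end);
    }
}
```

A read that misses the heap tier but hits the disk tier costs only a local copy; only a miss in both tiers goes back to the remote store. Unlike this sketch, Caffeine is thread-safe and supports weight-based size limits.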
If long-term performance is satisfactory, we should consider moving the cache configs to the global config, as well as using the data directory config instead of the temp directory.
This was performance tested with the above cache configs and the following pod and JVM settings:
More benchmarking should be performed to validate these configs on a variety of cluster configurations, as they are optimized for small clusters deployed on a heavily over-subscribed Kubernetes cluster.