echeipesh opened this issue 7 years ago
Hi all @ GeoTrellis,
I recently submitted a proposal to the Eclipse Foundation for another project. However, I ran across this project and it interests me. Before going back to school for CS, I worked as a US Merchant Marine officer and used raster charts all the time in our electronic chart display and information systems.
I am very curious about geospatial data processing, as I see a number of good applications for it in my former field. I'd like to submit a GSoC proposal for this project. Is this an issue that still actively needs work? I'd have to do a bit more research on the Spark and HDFS APIs to make a proposal, but I do have a fundamental understanding of the problem that might need to be solved.
Best, Oren
@echeipesh To what degree does this get addressed with CoGs? Our number 1 or 2 performance problem right now is S3 I/O...
Hi @orenwf,
Unfortunately, we're not participating in GSoC through the Eclipse group this year. It's great that you are interested in contributing to the project, though. Thank you for reaching out!
@metasim IMO this is totally addressed with COGs, and should be closed now that the Structured COGs PR is merged in.
Often the size of a tile (256x256) produces a record size that is smaller than is efficient for storage. A good example of this is S3, where a significant portion of request processing time is spent on key lookup; as a consequence, requests for very small keys (like 256KB, roughly the size of an uncompressed 256x256 single-band float tile) take about as much time as larger keys (up to a megabyte or so). In effect we need to be able to store larger records/tiles.
Some related benchmarking shows this effect here: https://gist.github.com/echeipesh/b3880b5087ea184870d5
Other applications that deal with raster data employ the idea of a "metatile" to cope with this problem. A metatile is a tile of tiles; it makes the records being stored larger and thus more efficient.
S3 is an important storage use case; reducing the number of requests (and thus cost) and increasing throughput can have a big impact. So this feature focuses on S3 performance specifically, but an acceptable implementation should generalize well to other backends.
There are a couple of possible ways to achieve this:
**1. Use HDFS MapFile layer from `HadoopLayerWriter`**

HDFS MapFiles seem like they would be applicable here: they store multiple records in sorted order with an index, so they provide reasonably efficient random access. I've made an attempt at testing this out and ran into the following problems:

- `s3a` filesystem support requires that the AWS access and secret keys be provided in the Hadoop configuration; it does not directly support using an IAM profile for authentication. Some code will have to be written to get tokens from the IAM profile and update the Hadoop configuration for the related job.
- The `hadoop-s3` dependency is no longer part of the standard distribution, and bringing it in caused some conflicts on EMR deployments. These will have to be worked through.

Overall this is a tempting option because it is very general. First steps would be to create a quick and dirty use case and benchmark the performance that this can deliver.
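For reference, a minimal sketch of what writing and reading SFC-indexed tile bytes through the Hadoop `MapFile` API could look like. The path, key layout, and dummy records are assumptions for illustration, not existing GeoTrellis code:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, LongWritable, MapFile}

object MapFileSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val dir  = new Path("/tmp/layer-mapfile") // placeholder location

    // MapFile requires keys to be appended in sorted order; it keeps a
    // sparse index over the underlying SequenceFile for random access.
    val writer = new MapFile.Writer(
      conf, dir,
      MapFile.Writer.keyClass(classOf[LongWritable]),
      MapFile.Writer.valueClass(classOf[BytesWritable]))

    // Dummy (sfcIndex, encodedTile) records, pre-sorted by index.
    val records: Seq[(Long, Array[Byte])] =
      Seq(1L -> Array[Byte](1, 2, 3), 42L -> Array[Byte](4, 5, 6))

    records.foreach { case (index, bytes) =>
      writer.append(new LongWritable(index), new BytesWritable(bytes))
    }
    writer.close()

    // Random access: binary-search the index, then seek in the data file.
    val reader = new MapFile.Reader(dir, conf)
    val value  = new BytesWritable()
    reader.get(new LongWritable(42L), value) // null result means key absent
    reader.close()
  }
}
```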
**2. Store neighboring records when writing GeoTrellis layers**

In fact, when GeoTrellis writes a record it already writes an `Array[(K, V)]` to cover the cases where the space filling curve does not have sufficient resolution and multiple `(K, V)` pairs collide, mapping to one SFC index. This typically happens with spatio-temporal records when multiple records exist per quantum of time. This can be exploited to coarsen the SFC and store multiple, spatially neighboring, `(K, V)` pairs in the same record. This is attractive because it will readily generalize across all current backends. However, special attention will need to be paid to caching in the `*ValueReader`s, to make sure that subsequent requests for two records that reside in the same "file" do not cause two fetches of that file.
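A minimal sketch of the coarsening idea, assuming a Z-order-style index where dropping the 2n low bits merges a 2^n x 2^n block of neighboring tiles into one storage key (the `coarsen`/`groupRecords` names are hypothetical, not existing GeoTrellis API):

```scala
// Drop the low bits of the SFC index so neighboring tiles share a storage
// key. For a Z-order curve, shifting by 2 * n merges a 2^n x 2^n tile block.
def coarsen(sfcIndex: Long, bits: Int): Long = sfcIndex >> bits

// Regroup individual (K, V) records into the larger Array[(K, V)] records
// that would be written under each coarsened index.
def groupRecords[K, V](
    records: Seq[(K, V)],
    sfcIndex: K => Long,
    bits: Int
): Map[Long, Array[(K, V)]] =
  records
    .groupBy { case (key, _) => coarsen(sfcIndex(key), bits) }
    .map { case (index, kvs) => index -> kvs.toArray }
```

On read, a `*ValueReader` would coarsen the requested key the same way, fetch (and cache) the whole record, and pick the matching `(K, V)` out of the array.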
**3. Store larger tiles (really values)**

A more direct way to support this is to store larger tiles, for instance 2048x2048 tiles instead of 256x256. This would cause the spatial component of the key index to be shifted, because there would be fewer tiles covering the same area. There would have to be a wrapper around the layer reader and value reader that would translate requests for tiles of the desired size into the key space of the larger tiles, fetch the tile, split it up, and translate the record keys.

The good part of this solution is that the reader adapter could be useful in cases where tiles are stored in differing sizes for other reasons, providing a way to inter-operate on layers with different spatial resolution. However, for translations that are not powers of 2 (ex: getting a 256 tile from a 512 tile), multiple records will need to be fetched.

Overall this is likely the least general approach, and it deals more with working with tiles of different resolutions than with optimizing IO. GeoTrellis has moved away from the idea of storing tiles toward the idea of storing spatially indexed `V`s, which may or may not be tiles. So this method would not apply to cases where the records being stored cannot be joined or split as easily as tiles.
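A sketch of the key and pixel-window translation such a reader adapter would perform, assuming stored 2048x2048 tiles and requested 256x256 tiles (the `SpatialKey` shape mirrors GeoTrellis's column/row key, but the adapter functions here are hypothetical):

```scala
// Column/row key of a tile in a layout grid (mirrors GeoTrellis SpatialKey).
final case class SpatialKey(col: Int, row: Int)

// factor = storedTileSize / requestedTileSize, e.g. 2048 / 256 = 8.
// The requested key maps onto the enclosing larger-tile key...
def toStoredKey(requested: SpatialKey, factor: Int): SpatialKey =
  SpatialKey(requested.col / factor, requested.row / factor)

// ...and onto a pixel window within the fetched tile, which the adapter
// would crop out and return under the originally requested key.
def pixelWindow(requested: SpatialKey, factor: Int, tileSize: Int): (Int, Int, Int, Int) = {
  val colMin = (requested.col % factor) * tileSize
  val rowMin = (requested.row % factor) * tileSize
  (colMin, rowMin, colMin + tileSize - 1, rowMin + tileSize - 1)
}

// Example: 256px tile (19, 5) lives in stored tile (2, 0),
// window (768, 1280, 1023, 1535).
```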