locationtech / geotrellis

GeoTrellis is a geographic data processing engine for high performance applications.
http://geotrellis.io

GeoTrellis Roadmap #1565

Closed lossyrob closed 4 years ago

lossyrob commented 8 years ago

We need a document that describes future GeoTrellis features which the project plans to implement, even though work on them might not have started yet. This issue is a place to work through some ideas for the roadmap. Please comment with ideas and feedback.

Below is a start of some ideas to seed discussion:

Vectors in Spark

In the 0.10 release, we gained a lot of functionality around loading, saving, and processing large raster data in Spark. GeoTrellis will implement similar functionality for vector data.

Vector Tiles

Work has already kicked off on creating a Vector Tile codec, so that we will be able to read and write vector data as vector tiles. Since vector tiles can be keyed the same way as raster data, this will allow for analysis between large vector and raster datasets. For instance, if we have OpenStreetMap data loaded as vector tiles, some things we could do are:

Some distributed vector operations we would like to implement in GeoTrellis are:

Using computer vision at scale to derive information out of satellite imagery, and using machine learning at scale against large raster datasets has long been a goal of the GeoTrellis project. Here are some ideas around using these techniques to produce valuable results:

One large request we hear often is: how do we use GeoTrellis without having to learn Scala? As a Scala developer and fan of the language, my wish is of course for our user base to embrace Scala as a wonderful language for coding geospatial applications, but I completely understand the need for us to move toward tooling that will be more generally useful to a wider audience in the open source geospatial community. SparkSQL can be the answer: if we develop a set of User Defined Functions and User Defined Types (UDFs and UDTs) in SparkSQL that allow users to execute PostGIS-like functions against their ingested geospatial data, we would open up the user base to anyone who knows SQL, which is a much larger slice of the pie than people who know Scala. This will be quite an undertaking, but it is one of the efforts on the roadmap that I'm most excited about.
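To make the idea concrete, here is a minimal sketch of the kind of pure predicate an `st_contains`-style UDF would wrap. In Spark it would be registered via something like `spark.udf.register`; that registration is omitted here, and the object and method names below are illustrative, not an actual GeoTrellis API:

```scala
// Hypothetical sketch: the pure predicate behind a PostGIS-style
// st_contains(polygon, point) UDF. Names are illustrative.
object GeomUdfs {
  type Point = (Double, Double)

  // Ray-casting point-in-polygon test: count how many polygon edges a
  // horizontal ray from `p` crosses; an odd count means "inside".
  def contains(ring: Seq[Point], p: Point): Boolean = {
    val (px, py) = p
    var inside = false
    var j = ring.length - 1
    for (i <- ring.indices) {
      val (xi, yi) = ring(i)
      val (xj, yj) = ring(j)
      // Does edge (j, i) straddle the ray's y, and cross it right of p?
      if ((yi > py) != (yj > py) &&
          px < (xj - xi) * (py - yi) / (yj - yi) + xi)
        inside = !inside
      j = i
    }
    inside
  }
}
```

A SQL user would then write `SELECT * FROM parcels WHERE st_contains(geom, point)` without ever touching Scala.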

Also, Spark is generally moving to DataSets and DataFrames. That is the buzz around Spark 2.0: don't use RDDs, use DataSets and DataFrames. The problem is that our geospatial use case is complex enough to warrant the low-level control that RDDs provide. We have also built up some really great functionality on the RDD API, and we will be supporting that for the foreseeable future. However, if the main focus of future Spark development is moving toward DataSets and DataFrames, then it's worth starting to explore how we can support users doing geospatial work on those types of distributed collections. This ties in heavily with the SparkSQL work, since working with SparkSQL necessarily means dealing with DataFrames.

Python

Along with SparkSQL, we can use Python bindings as a way to expose GeoTrellis functionality to users who are not in a Scala environment. Spark has its own method for accessing Spark through Python (the very successful PySpark), and there's a lot of potential in following their example. There is already some work by @shiraeeshi to expose GeoTrellis layers as PySpark RDDs of rasterio rasters, which can be found here: https://github.com/geotrellis/geotrellis/pull/1459

Integrating with other open source projects

fosskers commented 8 years ago

geotrellis-vectortile is being written in such a way as to allow for multiple codec backends (Protobuf, GeoJSON, etc.). Protobuf is the original, and so will be presented as the recommended default.
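A minimal sketch of what a backend-agnostic codec contract could look like, assuming a simple encode/decode interface (the trait and class names here are illustrative, not the actual geotrellis-vectortile API):

```scala
// Hypothetical codec interface: Protobuf, GeoJSON, etc. backends would
// each provide their own implementation of the same contract.
trait VectorTileCodec[T] {
  def encode(tile: T): Array[Byte]
  def decode(bytes: Array[Byte]): T
}

// Toy backend standing in for a real Protobuf or GeoJSON codec,
// round-tripping a String payload through UTF-8 bytes.
class Utf8Codec extends VectorTileCodec[String] {
  def encode(tile: String): Array[Byte] = tile.getBytes("UTF-8")
  def decode(bytes: Array[Byte]): String = new String(bytes, "UTF-8")
}
```

Code that is written against the trait stays unchanged when the backend is swapped.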

For parsing massive GeoJSON, we will need to look into some of the streaming JSON libraries, as spray-json won't exactly cut it here. @moradology ?

timothymschier commented 8 years ago

Hi Rob, thanks for soliciting input.

echeipesh commented 8 years ago

Reading Large Files

@timothymschier really good point about reading in large files. It seems the only reasonable solution is to have the ability to do windowed reads. @MyNameIsBouf is currently pushing this feature for our GeoTiff reader, and that will capture a lot of cases, but I imagine there are other formats that are of interest.

It would be good to have some level of abstraction here, maybe even a thin service running closer to the data that we can hit to pull partial images from. As far as I understand, OPeNDAP serves a similar purpose in the scientific community. I am eager to get more input on this issue.
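A minimal sketch of the windowed-read abstraction being described, assuming illustrative names (this is not the actual GeoTrellis reader API): a reader exposes the full grid dimensions but only materializes a requested pixel window, so a huge file never needs to be loaded whole. A real backend could be a GeoTiff segment reader, an HTTP range-request client, or an OPeNDAP endpoint.

```scala
// Hypothetical windowed-read interface over a single-band raster.
trait WindowedReader {
  def cols: Int
  def rows: Int
  // Read only the cells inside [colMin, colMax] x [rowMin, rowMax], inclusive.
  def read(colMin: Int, rowMin: Int, colMax: Int, rowMax: Int): Array[Int]
}

// Toy in-memory backend standing in for a real file- or network-based one.
class InMemoryReader(data: Array[Int], val cols: Int, val rows: Int)
    extends WindowedReader {
  def read(colMin: Int, rowMin: Int, colMax: Int, rowMax: Int): Array[Int] =
    (for {
      r <- rowMin to rowMax
      c <- colMin to colMax
    } yield data(r * cols + c)).toArray
}
```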

GeoTrellis Vector

To expand on this issue: we're not very interested in storing and indexing vector data; that space is pretty well covered by GeoWave and GeoMesa. I think the only implementation worth talking about is something pretty basic that gives a deployment option without Accumulo. With vector data, however, there is much more interest in querying based on feature attributes, which greatly complicates the problem. In short, other people already do this better.

What we're interested in is doing vector operations efficiently in distributed memory, essentially in Spark RDDs/DataSets...

SpatialJoins

Thinking through the use-cases very broadly:

But given these operations with reasonable efficiency, you would have quite a lot of flexibility in what you could do with vector datasets in memory.
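The partitioning idea behind a distributed spatial join can be sketched in miniature: key each geometry's bounding box to the grid cells it touches, group by cell, and only compare geometries that share a cell. In Spark the grouping would be a shuffle over an RDD or DataSet; plain collections are used here to keep the sketch self-contained, and all names are illustrative.

```scala
object SpatialJoin {
  case class Extent(xmin: Double, ymin: Double, xmax: Double, ymax: Double) {
    def intersects(o: Extent): Boolean =
      xmin <= o.xmax && o.xmin <= xmax && ymin <= o.ymax && o.ymin <= ymax
  }

  // All (col, row) grid cells of size `cell` that an extent touches.
  def cells(e: Extent, cell: Double): Seq[(Int, Int)] =
    for {
      c <- (e.xmin / cell).floor.toInt to (e.xmax / cell).floor.toInt
      r <- (e.ymin / cell).floor.toInt to (e.ymax / cell).floor.toInt
    } yield (c, r)

  // Join two feature sets by bounding-box intersection, comparing only
  // pairs that share at least one grid cell (the "shuffle" step).
  def join[A, B](left: Seq[(A, Extent)], right: Seq[(B, Extent)],
                 cell: Double): Set[(A, B)] = {
    val rightByCell: Map[(Int, Int), Seq[(B, Extent)]] =
      right.flatMap { case (b, e) => cells(e, cell).map(_ -> (b, e)) }
        .groupBy(_._1).map { case (k, v) => (k, v.map(_._2)) }
    (for {
      (a, ea) <- left
      c       <- cells(ea, cell)
      (b, eb) <- rightByCell.getOrElse(c, Nil)
      if ea.intersects(eb)
    } yield (a, b)).toSet
  }
}
```

Real predicates (touches, contains, within distance) would replace the bounding-box test after the cell-level pruning.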

VectorTiles

These, I think, are mostly of interest for analysis. I think it would be ideal to find some interface that abstracts over raster and vector tiles, given their cell-driven nature. Uses that come to mind:

Topology/TopoJSON support

Mentioned here: https://github.com/geotrellis/geotrellis/issues/1289

Specifically, as we manipulate/re-project/simplify sets of geometries, topology as a Scala datatype would be very useful to avoid common pitfalls like gaps appearing on touching edges.
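A minimal sketch of what such a datatype could look like, following the TopoJSON model (all names here are illustrative, not a proposed API): geometries reference shared arcs by index instead of duplicating coordinates, so simplifying an arc changes both neighbours at once and touching edges cannot drift apart.

```scala
object Topology {
  type Coord = (Double, Double)

  case class Arc(points: Vector[Coord])
  // A ring is a sequence of arc indices; a negative index ~i means
  // arc i traversed in reverse (the TopoJSON convention).
  case class Ring(arcs: Vector[Int])

  case class Topo(arcs: Vector[Arc], rings: Vector[Ring]) {
    def ringCoords(r: Ring): Vector[Coord] =
      r.arcs.flatMap { i =>
        if (i >= 0) arcs(i).points else arcs(~i).points.reverse
      }
    // Rewrite one shared arc (e.g. simplify it); every ring that
    // references the arc stays consistent automatically.
    def mapArc(i: Int)(f: Vector[Coord] => Vector[Coord]): Topo =
      copy(arcs = arcs.updated(i, Arc(f(arcs(i).points))))
  }
}
```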

Machine Learning

Just want to echo the sentiment that our job is not to implement ML models but to provide a pipeline to train some models on imagery and then use trained models to segment the images and/or detect anomalies.

Ad-Hoc Raster Processing

Several features fall into one category: what happens when two raster layers don't share a layout? This can happen in several ways:

All three entail doing some kind of on-the-fly interpolation and specifying what resolution you want the output at. This is going to be critical for a robust ML pipeline, because we will need to draw features from layers that exist for different reasons, covering slightly different areas at different resolutions. We can expect features to actually be a combination of multiple such mismatched layers.
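The core of such on-the-fly interpolation can be sketched as a nearest-neighbour resample from a source tile into a target grid of a different resolution (names are illustrative, not the actual GeoTrellis resample API):

```scala
object Resample {
  // Resample a srcCols x srcRows tile into a dstCols x dstRows tile by
  // nearest-neighbour lookup.
  def nearest(src: Array[Int], srcCols: Int, srcRows: Int,
              dstCols: Int, dstRows: Int): Array[Int] = {
    val out = new Array[Int](dstCols * dstRows)
    for (r <- 0 until dstRows; c <- 0 until dstCols) {
      // Map the centre of each target cell back into source grid space.
      val sc = ((c + 0.5) * srcCols / dstCols).toInt.min(srcCols - 1)
      val sr = ((r + 0.5) * srcRows / dstRows).toInt.min(srcRows - 1)
      out(r * dstCols + c) = src(sr * srcCols + sc)
    }
    out
  }
}
```

A real pipeline would swap in bilinear or cubic interpolation and map between geographic extents rather than raw grid indices, but the structure is the same.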

This task is actually aided by the work on vector spatial joins, because you're probably going to be down to working with Feature[Raster[Tile]] types and doing a large spatial join to figure out which tile intersects which.

Taking this to its logical conclusion, it should allow us to do operations on rasters that have not been tiled at all, maybe rasters that have just been read from S3/HDFS/HTTP and at most chunked. We are certain that such operations will be slower than operations on pre-tiled datasets, but for ad-hoc analysis a single such join is very likely to take less time than the combined time of (tiling + operation).

It's unclear how valuable this feature is in itself, but it's pretty clear that it is partially required for the work on ML pipelines and spatial joins. I would say that framing that work in terms of such a general feature would produce a better API and would also give us this useful feature nearly as a side effect.

Raster Stream Processing

The direct feature here is a streaming ingest. The current situation is that if, during ingest, the tiling shuffle exceeds the available memory, the process is very likely to crash. What would be ideal is if, instead of crashing, we were able to process the ingest incrementally, albeit more slowly. This obviously requires partial processing/persist cycles in the middle of the ingest process. As a side effect, this may make the ingest process recoverable, since recording partial results is essentially checkpointing.
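The incremental shape of this could look roughly as follows, assuming a hypothetical `persist` callback standing in for a real layer writer (names are illustrative): tiles are consumed in fixed-size batches and each batch is made durable before the next one starts, so a crash loses at most one batch and the completed batches act as checkpoints.

```scala
object IncrementalIngest {
  // Process tiles in batches of `batchSize`, persisting each batch
  // before the next begins. Returns the number of batches written.
  def ingest[K, V](tiles: Iterator[(K, V)], batchSize: Int)
                  (persist: Seq[(K, V)] => Unit): Int = {
    var batches = 0
    tiles.grouped(batchSize).foreach { batch =>
      persist(batch) // checkpoint: partial results are durable here
      batches += 1
    }
    batches
  }
}
```

In the distributed setting the interesting (and hard) part is making the shuffle itself partial; this sketch only shows the persist/resume granularity the comment is asking for.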

Can RDD mechanics always save us from this crash? I don't think so. Given a fixed cluster size, and assuming perfect partitioning, we will still run out of memory and eventually spill space while working on a large enough shuffle. We essentially need a partial shuffle, which is very much an application-level concept.

Overall this can also be treated as a problem of trading available computational resources for processing time, which is something computers do all the time; we just need to do it in the distributed case.

Additionally, stream processing lines up nicely with an event notification pipeline: you may have some imagery coming in, and you want to compare it with previously saved imagery, detect any changes, flag anomalies, record the data, and issue notifications. Implied here is doing this in some kind of fixed-resource environment.

Another critical aspect of supporting a streaming case is that it allows for gradual scaling. The ability for a process to continue allows the opportunity to inspect the backlog of work and scale the cluster without interruption. This is theoretically possible while executing a normal RDD operation, but I don't know of any mechanism to anticipate the "bulges" in processing pipelines which will cause a job to blow out.

It would be very helpful to have feedback on this issue in particular; I feel there are people who may have learned these lessons well ahead of us.

lossyrob commented 8 years ago

@timothymschier thanks for the input! Some responses

lossyrob commented 8 years ago

Another thing I would like to add to the roadmap:

Collection API

Currently we are able to read single tiles out of the various backends through ValueReaders. If we need to read multiple tiles based on a query, we then need to read out an RDD. However, there are some use cases where you want to read a number of tiles from a backend where those tiles will definitely fit into memory on the executing machine. In this case, using RDDs and going through Spark is a lot of overhead. Ideally, we would have backend readers that read out collections of tiles, which can then be worked with via the normal map algebra methods, etc.
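The shape of such a reader could be sketched like this, with illustrative names (not a proposed GeoTrellis API): a query returns a plain in-memory collection instead of an RDD, so small reads skip Spark entirely.

```scala
// Hypothetical collection-based layer reader, alongside the RDD-based one.
trait CollectionLayerReader[K, V] {
  def query(pred: K => Boolean): Seq[(K, V)]
}

// Toy backend over an in-memory map; a real implementation would hit
// Accumulo, Cassandra, S3, etc., using the same kind of key predicate.
class InMemoryLayerReader[K, V](layer: Map[K, V])
    extends CollectionLayerReader[K, V] {
  def query(pred: K => Boolean): Seq[(K, V)] =
    layer.toSeq.filter { case (k, _) => pred(k) }
}
```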

lossyrob commented 8 years ago

We should also add MrGeo integration to the list.

fosskers commented 8 years ago

@lossyrob re: Collection API. Does that mean obfuscating RDD[(K,V)] with Metadata[M] behind some unified GeoTrellis collection type? GeoSeq or otherwise.

lossyrob commented 8 years ago

@fosskers that obfuscation would lead to a lot of unfortunate consequences. For example, if we were to hide the fact that a GeoSeq was an RDD, we would lose all of the RDD methods on that instance, and being able to treat, for example, a TileLayerRDD[K] as a regular RDD is a convenience that I would be hard pressed to give up. So we get some duplication with the Collections API being separate, but I think if we stick to conventions about what methods are available and how they are called, we can at least have them seem like the same sort of type without having to bake it into the type system.
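The "same conventions, separate types" approach can be illustrated with an extension-method sketch (names are hypothetical): the same map-algebra-style method is offered on a plain Seq-backed layer via an implicit class, while in real GeoTrellis the RDD-backed layer would expose a method of the same name, so call sites read identically without hiding the underlying collection type.

```scala
object CollectionOps {
  // Extension methods on a Seq-backed "layer" of keyed values.
  implicit class TileLayerSeqMethods[K](val self: Seq[(K, Int)]) extends AnyVal {
    // Toy map algebra op: add a constant to every cell value.
    def localAdd(c: Int): Seq[(K, Int)] =
      self.map { case (k, v) => (k, v + c) }
  }
}
```

Because the result is still a plain Seq, none of the usual collection methods are lost, which is exactly the property the comment wants to preserve for RDDs.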

pomadchin commented 8 years ago

Machine Learning

ML is one of the things GT should support; maybe even by providing some models to train, at least to demonstrate how GT works with ML frameworks, etc. (to select fields on raster data); however, that can be done in "demo" projects as well.

Raster Stream Processing

re: @echeipesh's post: Does this mean potential Kafka support? (I see that it's rather popular; see the GeoMesa and GeoWave examples.)

lossyrob commented 8 years ago

Another item for the roadmap:

Integration with SFCurve

I think there's a lot of work that we could do on our SFC indexing, and I'd prefer to do that work collaboratively on the SFCurve project (https://github.com/locationtech/sfcurve). This would include a couple of changes to SFCurve:
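(For background, here is a minimal sketch of the kind of index a space-filling-curve library computes: a 2D Z-order, or Morton, index built by interleaving the bits of the column and row coordinates, so keys that are close in space tend to be close in the 1D index. This is an illustration only, not SFCurve's actual API.)

```scala
object ZCurve {
  // Spread the low 16 bits of x so there is a gap bit between each one.
  private def spread(x: Long): Long = {
    var v = x & 0xffffL
    v = (v | (v << 8)) & 0x00ff00ffL
    v = (v | (v << 4)) & 0x0f0f0f0fL
    v = (v | (v << 2)) & 0x33333333L
    v = (v | (v << 1)) & 0x55555555L
    v
  }
  // Interleave: column bits in even positions, row bits in odd positions.
  def index(col: Int, row: Int): Long =
    spread(col.toLong) | (spread(row.toLong) << 1)
}
```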

fosskers commented 8 years ago

How are those Array[Byte] used once calculated? Is arithmetic done on them, or are they essentially just DB row IDs?