GeoTrellis Roadmap - Githubissues

lossyrob commented 8 years ago

We need a document that describes future GeoTrellis features which are the project plans to implement, although work on them might not have yet started. This issue is to work through some ideas for the roadmap. Please comment with ideas and feedback.

Below is a start of some ideas to seed discussion:

Vectors in Spark

In the 0.10 release, we gain a lot of functionality around loading, saving, and processing large raster data in Spark. GeoTrellis will implement similar functionality for vector data.

Vector Tiles

Work has already kicked off on creating a Vector Tile codec, so that we will be able to read and write vector data as vector tiles. Since vector tiles can be keyed the same as raster data, this will allow for analysis between large vector and raster datasets. For instance: if we have Open Street Map data loaded as vector tiles, some things we could do are:

create elevation profiles for roads by combining that data with elevation raster data
flag where open street map may need to be updated due to changes detected through satellite imagery.
Distributed vector operations

Some distributed vector operations we would like to implement in GeoTrellis are:

Spatial Join
Distributed Rasterization
Computer vision / Machine Learning

Using computer vision at scale to derive information out of satellite imagery, and using machine learning at scale against large raster datasets has long been a goal of the GeoTrellis project. Here are some ideas around using these techniques to produce valuable results:

Feature extraction/Image segmentation of aerial imagery to derive vector datasets
Anomaly detection and object detection using supervised and unsupervised techniques across raster data.
Predict optimal siting via historical training data, e.g. business siting via economic data or crop planting via yield data.
Additional Raster capabilities
Streaming ingest
Working with layers across resolutions
Additional distributed operations:
- Viewshed
- Cost distance
- Hydrology operations
  Datasets, DataFrames and Spark SQL

One large request we hear often is: how do we use GeoTrellis while getting around having to learn Scala. As a Scala developer and fan of the language, of course my wish is to have our user base embrace Scala as a wonderful language to code geospatial applications in, but I completely understand the need for us to move towards tooling that will be more generally useful to a wider audience in the open source geospatial community. SparkSQL can be the answer: if we develop a set of User Defined Functions and User Defined Types (UDFs and UDTS) in SparkSQL, that allow users to execute PostGIS-like functions against their ingested geospatial data, we would open up the user base to anyone who knows SQL (which is a much larger slice of the pie than people that know Scala). This will be quite an undertaking, but one of the efforts on the road map that I'm most excited about.

Also, Spark is generally moving to DataSets and DataFrames. This is all the buzz around Spark 2.0: don't use RDDs, use DataSets and DataFrames. The problem with that is, our geospatial use case is complex enough to warrant the need for the low level control that RDDs provide. Also, we built up some really great functionality on the RDD API, and we will be supporting that for the foreseeable future. However, if the main focus of future Spark development is moving toward Data(Sets/Frames), then it's worth us to start exploring how we can support users doing geospatial on those types of distributed collections. This ties in heavily with the SparkSQL work, since if we are working with SparkSQL we are necessarily dealing with DataFrames.

Python

Along with SparkSQL, we can use Python bindings as a way to expose GeoTrellis functionality to users who are not in a Scala environment. Spark has it's own method for access Spark through Python (the very successful PySpark), and there's a lot of potential to take their example. There is already some work to expose GeoTrellis layers as PySpark RDDs of rasterio rasters, by @shiraeeshi, which can be found here: https://github.com/geotrellis/geotrellis/pull/1459

Integrating with other open source projects

GeoWave
GeoMesa
GeoServer

fosskers commented 8 years ago

geotrellis-vectortile is being written in such a way to allow for multiple codec backends (protobuf, geojson, etc). Protobuf is the original, and so will be presented as the recommended default.

For parsing massive GeoJSON, we will need to look into some of the streaming json libraries, as spray-json won't exactly cut it here. @moradology ?

timothymschier commented 8 years ago

Hi Rob, thanks for soliciting input.

A big but simple gap we see in GeoTrellis from a clean usability perspective is being able to ingest large files. We're working with specialist algorithms to generate large area mosaics, only to then have to turn around and piece them up in smaller files before we ingest them. If GT could even just do this cutting-up invisibly that would certainly make for a more professional user experience.
+1 on all of Python, SparkSQL and DataSets/DataFrames (in that order). Great core features to have in the library, all of them. Do you have an idea of how much DataSets/DataFrames may improve performance?
I would question a Vector implementation - is this different to GeoMesa/GeoWave, or would you be integrating directly with one of those to fill this gap?
More than writing specific CV/ML algorithms, I think making GT the place for people to plug in existing/try new algorithms would be a great path forward. E.g. Extending MLLib for raster, adding Caffe/et al integration, etc. Bring the ML/CV dev community to GT (rather than competing with some of the big frameworks). Similarly, I'm not sure if it's really GT layer of the stack, but somehow helping support GPU computation would be a boon as well.

echeipesh commented 8 years ago

Reading Large Files

@timothymschier really good point about reading in large files. It seems the only reasonable solution is to have ability to do windowed reads. @MyNameIsBouf is currently pushing this feature for our GeoTiff reader and that will capture a lot of cases, but I imagine the there are other formats that are of interest.

It would be good to have some level of abstraction here, maybe even a thin service that can run closer to the data that we can hit and pull partial images form. As far as I understand in scientific community OPeNDAP serves similar purpose. I am eager to get more input on this issue.

GeoTrellis Vector

To expand on this issue, we're not very interested in storing and indexing vector data, that space is pretty well covered by GeoWave and GeoMesa. I think the only implementation that can be talked about is something pretty basic, that gives a deployment option without Accumulo. However with vector data there is a lot more interest to query based on feature attributes which greatly complicates the problem. Other people do this better already in short.

What we're interested is doing vector operations efficiently in distributed memory, in spark RDD/DataSet essentially...

SpatialJoins

Thinking through the use-cases very broadly:

Transformation of vector data doesn't require special thought, we do this pretty well already
Joining large dataset of geometries (twitter points) to small (polygons of countries) can be done with broadcast join very efficiently.
Joining two large RDD/DataFrames of varied vectors (points/polygons/lines) is a trick I haven't seen yet.

But given these operations with reasonable efficiency you would be quite flexible with what you could do with vector datasets in memory.

VectorTiles

These I think are mostly of interest for analysis. I think it would be ideal to find some interface that abstracts over raster and vector tiles given their cell-driven nature. Uses that come to mind:

vector tiles in zonal operations (significantly more efficient)
vector tiles as output of feature extraction on raster tile
vector tiles as per-tile metadata format for raster tile (allows to carry metadata post tile merge efficiently)
vector tiles to provide features in machine learning pipeline

Topology/TopoJSON support

Mentioned here: https://github.com/geotrellis/geotrellis/issues/1289

Specifically as we manipulate/re-project/simplify sets of geometries topology as a scala datatype would be very useful to avoid common pitfalls with gaps appearing on touching edges.

Machine Learning

Just want to echo the sentiment that our job is not to implement ML models but to provide a pipeline to train some models on imagery and then use trained models to segment the images and/or detect anomalies.

Ad-Hoc Raster Processing

Several features that fall into one category: What happens when two raster layers don't share a layout. This can happen in several ways:

Raster layers don't share the same cell size
Raster layers don't share the same tile layout, offset grids
Raster layers don't share the same grid resolutions, pyramid case

All three entail doing some kind of on the fly interpolation and specifying what resolution you want the output at. This is going to be critical for a robust ML pipeline because we will need to draw features from layers that exist for different reasons, covering slightly different areas at different resolutions. We can expect features to be actually a combination of multiple such mismatched layers.

This task is actually aided by work on vector spatial joins, because you're probably going to be down to working Feature[Raster[Tile]] types and doing a large spatial join to figure out which tile interests which.

Taking this to the logical conclusion it should allow us to do operations on rasters that have not been tiled at all, maybe rasters that have just been read from S3/HDFS/HTTP and at most chunked. We are certain that such operations will be slower than operations on pre-tiled datasets, but for ad-hoc analysis the time combined time of (tiling + operation) is very likely to be less than such an ad-hoc join.

It's unclear how valuable this feature is within itself, but it's pretty clear that it is partially required for work on ML pipelines and spatial join. I would say that framing that work in terms of such a general feature would produce better API and would also give us this useful feature as nearly a side-effect.

Raster Stream Processing

The direct feature here is a streaming ingest. The current situation is that if during Ingest the tiling shuffle breaks the memory allocation the process is very likely to crash. What would be ideal is if instead of crashing we are able to process the ingest incrementally, albeit slower. This obviously requires partial processing / persist cycles in the middle of ingest process. As a side effect this may make the ingest process recoverable since recording partial result is essentially checkpointing.

Can RDD mechanics always save us from this crash? I don't think so. Given a fixed cluster size and assuming perfect partitioning we will still run out of memory and eventually spill space while working on large enough shuffle. We essentially need a partial shuffle, which is very much an application level concept.

Overall this can also be treated as a problem of trading available computational resources for processing time, which is something that computers do all the time, we just need to do it in the distributed case.

Additionally stream processing lines up nicely with event notification pipeline: You may have some imagery coming in, you want to compare it with previously saved imagery, detect and changes, flag anomalies, record the data and issue notifications. Implied here is doing this in some kind of fixed resource environment.

Another critical aspect of supporting a stream case is that it allows for gradual scaling. Ability for a process to continue, allows for opportunity to inspect the backlog of work and scale the cluster without interruption. This is theoretically possible while executing a normal RDD operation, but I don't know of any mechanism to anticipate these "bulges" in processing pipelines which will cause the job to blowout.

It would be very helpful to have feedback on this issue in particular, I feel there are people who may have learned lessons here well ahead of us.

lossyrob commented 8 years ago

@timothymschier thanks for the input! Some responses

Good point about ingesting large files. I think this should make it into the roadmap for sure. We have some work going on for streaming GeoTiff reading, and that work will continue into reading geotiffs via HTTP and S3 byte range reads (a la vsicurl). Would this cover your use case? e.g. are your large mosaics that you are generating well formed GeoTiffs with internal tile/strip segmentation?
As far as the Python - I had a good talk with @shiraeeshi about the direction of the PR. I think we would benefit from having someone who is itching to get that feature give us some use cases to guide development. Do you think you could give us some insight on how you would like to use GeoTrellis from Python in comments on issue #1403? Would be much appreciated.
For vector implementation - GeoMesa and GeoWave would be storage/query mechanisms to hydrate RDDs of vector data, and while there is some Spark analytic capabilities in those frameworks, I think we're well suited to have a range of Vector, Vector -> Raster, Raster -> Vector RDD analytic capabilities in Spark, whether you store your vector data in GeoMesa, GeoWave, a backend we support, or vector tiles, etc.
Good point on using GeoTrellis as a processing engine to hook into other ML libs. I think we would want to focus development of CV and ML on GeoTrellis around specific use cases, however the aim would be to figure out the best way to have GeoTrellis aide ML pipelines that need to include geospatial data.

lossyrob commented 8 years ago

Another thing I would like to add to the roadmap:

Collection API

Currently we are able to read single tiles out of the various backends through ValueReaders. If we need to read out multiple tiles based on a query, we then need to read out an RDD. However, there are some use cases where you want to read out a number of tiles from a backend, where the number of tiles will definitely fit into memory on the executing machine. In this case, using RDDs and going through Spark is a lot of overhead. Ideally, we would have backend readers that would read out collections of tiles, which can then be worked with via the normal map algebra, etc methods.

lossyrob commented 8 years ago

We should also add MrGeo integration to the list.

fosskers commented 8 years ago

@lossyrob re: Collection API. Does that mean obfuscating RDD[(K,V)] with Metadata[M] behind some unified Geotrellis collection type? GeoSeq or otherwise.

lossyrob commented 8 years ago

@fosskers that obfuscation would lead to a lot of unfortunate consequences; for example, if we were to hide the fact that a GeoSeq was an RDD, we would lose all of the RDD methods on that instance, and being able to treat a TileLayerRDD[K] for example as a regular RDD is a convenience that I would be hard pressed to give up. So we get duplication with the Collections API being separate, but I think if we stick to conventions about what methods are available and how they are called we can at least have them seem like the same sort of type, without having to bake it into the type system.

pomadchin commented 8 years ago

Machine Learning

ML is one of things GT should support; mb even to provide some models to train, at least to demonstrate how GT works with ML frameworks / etc (to select fields on raster data); however that can be done in "demo" projects as well.

Raster Stream Processing

re to @echeipesh post: Does it mean a potential Kafka support? (i see that it's rather popular, GeoMesa and GeoWave examples)

lossyrob commented 8 years ago

Another item for the roadmap:

Integration with SFCurve

I think there's a lot of work that we could do on our SFC indexing, and I'd prefer to do that work collaboratively on the sfcurve project (https://github.com/locationtech/sfcurve). This would include a couple of changes to SFCurve:

moving from Longs to Array[Byte], which would make GeoMesa code like this (https://github.com/locationtech/geomesa/blob/cc97b68ef3b5a765be0fe213bdc4f9758b903f2a/geomesa-accumulo/geomesa-accumulo-datastore/src/main/scala/org/locationtech/geomesa/accumulo/index/z3/XZ3WritableIndex.scala#L76) change from having to use Longs.toByteArray to just using the byte array
Looking through the GeoMesa and GeoWave binning/periodicity code, respectivly, and the split/partitioning code, respectively, to decide how to best implement those features in sfcurve, putting that feature in, and having GeoTrellis implement those features

fosskers commented 8 years ago

How are those Array[Byte] used once calculated? Is arithmetic done on them, or are they essentially just DB row IDs?

locationtech / geotrellis

GeoTrellis Roadmap #1565

Vectors in Spark

Vector Tiles

Distributed vector operations

Computer vision / Machine Learning

Additional Raster capabilities

Datasets, DataFrames and Spark SQL

Python

Integrating with other open source projects

Reading Large Files

GeoTrellis Vector

SpatialJoins

VectorTiles

Topology/TopoJSON support

Machine Learning

Ad-Hoc Raster Processing

Raster Stream Processing

Collection API

Machine Learning

Raster Stream Processing

Integration with SFCurve