locationtech / geotrellis

GeoTrellis is a geographic data processing engine for high performance applications.
http://geotrellis.io

Toward GeoTrellis 2.0 #2341

Closed fosskers closed 4 years ago

fosskers commented 7 years ago


Recent work with spin-off libraries, Raster Foundry, and GeoTrellis itself has made us aware that we're nearing a "major version precipice". We've been holding back a few PRs, knowing that they're necessary yet API-breaking. And we've found a few quirks with the project structure, the canonical one being:

Why do I have to depend on spark-core just to get at SpatialKey?

So, if we need to tighten our overalls, hit the keyboard, and break the API, what's the best way to go about it? How can we get the most bang for our buck if we're going to be incrementing our major version? This doc serves as a launching point for that discussion and subsequent work.

1 geotrellis-core

@echeipesh and I had a conversation in early May when he was trying to find a home for a sum type which didn't trivially belong to vector, raster, or spark. We thought it might be time to revive the idea of a separate geotrellis-core (or similarly named) package which would always sit at the bottom of any dependency graph involving GeoTrellis.

More recently, while working on a PR involving MapKeyTransform, certain intuitive-seeming transformation methods weren't possible to write, given the way the various GT libraries depend on each other. For example:

Extent.toGridBounds: LayoutDefinition => GridBounds
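
As a sketch of what such an extension might look like (hypothetical code, not the current API; it assumes LayoutDefinition's MapKeyTransform can cover an Extent with keys):

/* Hypothetical sketch: Extent (geotrellis-vector) cannot currently see
 * LayoutDefinition (geotrellis-spark), so this extension has nowhere to live.
 * With a shared core, it could sit in a package/syntax object. */
implicit class ExtentLayoutOps(val self: Extent) extends AnyVal {
  def toGridBounds(layout: LayoutDefinition): GridBounds =
    layout.mapTransform(self)   // the keys of `layout` that cover this extent
}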

A contributor @metasim has chimed in with his support for this idea.

In general, a geotrellis-core package would be a place for the most central types to live, regardless of whether the user is interested in vectors, rasters, vectortiles, or Spark.

1.1 Naming

Suggestions:

  • geotrellis-core (Cats/ScalaZ/Circe/Monocle convention)
  • geotrellis-types (Haskell convention)
  • geotrellis-common (Java convention)

1.2 Importing

/* To the point. Python/Cats/Circe/Haskell-like. Says "gimme it all, I trust you." */
import geotrellis._

/* True to the package name. */
import geotrellis.core._

1.3 Contents

I did a quick scan of the Scaladocs during my initial conversation with Eugene. I suggested the following be moved to a central package:

  • Boundable
  • Bounds
  • EmptyBounds
  • KeyBounds
  • LayerId
  • Metadata
  • SpatialKey
  • SpaceTimeKey
  • TemporalKey
  • TileLayerMetadata
  • All CRSs? (from proj4)
  • All KeyIndex types and indexing functionality
  • LayoutDefinition and friends
  • Extent

I'm likely missing some. If you notice any, please let me know and I'll update this list. Anything that can realistically be used regardless of vector/raster/spark could be here.

1.4 Migration

If we adopt import geotrellis._, then the migration guide may amount to:

Add import geotrellis._ to any source file that no longer compiles. You may be able to remove a number of imports that are no longer necessary.
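
Concretely, for a type like SpatialKey (which today lives under geotrellis.spark), that would look something like:

/* GeoTrellis 1.x */
import geotrellis.spark.SpatialKey

/* Under this proposal */
import geotrellis._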

2 Discoverability

This is an effort to address one of our team goals for the year - improving "discoverability" in GeoTrellis.

We often get questions like the following from both within Azavea and without:

How do I do <common GIS op> in GeoTrellis?

or more frustrating:

How do I import Foo in GeoTrellis?

or the most frustrating:

Q: I followed examples but my code just doesn't work. Help?

A: Probably missing implicits. Just import geotrellis.foo.bar.baz._ and you're good to go.

We can take a lesson from a library whose functionality is hard to discover and has nightmarish imports: ScalaZ.

A suggested definition of discoverable:

A library's API/functionality is discoverable if users can locate types, functions/methods, and usage examples without asking humans for help and without looking at that library's code directly.

Assumptions:

  • Poor discoverability in Scala libraries is not inevitable.
  • An API with smaller "surface area" is more discoverable.
  • API breaks in pursuit of smaller surface area are okay.

2.1 Reducing duplication

A number of our classes have heavily overloaded methods, where the inputs vary only slightly. Many of these overloads perform simple A -> B pure transformations before eventually calling down to the "real" form of the method. Here is ColorRamp.toColorMap as an example. Which is the "real" one?

Sets of overloads like these can make the call tree hard to follow for developers. Since they're also often undocumented, the question can arise in a user's mind, "what's the difference?" and they're forced to guess or ask us.

If these assumptions are true, would it be prudent to reduce the number of such overloads, putting the burden of the A -> B transformation onto the user? Exceptions can of course be made; one that comes to mind is one of the ways to construct a Polygon:

Polygon((0,0), (0,1), (1,1), (1,0), (0,0))

Having to prefix each point with a Point constructor would get verbose. Then again, how often do we manually create Polygons like this, outside of tests?
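
For reference, here are the two forms side by side (the tuple version leans on an implicit tuple-to-Point conversion):

/* With the convenience overload and implicit tuple conversion: */
Polygon((0,0), (0,1), (1,1), (1,0), (0,0))

/* What users would write if only the Point-based form remained: */
Polygon(Point(0,0), Point(0,1), Point(1,1), Point(1,0), Point(0,0))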

Here are a few candidates for reducing duplication:

  • PngRenderMethods.renderPng
  • MapKeyTransform methods
  • ColorRamp.toColorMap

2.2 Avoiding method injection where deemed unnecessary

We have a vast, undocumented mechanism for method injection involving 411 classes/traits, following the naming patterns *Methods and with*Methods. They can be a source of confusion, as they add a layer of indirection when tracking down where a method is actually defined. That complicates imports and slows down development.

A number of these sets of methods (likely not all) could live directly in the class they operate on. This would remove a huge swath of symbols from the Scaladocs, and would likely decrease compile time.

Examples:

  • ColorMethods
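
To make the pattern concrete, here is a simplified, self-contained illustration; the types below are toys, not the real GeoTrellis definitions:

object InjectionSketch {
  /* Toy stand-ins for illustration only. */
  case class Tile(cells: Array[Int])
  trait MethodExtensions[T] { def self: T }

  /* The current style: methods live in a *Methods trait and are attached via
   * an implicit "with*Methods"-style wrapper that the user must import. */
  trait ColorMethods extends MethodExtensions[Tile] {
    def classify(threshold: Int): Tile =
      Tile(self.cells.map(c => if (c > threshold) 1 else 0))
  }
  implicit class withColorMethods(val self: Tile) extends ColorMethods

  /* The alternative: `classify` is defined directly on Tile, so there is no
   * extra import to discover and Scaladoc shows it in one place. */
}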

3 Easier SemVer iteration

If every symbol available in GeoTrellis is visible to the user through imports or documentation, then every symbol is "user facing", and every change has the potential to be API breaking. This makes it particularly hard to make sound calls regarding our versioning.

If we make a sweep of GT and draw a clear line in the sand between what is user-facing and what isn't, then whatever isn't can be hidden via private or stuffed into an internal subpackage (a Haskell convention that works well, and one I use in both VT and VP).

This should make it easier to land heavier changes without causing API breaks, which would let us increment versions more easily and keep a regular release cycle. It also has the benefit of reducing API surface area, since many symbols would be hidden from the Scaladocs.
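
A minimal sketch of both options, with hypothetical names:

/* Option 1: restrict visibility so the symbol never escapes GeoTrellis. */
package geotrellis.raster {
  private[geotrellis] object GridMath {               // hypothetical helper
    def sum(a: Array[Int], b: Array[Int]): Array[Int] =
      a.zip(b).map { case (x, y) => x + y }           // free to change between releases
  }
}

/* Option 2: an `internal` subpackage, documented as carrying no
 * compatibility guarantees between versions. */
package geotrellis.raster.internal {
  object GridMath {
    def sum(a: Array[Int], b: Array[Int]): Array[Int] =
      a.zip(b).map { case (x, y) => x + y }
  }
}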

@metasim

fosskers commented 7 years ago

There's a code example there for imports which didn't seem to survive the conversion from .org. Here is what I intended:

  /* To the point. Python/Cats/Circe/Haskell-like. Says "gimme it all, I trust you." */
  import geotrellis._

or

  /* True to the package name. */
  import geotrellis.core._

metasim commented 7 years ago

I raise my hand for geotrellis-core over geotrellis-types, as it leaves the cognitive door open for content that isn't just ADTs, such as commonly used pure functions.

fosskers commented 7 years ago

All typeclass instances for those types should also live in geotrellis-core. In GT's case this mainly means the JSON codecs.

Crap, speaking of which, I forgot to mention JSON. Maybe Circe migration should be 3.0?

@metasim I agree that core meaningfully implies more than types. I'm least enthused about common, which doesn't really have much semantic value other than "I'm a grab-bag package".

metasim commented 7 years ago

Avro codecs too?

lossyrob commented 7 years ago

Boundable, Bounds, EmptyBounds, KeyBounds, LayerId, Metadata, SpatialKey, SpaceTimeKey, TemporalKey, TileLayerMetadata, LayoutDefinition and friends

All of this has to do with tiling and IO. It seems more like this should be broken out into a core IO module, or something that has to do with tiling.

All CRSs? (from proj4)

We already have a subproject that has all the CRSs - proj4. There's also an issue about handling other projection libraries (https://github.com/locationtech/geotrellis/issues/1294). I was imagining the refactor here being that we have a geotrellis.crs library that can detect whether another projection library is present and use it (e.g. proj4j, proj.4 bindings, or Apache SIS).

All KeyIndex types and indexing functionality

When we move to supporting SFCurve, this will carry an additional dependency and be worth a project in its own right (e.g. geotrellis-indexing).

Extent

This is a vector type, I think it belongs in geotrellis-vector.

If we carry through all the JSON and Avro, we'd be adding a lot of dependencies to things that don't need them (e.g. raster doesn't need Avro, and people use raster on its own).

I think some sort of different breakdown is definitely needed. I'm not sure this core vs non-core representation is it.

fosskers commented 7 years ago

It seems more like this should be broken out into a core IO module

A quote from @echeipesh from the original May conversation:

Ok, I think there is a set of functionality, working with tiled layouts. That is a common pattern for what we use, but it doesn’t actually have a lot to do with raster.

I agree that those types have a lot to do with tiling, but I'm not seeing the connection to IO. Surely those things are used by our IO backends (s3, etc.), but the types themselves are GIS concepts that can be used in pure code that doesn't involve Spark or the idea of data passing through a program. Related to that, it's always struck me as odd that our GeoJSON typeclass instances live in vector.io. Converting between JSON and data types in and of itself has nothing to do with IO, unless we're not sharing the same definition of IO?

We already have a subproject that has all the CRS's - proj4.

lazy val proj4 = project
  .settings(commonSettings)
  .settings(javacOptions ++= Seq("-encoding", "UTF-8"))

Cool, looks like proj4 is at the bottom of the dep graph too. Might not be a problem for it to stay that way, which means that geotrellis-core might even be able to depend on it.
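
If so, the sbt wiring for a new core module could be as simple as the following (hypothetical; the project name and dependency direction are my assumptions):

lazy val core = project
  .settings(commonSettings)
  .dependsOn(proj4)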

When we move to supporting SFCurve, this will carry an additional dependency, and be worth a project in it's own right (e.g. geotrellis-indexing)

:+1:

[Extent] is a vector type, I think it belongs in geotrellis-vector.

@echeipesh said the same thing, but it's damn central to tiling work. I realize that there's a close relationship between Extent and Polygon, but Extent by itself isn't unique to working with geometries. I'd say it's as fundamental to any GeoTrellis work as SpatialKey is.

Furthermore, if Extent doesn't live in core then neither can LayoutDefinition, and then we quickly see the point of a core unravelling if half the things that should be in there can't be.

echeipesh commented 7 years ago

Overall I agree with the need to simplify the import burden of using GeoTrellis.

-core vs -tiling breakout

Otherwise, the main drivers for the class shuffle have been the need to use the Collections API without a Spark dependency. In general this leads down the road of working with tiled rasters somewhere other than geotrellis-spark. This includes:

All of these things can live pretty happily in geotrellis-raster, but that does scope their re-use somewhat. In fact there is a lot of functionality that has to do with laying out either pixels or tiles over a map surface, and a lot of the classes @fosskers mentioned sit at the bottom of that class hierarchy. With the exception of LayerId, none of them have anything specifically to do with IO; that's just the place where they are most prominently used. My gut feeling is that it makes sense to home those in geotrellis-core rather than trying to find a very specific package name.

Codecs

Another pain point is the lack of a good home for Avro/JSON codecs. The pain compounds when you consider library choices. Migrating to circe would be nice, but cutting people loose from spray-json could also be problematic, as it would force all downstream applications to deal with swapping in circe, which, as we saw with Raster Foundry, comes with its own gotchas. A similar story goes for Avro, if you consider that ProtoBuf is at least one more valid choice.

The answer, hopefully, is to separate the GeoTrellis interfaces from any specific library, kind of like it's done with Component (our abstraction over lenses), and then break the Avro and spray-json codecs out into their own packages.
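
Roughly the kind of separation I imagine (a sketch only; these names are made up, not existing GeoTrellis API):

/* A library-agnostic codec interface that could live in core... */
trait JsonCodec[A] {
  def write(a: A): String
  def read(json: String): Either[String, A]
}

/* ...with the spray-json and circe implementations supplied by separate
 * subprojects (e.g. geotrellis-json-spray, geotrellis-json-circe), so
 * downstream users only pull in the library they actually use. */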

fosskers commented 7 years ago

Right, duh, the Collections API is a much stronger motivation for geotrellis-core than my SpatialKey woes. ;p

Orphan Instances

Re: where should json instances / avro codecs go?

We implement typeclasses in Scala with implicits. These are sometimes implicit class wrappings, and sometimes implicit vals which provide instances of classes needed by some external library functionality (like Encoder from Circe). If these implicits are defined anywhere but the companion object of the type they're tied to, they become "orphans".

Said another way, using Encoder as an example: if it is possible to import a type A while not importing its Encoder instance, then that instance is an orphan.

Claim: Orphans increase user cognitive overhead, specifically regarding imports.

Example from our FAQ regarding .toGeoJson.

Defining orphans leads to compiler warnings in languages with first-class typeclasses; Scala gives us no such warning, so it's just something we have to be proactive about.

That said, does anyone foresee problems with migrating json / avro codecs to the companion objects of their types? This is how Circe instance auto-generation works - the instances are written into companion objects. The library breakout approach suggested by @echeipesh in his Codecs section above has value to be sure, although the fact remains that the instances themselves would be orphans.
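
As a concrete sketch, using a toy stand-in for SpatialKey and circe's Encoder:

import io.circe.Encoder
import io.circe.syntax._

/* Toy type for illustration -- not the real GeoTrellis SpatialKey. */
case class Key(col: Int, row: Int)

/* Companion-object instance: importing Key always brings its Encoder along,
 * so `Key(1, 2).asJson` works with no extra imports. */
object Key {
  implicit val keyEncoder: Encoder[Key] =
    Encoder.forProduct2("col", "row")(k => (k.col, k.row))
}

/* Orphan instance: the same Encoder defined somewhere unrelated to the type.
 * Users must discover and add `import JsonCodecs._` themselves -- the import
 * guessing game described above. */
object JsonCodecs {
  implicit val orphanKeyEncoder: Encoder[Key] =
    Encoder.forProduct2("col", "row")(k => (k.col, k.row))
}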

moradology commented 7 years ago

As an alternative to stuffing would-be orphans into companions, we could offer alternative spray and circe imports, stuffing all required codecs into a single import object. Barring that, maybe we could also wrap the spray encoders/decoders we have with circe (did this a bit in RFML) to give nice interoperation within a circe application.
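
A rough sketch of that single-import-object idea (hypothetical package and object names):

package geotrellis.codecs   // hypothetical

object circe {
  /* implicit io.circe.Encoder/Decoder instances for each GeoTrellis type */
}

object sprayjson {
  /* implicit spray.json.JsonFormat instances for each GeoTrellis type */
}

/* Applications then opt into exactly one JSON world, e.g.:
 *   import geotrellis.codecs.circe._ */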

fosskers commented 7 years ago

I just hit the orphan instance issue again. This won't compile:

import geotrellis.raster._
import geotrellis.spark._
import geotrellis.spark.io.s3._
import org.apache.spark._

// --- //

object Pipeline extends App {
  override def main(args: Array[String]): Unit = {
    implicit val sc: SparkContext = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("pipeline-test")
    )

    val reader = S3LayerReader("geotrellis-test", "colingw/pipeline")
    val layer = LayerId("mask", 14)

    val rdd = reader.read[SpatialKey, Tile, TileLayerMetadata[SpatialKey]](layer)

    println(s"COUNT: ${rdd.count}")

    sc.stop()
    println("done")
  }
}

saying

could not find implicit value for evidence parameter of type spray.json.JsonFormat[geotrellis.spark.SpatialKey]

So having SpatialKey in scope is not enough to read a layer that uses them. Trial and error reveals this:

import geotrellis.spark.io._

to be the necessary extra import.
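
i.e. the full set of imports that makes the snippet above compile:

import geotrellis.raster._
import geotrellis.spark._
import geotrellis.spark.io._   // provides the implicit spray-json formats (e.g. for SpatialKey)
import geotrellis.spark.io.s3._
import org.apache.spark._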

fosskers commented 7 years ago

An idea from @lossyrob yesterday, for some future version of GeoTrellis. Paraphrased:

We shouldn't wrap the JTS geometry types ourselves. All our functions should work on JTS geoms directly, with extra functionality added by injected methods only when necessary.

fosskers commented 7 years ago

^ Although it occurs to me now that this would force the user to import (and depend on!) JTS themselves when they want to reference the Geom types in their code's type signatures.

Workaround until Scala supports re-exports?

import blah.blah.jts

package object vector {
  type Point = jts.Point
  ...  // same for other geom types
}

metasim commented 7 years ago

@fosskers From where I stand I agree wholeheartedly with @lossyrob and really like the idea of using type aliases to simplify imports. Sounds like a winning approach to me.