locationtech / geotrellis

GeoTrellis is a geographic data processing engine for high performance applications.
http://geotrellis.io

Support a non-blocking Amazon S3 backend #2306

Open hectcastro opened 7 years ago

hectcastro commented 7 years ago

Currently, the Amazon S3 backend has a blocking API for various operations (AttributeStore, ValueReader, LayerQuery, etc.). This increases the complexity of integrating GeoTrellis with web services built on top of libraries like Akka HTTP that expect I/O operations within a request/response cycle to be non-blocking.

It is not entirely clear what the best approach to playing nicer with Akka HTTP is, but perhaps it means that classes like AttributeStore and ValueReader have non-blocking equivalents returning Futures.
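
A minimal sketch of what a "non-blocking equivalent returning Futures" could look like, using only the Scala standard library. The names `BlockingValueReader`, `AsyncValueReader`, and `fromBlocking` are hypothetical stand-ins for illustration and are not the GeoTrellis API; the pool size is likewise an assumption.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, ExecutionContextExecutorService, Future}
import scala.concurrent.duration._

// Hypothetical stand-in for a blocking reader; not the real GeoTrellis trait.
trait BlockingValueReader[K, V] {
  def read(key: K): V
}

// Hypothetical non-blocking counterpart: same operation, but the result is a Future.
trait AsyncValueReader[K, V] {
  def read(key: K): Future[V]
}

object AsyncValueReader {
  // Adapts a blocking reader by running each read on the supplied execution context,
  // so the calling thread (for example a request-handling thread) is not blocked.
  def fromBlocking[K, V](underlying: BlockingValueReader[K, V])(
      implicit ec: ExecutionContext): AsyncValueReader[K, V] =
    new AsyncValueReader[K, V] {
      def read(key: K): Future[V] = Future(underlying.read(key))
    }
}

object AsyncReaderExample {
  def main(args: Array[String]): Unit = {
    // Dedicated pool for blocking I/O; the size of 4 is an arbitrary assumption.
    implicit val ioEc: ExecutionContextExecutorService =
      ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

    // Simulated slow read standing in for an S3 request.
    val slowReader = new BlockingValueReader[String, Int] {
      def read(key: String): Int = { Thread.sleep(50); key.length }
    }

    val reader = AsyncValueReader.fromBlocking(slowReader)
    try println(Await.result(reader.read("tile"), 5.seconds))
    finally ioEc.shutdown()
  }
}
```

Note that this adapter only moves the blocking call onto another thread. A truly non-blocking backend would need an asynchronous client underneath.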

lossyrob commented 6 years ago

This seems like a pretty heavy lift to think all the way through, and potentially an API break. We could run a parallel API that is async, though that may increase maintenance headaches and would have to be weighed.

Is there potentially some documentation that would solve this, based on the research into how best to use Executors etc in Raster Foundry? My understanding is that the big win here in the API would be to place blocking in the appropriate places; otherwise this wouldn't be much different than a client wrapping our calls in Futures how they want.
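
For reference, a sketch of the "client wrapping our calls in Futures" baseline, with `scala.concurrent.blocking` placed around the synchronous call. `fetchTileBytes` is a hypothetical placeholder for a blocking S3 read, not a GeoTrellis function.

```scala
import scala.concurrent.{Await, Future, blocking}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object BlockingHintExample {
  // Hypothetical placeholder for a synchronous S3 read.
  def fetchTileBytes(key: String): Array[Byte] = {
    Thread.sleep(100)
    key.getBytes("UTF-8")
  }

  // `blocking` signals to the execution context that this task is about to block.
  // The global context can react by temporarily adding threads, so other tasks
  // are less likely to be starved.
  def fetchTileAsync(key: String): Future[Array[Byte]] =
    Future(blocking(fetchTileBytes(key)))

  def main(args: Array[String]): Unit = {
    val keys = (1 to 16).map(i => s"tile-$i")
    val all = Future.sequence(keys.map(fetchTileAsync))
    val sizes = Await.result(all, 10.seconds).map(_.length)
    println(sizes.sum)
  }
}
```

The `blocking` hint only has an effect on execution contexts that honor it, such as the global one. On a plain fixed-size thread pool it is effectively a no-op, which is one reason the placement question raised above matters.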

hectcastro commented 6 years ago

I agree that this is likely a big lift, and is mostly driven out of the desire to reduce the overhead of using GeoTrellis in a web context. The reason for creating this issue stems from an observation that other libraries we use within a web context in Scala have some form of support to not block the caller thread (Slick for database access, Akka's HTTP client, AWS SDK for Java 2.0 for AWS API calls, etc.).

I think we could address this with documentation by deemphasizing GeoTrellis' role in web/tile server contexts, or trying to educate users about how the calls they'd likely be making in those contexts block. Trying to think about the bigger picture, I feel as though this could end up being a much more foundational decision point: How to provide first-class support for use cases that need/prefer Spark vs. those that need/prefer the Collections API? Their use cases seem different enough that they'd likely continue to create tension around API decision making.

lossyrob commented 6 years ago

Thanks for mentioning the other libraries, it gives something specific to look at to understand what we'd have to implement.

The other suggestions I'm not so clear on. What do you mean by deemphasizing GeoTrellis's role in those contexts? Do you have specific suggestions around that? For the education part, I think it would be a great contribution/knowledge sharing tool for people who hit up against these problems in RF to share what they've learned. You probably have the best idea of how to describe those issues, and who else felt that pain, so if there's some pointers on how to lift that knowledge to the broader community that would be appreciated.

This is the first thing I'm hitting on that is tension around API decision making wrt Spark and Collections API. There were some tensions about using Spark inside a web server, which the Collections API was meant to alleviate - but I don't think that falls into that category. Can you explain a bit more about what you're referring to, or what parts of the API you are seeing tension? The async issue doesn't seem to me to be a tension that has anything to do with Spark, as the ValueReader and LayerCollectionReader are not used in a Spark setting.

This won't make it into 1.2, but I'm happy to explore things for a 2.0 if there's better API to service the akka-http use case. If anyone has ideas on positive directions, please speak up!

hectcastro commented 6 years ago

> The other suggestions I'm not so clear on. What do you mean by deemphasizing GeoTrellis's role in those contexts? Do you have specific suggestions around that?

To be clear, de-emphasis wouldn't be my desired approach. That was more one part of a suggested solution to your question about how to address the root of this issue via documentation. But along those lines, I think that I'd reconsider statements like the ones below which could lead potential users/evaluators down a path that runs counter to, or at the very least doesn't completely align with, things like having Spark as a dependency or blocking APIs for AttributeStore and the various readers and writers:

> GeoTrellis includes a set of utilities to help developers create useful, high performing web services that load and manipulate raster data.
>
> It aims to provide raster processing at web speeds (sub-second or less) with RESTful endpoints as well as provide fast batch processing of large raster data sets.

Note: Emphasis mine.

The other, more proactive part of a documentation based solution, might be to assemble a concrete set of use cases. For example, a web service project with guidance on how to avoid blocking the default dispatcher.


> For the education part, I think it would be a great contribution/knowledge sharing tool for people who hit up against these problems in RF to share what they've learned. You probably have the best idea of how to describe those issues, and who else felt that pain, so if there's some pointers on how to lift that knowledge to the broader community that would be appreciated.

I honestly view these issues and the discussion around them as the first step toward that. We have run into issues, and we have applied workarounds, but it is still our first rodeo with most of it. At this point, I barely feel comfortable opening these issues, let alone drawing attention to the workarounds as any form of a gold standard.

That said, I don't think there would be any opposition to helping package some of the work Kelly is doing internally by reviewing it and publishing it as a use case for how to think about assembling a web service that plays well with Akka HTTP.


> This is the first thing I'm hitting on that is tension around API decision making wrt Spark and Collections API. There were some tensions about using Spark inside a web server, which the Collections API was meant to alleviate - but I don't think that falls into that category. Can you explain a bit more about what you're referring to, or what parts of the API you are seeing tension? The async issue doesn't seem to me to be a tension that has anything to do with Spark, as the ValueReader and LayerCollectionReader are not used in a Spark setting.

I agree that this is one of the first things with regard to API decision making, but I also think that past tensions around combining Spark with a web service contribute to the same set of bigger picture concerns.

In general, I think the tension I'm sensing manifests itself in the fact that committing resources toward one set of use cases either takes away from, or doesn’t contribute to, others. For example, making the APIs we talk about above async (updating or adding new APIs, managing thread pools, providing proven default settings, surfacing tunable configuration, etc.) doesn’t seem to provide much benefit to more batch oriented use cases that need/prefer Spark. How feasible is it to try and satisfy both well? Does it make sense to de-emphasize or abandon one in order to satisfy the other better?

notthatbreezy commented 6 years ago

I think this is a great discussion -- and I think we are bumping into the question of "how to build a performant tile server with GeoTrellis" in Raster Foundry, which is why some of this is coming up.

I agree that we're hesitant to necessarily say we've solved the problem or come up with an optimal solution, but I think it's not necessarily clear to us where changes in the library should be helping us or if we should build on top of it and perform workarounds. I think the case I can point to is how we use other libraries in a server context when they have blocking operations that are almost always IO based (slick, akka, net spy, hikari). These libraries help out by managing thread pools and providing tools and knobs to manage options based on server size, workload, etc. In GT we basically had to do that in the RF tile server. First we added blocking everywhere, but that didn't seem to be correct or easily maintainable. Then we started separating out thread pools for different types of IO we needed to perform (attribute store, collection reader, etc) and that has been working out better.
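
A rough sketch of the "separate thread pools per type of IO" approach described above, using only the Scala standard library. It is not the Raster Foundry implementation; `readLayerMetadata`, `readTile`, the pool names, and the pool sizes are all hypothetical.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, ExecutionContextExecutorService, Future}
import scala.concurrent.duration._

object SeparatePoolsExample {
  // One pool per kind of blocking IO, so a burst of tile reads cannot starve
  // attribute lookups (and vice versa). Sizes are illustrative assumptions.
  val attributeStoreEc: ExecutionContextExecutorService =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(2))
  val tileReadEc: ExecutionContextExecutorService =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

  // Hypothetical blocking calls standing in for attribute store and tile reads.
  def readLayerMetadata(layer: String): String = {
    Thread.sleep(20)
    s"metadata($layer)"
  }

  def readTile(layer: String, col: Int, row: Int): String = {
    Thread.sleep(50)
    s"$layer/$col/$row"
  }

  def serveTile(layer: String, col: Int, row: Int): Future[(String, String)] = {
    // The cheap flatMap/map glue runs on the global context; only the
    // blocking calls themselves run on the dedicated IO pools.
    implicit val composeEc: ExecutionContext = ExecutionContext.global
    for {
      meta <- Future(readLayerMetadata(layer))(attributeStoreEc)
      tile <- Future(readTile(layer, col, row))(tileReadEc)
    } yield (meta, tile)
  }

  def main(args: Array[String]): Unit = {
    try println(Await.result(serveTile("ndvi", 3, 5), 5.seconds))
    finally {
      attributeStoreEc.shutdown()
      tileReadEc.shutdown()
    }
  }
}
```

Appropriate pool sizes depend on server size and workload, which is exactly the kind of tuning knob the libraries mentioned above expose.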

I'm not saying these should be in GeoTrellis, maybe documentation is the answer -- I don't know. It's not even always the case that issues here manifest themselves unless the server does start to be under load and when combined with other IO.

lossyrob commented 6 years ago

I personally don't have a lot of visibility into the issues or workarounds - others on GeoTrellis may, but I'm speaking from my perspective. The people hitting on the problems (yous) have that visibility. And Raster Foundry is a large user of our open source project... I would prefer that GeoTrellis go towards solving the problems our users have under a shared vision of what this project is for. If we deemphasized or fell off trying to get better at the web server side of things, we'd be doing you (and a ton more use cases) a disservice. Same thing with the batch side. This is all to say, if there are ways to improve things, let's do it.

My original point of it being a heavy lift was to imply this won't make it into 1.2. That is, if we rely on the GeoTrellis committers to do it. If someone wants to push on it, even that will take some review time; but not impossible, and I guess dependent on what people want.

My other point, about maintenance burdens being weighed, is still true - but based on both your explanations about how this type of thing is necessary to support tile server use cases, I'm on the side that it's a good thing to try and support (others may side differently).

I don't think this issue is so ground rattling as to try and change the vision or the course of this project on a whole. If there is a deep problem with the vision or direction of this project, I would suggest opening another issue around that.

Because you have the insight into the problems, and a potentially-maybe-not-perfect-working workaround that can inform someone like me who doesn't have deep knowledge of the problem, I would really appreciate if we could try to design this together, with it driven by the use case that has the issues. We could target 2.0 for this (as I have it marked). If it's something that is pressing enough to warrant a 1.3 and doesn't break API, that's another option.

echeipesh commented 6 years ago

While the tension between batch and async use-cases exists, I don't think we can realistically dump one vs the other. Using a batch process to transform and index data such that it can be made available for quick query and small region transformation is the basic pattern of most if not all applications using GeoTrellis.

Part of the friction is that this has been an emergent rather than designed-for pattern and the library shows this. Introduction of the collections API and decoupling ValueReaders from SparkContext were some of the first steps but others remain:

The details of each one of those items aren't totally clear to me and will likely require some thought and iteration. At the very least the challenges RF faced tuning the tile server are instructive for a first attempt. Because this likely results in pretty significant API breaks/additions I feel this is in 2.0 territory, but a priority feature.

hectcastro commented 6 years ago

> Part of the friction is that this has been an emergent rather than designed-for pattern...

I think that this is a good way of framing it.

lossyrob commented 6 years ago

After the discussion, I think if we architect a solution and it ends up being a parallel API (and therefore not API breaking), and it would provide clear value to the RF team/would make working with GT easier for a chunk of work that needs done, i.e. not something that would be swapped out for a working workaround, I'd be in favor of getting a quick 1.3 out after 1.2 with these features included. This way we could get the priority feature out of the way and focus on some of the other big ticket items of 2.0.

echeipesh commented 6 years ago

Removing this from the 2.0 release after a couple of conversations on the team. There are a couple of decisions that need to be made here that are likely to require testing and feedback before meeting upstream:

This issue has been around for a long time and it shouldn't be delayed. The best course of action here is to start the geotrellis-contrib project and repo and start iterating on it there.

pomadchin commented 6 years ago

Probably all collections / non-Spark-related IO API can be wrapped into cats.IO and should be async by default without bad consequences. We can set up parallelism to 1 by default, in order to avoid some unexpected API behavior.

pomadchin commented 6 years ago

Also we may want to consider switching to https://github.com/aws/aws-sdk-java-v2