ngageoint / mrgeo

MrGeo is a geospatial toolkit designed to provide raster-based geospatial capabilities that can be performed at scale. MrGeo is built upon Apache Spark and the Hadoop ecosystem to leverage the storage and processing of hundreds of commodity computers. See the wiki for more details.
https://github.com/ngageoint/mrgeo/wiki
Apache License 2.0

Support publishing new images in map algebra #372

Open djohnson729 opened 8 years ago

djohnson729 commented 8 years ago

When images are saved by MrGeo, both in map algebra and at the end of the "mapalgebra" command line, provide a way to publish the new image to one or more serving mechanisms (like GeoServer).

Suggested approach:

ericwood73 commented 8 years ago

I'll take this one

ericwood73 commented 8 years ago

So the SaveMapOp should always save the input mapop to the file, but should have an option to publish to a server as well? Would it make more sense to keep concerns separated and create a PublishMapOp instead? It could have optional arguments for the REST URL and JSON mapping that could default to whatever is in the conf file.

djohnson729 commented 8 years ago

As far as a separate publish function, the argument I have against that is that you can only publish saved images. So if we have a publish function, it would be a problem if the user tried to call publish on a raster result of a previous map algebra function like:

slope = elev.slope("deg")
slope.publish()

I don't think it would be practical to pass in a REST URI as a map algebra argument for publishing because, depending on the service we're publishing to, it could be a POST or PUT and might require a payload as well. In addition, we would like to keep the map algebra syntax simple and approachable for data scientists, most of whom would not know which values to pass for those parameters anyway. That's why I think it is better left to an admin or devops type of person to set up in the MrGeo configuration.

ericwood73 commented 8 years ago

Understood. Is GeoServer the only example we have right now of an API for publishing images?

djohnson729 commented 8 years ago

Yes for now

ericwood73 commented 8 years ago

I've looked at the GeoServer API and want to understand the workflow a little better.

When saving an image, there is an option to publish the image. In GeoServer terms, is this equivalent to adding a "Coverage"? In order to publish an image or a shapefile, GeoServer requires a "Workspace" to be created. Would this workspace be configured at the MrGeo instance level? Do we expect that the workspace would be the same for all images published from a MrGeo instance? Once the workspace is created, the image can be published to an existing Coverage Store, or a Coverage Store can be created if it does not exist. Would we expect to have a single coverage store for the workspace? If not, how would the coverage store be specified? Also, is it envisioned that this feature would upload the file to GeoServer, or request GeoServer to publish a file already on the server disk?

ericwood73 commented 8 years ago

What format would the data be in to be published? GeoServer supports a variety of raster formats. From what I can tell, the save operation ultimately stores the image in HDFS (or Accumulo) via the PairRDDFunctions.saveAsNewAPIHadoopDataset function. This format doesn't appear to be usable by GeoServer directly, so I assume that we would convert the raster in the RDD into some suitable format, such as GeoTIFF or another GDAL-supported format. How would we save individual tiles? Each as its own file? Or do we aggregate the tiles back together somehow before saving?

djohnson729 commented 8 years ago

Take a look at https://github.com/ngageoint/mrgeo-geoserver-plugin for some more information. Tim wrote this a while back so that GeoServer could access MrGeo images directly. Specifically, take a look at MrGeoLayerUpdate.java first. That will probably answer some of your questions.

The idea behind that code is that it would run periodically in the background and refresh the list of layers in GeoServer based on the list of images it discovers in MrGeo's image base (from the mrgeo.conf file). Sort of a "lazy" discovery mechanism. The downside, of course, is that new layers will not be immediately known by GeoServer, so this ticket is meant to update it right away. It will be important for our Python notebook capability so that customers could use GeoServer as their WMS, and as soon as the "save" operation completes in map algebra within the notebook, the image could be displayed in the notebook via WMS queries from Leaflet or something similar.
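To make the notebook scenario concrete, here is a minimal sketch of the kind of WMS GetMap request a notebook map widget would issue once the layer is known to GeoServer. The query parameters are the standard WMS 1.1.1 GetMap parameters; the endpoint URL and the layer name "mrgeo:slope" are placeholders for illustration, not values taken from MrGeo or the plugin.

```python
from urllib.parse import urlencode


def wms_getmap_url(base_url, layer, bbox, width=256, height=256):
    """Build a WMS 1.1.1 GetMap URL for a single layer.

    base_url and layer are supplied by the caller; nothing here is
    specific to MrGeo or to the mrgeo-geoserver-plugin.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty value requests the layer's default style
        "SRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),  # minx,miny,maxx,maxy
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and layer name, for illustration only.
url = wms_getmap_url(
    "http://localhost:8080/geoserver/wms",
    "mrgeo:slope",
    (-180, -90, 180, 90),
)
```

With lazy discovery, a request like this fails until the background refresh has registered the layer, which is the delay this ticket aims to remove.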

ttislerdg commented 8 years ago

Workspace and Coverage Store should probably be configured via mrgeo.conf properties. The raster type is "MrGeo", which is from our mrgeo-geoserver plugin.
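As a rough sketch of what publishing with a configured workspace and coverage store could involve, the function below only builds the two GeoServer REST requests (create the coverage store, then add the coverage) without sending them. The URL paths follow GeoServer's REST API layout; the JSON payload fields are assumptions, and the MrGeo plugin may need additional fields that are not shown here. The function name and its arguments are hypothetical.

```python
import json


def geoserver_publish_requests(rest_url, workspace, store, image_name):
    """Return (method, url, body) tuples for publishing one image.

    Assumptions: the payload fields below are a guess at the minimum
    GeoServer accepts; "MrGeo" is the raster type named in this thread.
    The caller should skip the first request if the store already exists.
    """
    base = rest_url.rstrip("/")
    stores_url = f"{base}/workspaces/{workspace}/coveragestores"
    coverages_url = f"{stores_url}/{store}/coverages"

    store_body = {
        "coverageStore": {
            "name": store,
            "type": "MrGeo",  # raster type provided by the plugin
            "enabled": True,
            "workspace": workspace,
        }
    }
    coverage_body = {
        "coverage": {"name": image_name, "nativeName": image_name}
    }
    return [
        ("POST", stores_url, json.dumps(store_body)),
        ("POST", coverages_url, json.dumps(coverage_body)),
    ]


# Hypothetical values; in practice these would come from mrgeo.conf.
requests_to_send = geoserver_publish_requests(
    "http://localhost:8080/geoserver/rest/", "mrgeo", "mrgeo-store", "slope"
)
```

Keeping request construction separate from sending makes it easy to inspect exactly what would be posted before pointing it at a real server.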

ericwood73 commented 8 years ago

In addition to a boolean argument for whether or not to publish the image, I was thinking of an optional string argument that would indicate the publishing profile to use, e.g. "geoserver". This would allow for multiple publishing profiles to be supported for one or more services (with different endpoints, configuration, etc.). We'd have a publisher factory read the MrGeo configuration and system properties and configure the correct publisher based on the settings. We could also have a default profile, so that if publish is true but no profile is specified, it will use the default. We could prefix any profile settings with "mrgeo.publisher.{profile name}" and we could have a mrgeo.publisher.defaultProfile={the default profile name}.
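A minimal sketch of how the prefix convention could be resolved, assuming the configuration has already been loaded into a flat dictionary of strings. The "mrgeo.publisher." prefix and the defaultProfile key come from the proposal above; the setting names "url" and "workspace" and both function names are hypothetical.

```python
PREFIX = "mrgeo.publisher."
DEFAULT_KEY = PREFIX + "defaultProfile"


def group_profiles(props):
    """Group 'mrgeo.publisher.{profile}.{setting}' keys by profile name."""
    profiles = {}
    for key, value in props.items():
        if not key.startswith(PREFIX) or key == DEFAULT_KEY:
            continue
        # Split on the first dot after the prefix, so profile names
        # cannot contain dots but setting names can.
        profile, _, setting = key[len(PREFIX):].partition(".")
        if setting:
            profiles.setdefault(profile, {})[setting] = value
    return profiles


def resolve_profile(props, requested=None):
    """Return (name, settings) for the requested or default profile."""
    profiles = group_profiles(props)
    name = requested or props.get(DEFAULT_KEY)
    if name is None:
        raise KeyError("no profile requested and no default profile set")
    if name not in profiles:
        raise KeyError(f"unknown publishing profile: {name}")
    return name, profiles[name]


# Hypothetical configuration values, for illustration only.
conf = {
    "mrgeo.publisher.defaultProfile": "geoserver",
    "mrgeo.publisher.geoserver.url": "http://localhost:8080/geoserver/rest",
    "mrgeo.publisher.geoserver.workspace": "mrgeo",
}
name, settings = resolve_profile(conf)
```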

ericwood73 commented 8 years ago

In the suggested approach above, there would need to be a mrgeo.publisher.{profile name}.class property that points to the publisher class and a mrgeo.publisher.{profile name}.configuratorClass that points to a configurator that knows how to configure the publisher in order for a profile to be valid.
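The validity rule could be sketched as below: a profile counts as valid only when both class properties are present and non-empty. The "class" and "configuratorClass" setting names come from the comment above; the helper names are hypothetical. The real implementation would presumably use JVM reflection, so `load_class` is only a Python stand-in for loading a class by its dotted name.

```python
import importlib

REQUIRED_SETTINGS = ("class", "configuratorClass")


def missing_settings(settings):
    """List the required settings that are absent or empty."""
    return [key for key in REQUIRED_SETTINGS if not settings.get(key)]


def split_valid_profiles(profiles):
    """Separate profiles into valid ones and a report of what is missing."""
    valid, invalid = {}, {}
    for name, settings in profiles.items():
        missing = missing_settings(settings)
        if missing:
            invalid[name] = missing
        else:
            valid[name] = settings
    return valid, invalid


def load_class(dotted_name):
    """Load a class from a 'package.module.ClassName' string."""
    module_name, _, class_name = dotted_name.rpartition(".")
    if not module_name:
        raise ValueError(f"not a dotted class name: {dotted_name}")
    return getattr(importlib.import_module(module_name), class_name)


# Hypothetical profiles; the class names are placeholders.
profiles = {
    "geoserver": {
        "class": "example.GeoserverPublisher",
        "configuratorClass": "example.GeoserverConfigurator",
    },
    "broken": {"class": "example.OtherPublisher"},
}
valid, invalid = split_valid_profiles(profiles)
```

Reporting which property is missing, rather than silently skipping the profile, gives the admin who maintains mrgeo.conf something actionable.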

djohnson729 commented 8 years ago

We were thinking that mrgeo.conf could contain publishing settings for multiple services, and when the true flag is passed to "save", it would always publish to all of them.
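In outline, that behavior might look like the sketch below. The `save` function and `RecordingPublisher` class are stand-ins for illustration, not MrGeo code, and the actual write of the image is elided. Collecting failures per publisher instead of stopping at the first error is a design assumption that was not discussed in this thread.

```python
class RecordingPublisher:
    """Stand-in publisher that only records what it was asked to publish."""

    def __init__(self, name):
        self.name = name
        self.published = []

    def publish(self, image_name):
        self.published.append(image_name)


def save(image_name, publishers, publish=False):
    """Save an image, then optionally publish it to every configured service.

    Returns a dict mapping publisher name to the exception it raised,
    which is empty when everything succeeded or publish is False.
    """
    # Writing the image to the data store would happen here (elided).
    failures = {}
    if publish:
        for publisher in publishers:
            try:
                publisher.publish(image_name)
            except Exception as exc:  # keep going so other services still publish
                failures[publisher.name] = exc
    return failures


internal = RecordingPublisher("internal")
cloud = RecordingPublisher("cloud")
failures = save("slope", [internal, cloud], publish=True)
```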

ericwood73 commented 8 years ago

Is there a use case where you might have the same service with different endpoints and you only want to publish to one? For example, you might have one GeoServer on AWS and one internally, and you only want to publish some imagery to the one in the cloud. We could say the default is to publish to all, but provide an optional list of profile names in case users do not want to make every image available everywhere. It wouldn't take any extra effort to support a subset of all configured publishers.

djohnson729 commented 8 years ago

We don't have a use case for that right now, so I would say let's hold off on it until there is a need.

ericwood73 commented 8 years ago

Do you see any value in reading keywords from the configuration settings and setting them for the published image? You would have the same keywords for every image, so I'm not sure how useful that would be, especially since we have the same workspace, namespace, and coverage store for every image.

ericwood73 commented 8 years ago

Actually all of the settings can be different for different profiles. While we publish to all profiles in the current implementation, it would be easy to support publishing to a subset of profiles (specified in the save command), which could be for the same endpoint, but with different settings.
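Supporting a subset could be as small as the selection step sketched below, where omitting the list keeps the current publish-to-all behavior. The function name, profile names, and settings are hypothetical; the two example profiles share an endpoint but differ in workspace, matching the case described above.

```python
def select_profiles(configured, requested=None):
    """Pick the profiles to publish to.

    With no requested names, every configured profile is used, which
    mirrors the current publish-to-all behavior.
    """
    if not requested:
        return list(configured)
    unknown = [name for name in requested if name not in configured]
    if unknown:
        raise ValueError(f"unknown publishing profiles: {unknown}")
    return list(requested)


# Hypothetical profiles: same endpoint, different settings.
configured = {
    "public": {"url": "http://localhost:8080/geoserver/rest", "workspace": "public"},
    "restricted": {"url": "http://localhost:8080/geoserver/rest", "workspace": "restricted"},
}
everything = select_profiles(configured)
only_public = select_profiles(configured, ["public"])
```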