twhiteaker / CFGeom

CF Convention for Representing Simple Geometry Types
MIT License

Incorporate a "curve" geometry type #26

Open bekozi opened 8 years ago

bekozi commented 8 years ago

WKT representations for curves and their derivative products can get complex. Below are some curve-based examples.

CIRCULARSTRING(1 5, 6 2, 7 3)
COMPOUNDCURVE(CIRCULARSTRING(0 0,1 1,1 0),(1 0,0 1))
CURVEPOLYGON(CIRCULARSTRING(-2 0,-1 -1,0 0,1 -1,2 0,0 2,-2 0),(-1 0,0 0.5,1 0,0 1,-1 0))
MULTICURVE((5 5,3 5,3 3,0 3),CIRCULARSTRING(0 0,2 1,2 2))

Curves primarily derive from "circular strings", so this would need to be supported as well. These WKT encodings do not fall outside what our reference implementation can support. The nested stuff for compound curves is a little different.
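
For readers who have not worked with circular strings: each arc in a CIRCULARSTRING is given by three points (start, a point on the arc, end), so client code has to recover the circle itself. Below is a minimal Python sketch of that step, assuming plain 2D coordinates. The function name `arc_circle` is hypothetical and is not part of the reference implementation.

```python
import math


def arc_circle(p0, p1, p2):
    """Return (center_x, center_y, radius) of the circle through three points."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    # d is proportional to the signed area of the triangle p0, p1, p2;
    # zero means the points are collinear and no circle exists.
    d = 2.0 * (x0 * (y1 - y2) + x1 * (y2 - y0) + x2 * (y0 - y1))
    if d == 0:
        raise ValueError("collinear points do not define a circular arc")
    s0 = x0 * x0 + y0 * y0
    s1 = x1 * x1 + y1 * y1
    s2 = x2 * x2 + y2 * y2
    cx = (s0 * (y1 - y2) + s1 * (y2 - y0) + s2 * (y0 - y1)) / d
    cy = (s0 * (x2 - x1) + s1 * (x0 - x2) + s2 * (x1 - x0)) / d
    return cx, cy, math.hypot(x0 - cx, y0 - cy)


# The arc from the COMPOUNDCURVE example: CIRCULARSTRING(0 0,1 1,1 0)
center_x, center_y, radius = arc_circle((0, 0), (1, 1), (1, 0))
```

For that arc the circle is centered at (0.5, 0.5). Any encoding that stores only nodes would still need a flag telling the reader to interpret three consecutive nodes this way rather than as a two-segment line.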

twhiteaker commented 8 years ago

In addition to the multipart_break_value flag, we'll need flags to indicate what each geometry type is as it is encountered. We'll also need a clear way of knowing when nested parts end. For example, consider this:

MULTICURVE(COMPOUNDCURVE(CIRCULARSTRING(0 0,1 1,1 0),(1 0,0 1)),(5 5,3 5))

That's a multicurve containing a compound curve and a line. The compound curve includes a circular string and a line. Between (1 0,0 1) and (5 5,3 5) we could insert a multipart_break_value flag, but it's ambiguous whether (5 5,3 5) is a new part of the multicurve or part of the compound curve.
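
The ambiguity can be made concrete with a small Python sketch. It assumes a hypothetical flattening scheme in which every part boundary, at any nesting depth, is written as the same break value. The nested lists and the `flatten` helper are illustrative only.

```python
BREAK = -1  # stands in for multipart_break_value


def flatten(geometry):
    """Flatten nested parts into one index list, separating leaf parts with BREAK."""
    out = []

    def walk(node):
        if node and isinstance(node[0], list):
            # A container (multicurve or compound curve): visit each child.
            for child in node:
                walk(child)
        else:
            # A leaf part (list of node indexes).
            if out:
                out.append(BREAK)
            out.extend(node)

    walk(geometry)
    return out


# Node indexes: circular string = 0..2, line (1 0,0 1) = 3..4, line (5 5,3 5) = 5..6
circ, line_a, line_b = [0, 1, 2], [3, 4], [5, 6]

# MULTICURVE(COMPOUNDCURVE(circ, line_a), line_b)
nested_one = [[circ, line_a], line_b]
# MULTICURVE(COMPOUNDCURVE(circ, line_a, line_b))
nested_two = [[circ, line_a, line_b]]

flat_one = flatten(nested_one)
flat_two = flatten(nested_two)
```

Both structures flatten to the same sequence, so a reader cannot tell which one was written. Some extra information (type flags, part counts, or distinct break values per level) is needed to recover the nesting.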

To support these compound geometry types I'm wondering if we should instead follow a pattern like WKB. See the figure at the bottom of this page for an example of how a polygon is stored. In WKB, you start with a flag indicating the type of geometry, then you typically have a count of the number of its parts, and then you have the coordinates. In netCDF, that boxy figure linked above would be our coordinate index array.
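
As a rough illustration of the WKB pattern described above (type flag, then counts, then coordinates), here is a minimal Python sketch that packs and unpacks a 2D polygon using only the standard `struct` module. It handles only little-endian polygons; the function names are hypothetical, and real code would more likely rely on an existing WKB library.

```python
import struct


def polygon_to_wkb(rings):
    """Pack a list of rings (each a list of (x, y) tuples) as little-endian WKB."""
    # Byte order flag (1 = little endian), geometry type (3 = Polygon), ring count.
    buf = struct.pack("<BII", 1, 3, len(rings))
    for ring in rings:
        buf += struct.pack("<I", len(ring))  # point count for this ring
        for x, y in ring:
            buf += struct.pack("<dd", x, y)
    return buf


def wkb_to_polygon(buf):
    """Read back what polygon_to_wkb wrote."""
    byte_order, geom_type, n_rings = struct.unpack_from("<BII", buf, 0)
    if byte_order != 1 or geom_type != 3:
        raise ValueError("sketch only handles little-endian WKB polygons")
    offset = struct.calcsize("<BII")
    rings = []
    for _ in range(n_rings):
        (n_points,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        ring = []
        for _ in range(n_points):
            ring.append(struct.unpack_from("<dd", buf, offset))
            offset += 16
        rings.append(ring)
    return rings


triangle = [(30, 20), (45, 40), (10, 40), (30, 20)]
wkb = polygon_to_wkb([triangle])
round_trip = wkb_to_polygon(wkb)
```

The reader never has to search for a terminator: each count says exactly how much follows, which is what removes the nesting ambiguity discussed above.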

This is a major change from what we've proposed and what Ben has implemented thus far, but not only am I struggling to think of a better way to handle compound geometries, I also gravitate toward reusing ideas other people have already thought a lot about and worked through.

Thoughts?

bekozi commented 8 years ago

Wow, interesting example. I don't have a position on adopting a WKB-like convention yet, but I would like to think more about the implications of a nested geometry like this. A few questions+comments.

  1. Do you consider this a "simple geometry"? The circle (CIRCULARSTRING) I would definitely consider one, but I'm not sure about MULTICURVE and COMPOUNDCURVE.
  2. How widely used are these curve geometries? I'm not saying they are unimportant - just curious if you have seen them used in practice. They seem modeling/simulation related. I have never seen them used except in internet discussions.
  3. Emulating WKB in netCDF-CF would make geometry encodings more general. However, it would also increase the spec complexity making it more difficult to write client code. I would hesitate to add too much bling for an edge case like this. We also seem to be recreating WKB with an indexing trick. There is a lot of software out there that can read/write WKB that maybe we can leverage to "simplify" WKB into something fitting the CF spirit.

This is a major change from what we've proposed and what Ben has implemented thus far, but not only am I struggling to think of a better way to handle compound geometries,

It is not too much of a change. What's important is that we have a series of unit tests to make sure the changes work! I do like the generality of the approach. If you are up for it, could you create a simple example for a multi-polygon using your proposed WKB-like method?

I also gravitate toward reusing ideas other people have already thought a lot about and worked through.

What we are doing here is very similar to UGRID so not too new. The WKB approach would be a pretty strong deviation from what's been done in CF I think.

twhiteaker commented 8 years ago

  1. In WKT terms, I consider point, linestring, and polygon to be simple.
  2. I don't know how much other people use curves. I use curves in lines and polygons in my GIS work. Folks outside of the ArcGIS world probably don't because curves are not supported in shapefiles.
  3. I think our current implementation has a ceiling of WKT primitives and their multipart counterparts. The primitives alone might even be enough for most netCDF users. If we want to add things like compound curves later on, I fear that instead of simply extending our work, someone will have to rewrite it. It's a question I struggle with: Is what we have sufficient, or should we design for the future?

could you create a simple example for a multi-polygon using your proposed WKB-like method?

Yep, I'll do that.

twhiteaker commented 8 years ago

Here are a couple of examples in netCDF-3 using a WKB-like representation.

Well-Known Text (WKT): MULTIPOLYGON (((30 20, 45 40, 10 40, 30 20)), ((15 5, 40 10, 10 20, 5 10, 15 5)))

WKB-like with coordinates in their own arrays

netcdf multipolygon_wkb_style {
dimensions:
    geom = 1 ;
    geom_node = 9 ;
    wkb_node = 17 ;
variables:
    int crs() ;
    double x(geom_node) ;
    double y(geom_node) ;
    int wkb_index(wkb_node) ;
        wkb_index:coordinates = "x y" ;
        wkb_index:stop_encoding = "cra" ;
        wkb_index:outer_ring_order = "anticlockwise" ;
        wkb_index:closure_convention = "last_node_equals_first" ;
    int wkb_index_stop(geom) ;
        wkb_index_stop:contiguous_ragged_dimension = "wkb_node" ;
data:

 x = 30, 45, 10, 30, 15, 40, 10, 5, 15 ;

 y = 20, 40, 40, 20, 5, 10, 20, 10, 5 ;

 wkb_index = 6, 2, 3, 1, 4, 0, 1, 2, 3, 3, 1, 5, 4, 5, 6, 7, 8 ;

 wkb_index_stop = 17 ;
}

The wkb_index explained:

  6 - I am a multipolygon
  2 - I have two polygons
  3 - Polygon #1 says I am a polygon
  1 - I have one ring
  4 - My ring has four points
  0 through 3 - Indexes of the points within the x and y arrays
  3 - Polygon #2 says I am a polygon
  1 - I have one ring
  5 - My ring has five points
  4 through 8 - Indexes of the points within the x and y arrays
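
To show how a client might walk this index array, here is a minimal Python sketch that decodes the example above into nested polygons, rings, and (x, y) tuples. The function names are hypothetical, and the sketch only handles a multipolygon of polygons, not the other WKB type codes.

```python
MULTIPOLYGON, POLYGON = 6, 3  # WKB geometry type codes used in the example


def read_polygon(index, pos, x, y):
    """Read one polygon starting at index[pos]; return (rings, next position)."""
    if index[pos] != POLYGON:
        raise ValueError("expected a polygon type code")
    n_rings = index[pos + 1]
    pos += 2
    rings = []
    for _ in range(n_rings):
        n_points = index[pos]
        node_ids = index[pos + 1 : pos + 1 + n_points]
        # Look the node indexes up in the separate x and y arrays.
        rings.append([(x[i], y[i]) for i in node_ids])
        pos += 1 + n_points
    return rings, pos


def read_multipolygon(index, x, y):
    """Decode a WKB-like multipolygon index array into a list of polygons."""
    if index[0] != MULTIPOLYGON:
        raise ValueError("expected a multipolygon type code")
    n_polygons = index[1]
    pos = 2
    polygons = []
    for _ in range(n_polygons):
        rings, pos = read_polygon(index, pos, x, y)
        polygons.append(rings)
    return polygons


x = [30, 45, 10, 30, 15, 40, 10, 5, 15]
y = [20, 40, 40, 20, 5, 10, 20, 10, 5]
wkb_index = [6, 2, 3, 1, 4, 0, 1, 2, 3, 3, 1, 5, 4, 5, 6, 7, 8]

polygons = read_multipolygon(wkb_index, x, y)
```

The decoded rings match the WKT given at the top of the comment. Note that the meaning of each integer depends entirely on its position, which is the readability cost of this layout.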

WKB-like with coordinates in the index array (closer to WKB; Ethan will frown at me)

netcdf multipolygon_wkb_style {
dimensions:
    geom = 1 ;
    wkb_node = 26 ;
variables:
    int crs() ;
    int wkb_index(wkb_node) ;
        wkb_index:stop_encoding = "cra" ;
        wkb_index:outer_ring_order = "anticlockwise" ;
        wkb_index:closure_convention = "last_node_equals_first" ;
    int wkb_index_stop(geom) ;
        wkb_index_stop:contiguous_ragged_dimension = "wkb_node" ;
data:

 wkb_index = 6, 2, 3, 1, 4, 30, 20, 45, 40, 10, 40, 30, 20, 3, 1, 5, 15, 5, 40, 10, 10, 20, 5, 10, 15, 5 ;

 wkb_index_stop = 26 ;
}

The wkb_index explained:

  6 - I am a multipolygon
  2 - I have two polygons
  3 - Polygon #1 says I am a polygon
  1 - I have one ring
  4 - My ring has four points
  30, 20, 45, 40, 10, 40, 30, 20 - first polygon's coordinates
  3 - Polygon #2 says I am a polygon
  1 - I have one ring
  5 - My ring has five points
  15, 5, 40, 10, 10, 20, 5, 10, 15, 5 - second polygon's coordinates
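
For the writing direction, here is a minimal Python sketch that produces this inline layout from nested polygons. The function name `encode_multipolygon` is hypothetical, and the sketch assumes the same type codes as the example (6 for multipolygon, 3 for polygon).

```python
def encode_multipolygon(polygons):
    """Flatten polygons -> rings -> (x, y) points into one WKB-like list."""
    out = [6, len(polygons)]  # multipolygon type code, polygon count
    for rings in polygons:
        out += [3, len(rings)]  # polygon type code, ring count
        for ring in rings:
            out.append(len(ring))  # point count for this ring
            for px, py in ring:
                out += [px, py]  # coordinates stored inline as x, y pairs
    return out


polygons = [
    [[(30, 20), (45, 40), (10, 40), (30, 20)]],
    [[(15, 5), (40, 10), (10, 20), (5, 10), (15, 5)]],
]
wkb_index = encode_multipolygon(polygons)
```

The result has 26 values, matching the `wkb_node` dimension above. Because counts, type codes, and coordinates share one array, they must share one data type; with `wkb_index` declared as `int`, as in the example, only integer coordinates could be stored.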

twhiteaker commented 8 years ago

If there was a way to leverage existing WKB/WKT software, that might be very effective. It might not be accepted in the CF culture, though. At the May meeting, Ethan shot down the idea of storing geometries as WKT strings in an array, which is akin to what I showed in my second example above.

BobSimons commented 8 years ago

I remain a fan of using the existing WKT standard (and to a lesser extent WKB because of the lack of human readability). I still think that basically this project has largely (but not completely) duplicated WKT/WKB to create a non-standard way to encode WKT/WKB in a .nc file. We created code to take WKT/WKB into a .nc file, and code to generate WKT/WKB from the info in the file. It's still WKT/WKB, except not quite because it doesn't follow the standard 100%.

I said before: everything in a .nc file follows a standard encoding: We use IEEE 754 to convert bits to/from numbers. We use charsets to convert bits to/from characters. Ideally, we use ISO 8601(2004) extended format to convert bits (text) to date+times.

For me, the better solution is to add support for vlen arrays and compression. Then you can store WKT strings of different lengths and have compression reduce most of the verbosity of WKT vs WKB.
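
A rough way to explore the size trade-off is sketched below in Python. It builds a 1000-vertex polygon, encodes it as WKT text and as little-endian WKB, and compresses the WKT with `zlib`. Here `zlib` is only a stand-in: whether and how a netCDF library would compress variable-length strings is an open question in this thread, and the polygon is synthetic.

```python
import math
import struct
import zlib

n = 1000
# Synthetic ring: points on a unit circle, rounded to 6 decimals, closed at the end.
ring = [
    (round(math.cos(2 * math.pi * k / n), 6), round(math.sin(2 * math.pi * k / n), 6))
    for k in range(n)
]
ring.append(ring[0])

wkt = "POLYGON ((" + ", ".join(f"{x} {y}" for x, y in ring) + "))"
wkt_bytes = wkt.encode("ascii")

# WKB: byte order, type 3 (Polygon), one ring, point count, then x/y doubles.
wkb_bytes = struct.pack("<BII", 1, 3, 1) + struct.pack("<I", len(ring))
wkb_bytes += b"".join(struct.pack("<dd", x, y) for x, y in ring)

wkt_compressed = zlib.compress(wkt_bytes)

sizes = {
    "wkt": len(wkt_bytes),
    "wkt_compressed": len(wkt_compressed),
    "wkb": len(wkb_bytes),
}
print(sizes)
```

Running this on representative data would show how much of the gap between WKT and WKB compression actually closes; results will depend heavily on coordinate precision.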



twhiteaker commented 8 years ago

@ethanrd would you mind weighing in on this? Maybe it's because I'm a bit of an outsider with netCDF CF, but I like what Bob is proposing. I recall at our May meeting someone proposing the same thing, and you immediately came out against the idea. Would you mind elaborating on why you think this isn't a good idea?

bekozi commented 8 years ago

Many thanks, @twhiteaker, for the examples. It more or less makes sense how you've put the index arrays together. For human readability, it is more difficult to interpret as we now have positive integer values with different meanings depending on their index locations.

If we are going to go the route of duplicating, essentially fully, the WKB encoding method for geometry objects, then it may make little sense to create a verbose specification and reference implementation. The spec should say we use "WKB Specification version X supported by software API Y" to encode and decode WKB stored in netCDF. And, as @BobSimons discusses, we work to identify how to efficiently store the WKB (WKT) in netCDF 3 and 4. The WKB storage in netCDF seems more like an issue for Unidata engineers. I'm not sure how much help I (at least) can be for this.

Getting the CF community to accept the WKT/WKB storage method may be difficult, but if a strong case can be made that this is the correct approach, then who knows. @twhiteaker already said this.

Speaking strictly to the WKB encoding presented by @twhiteaker:

A nice CF feature has been how simple it is to write client applications interpreting it. I’m not sure if the WKB encoding method falls into this category. There is a fair amount of indexing to track, and extracting quick information on the contents of the coordinate index array is more complex. I think this may dissuade potential users from adopting it. However, it is not too bad and contains nice things up front like node counts and element/ring counts.

On the other hand, the WKB approach is nice because it will capture the full set of geometry types supported by WKB. The nested geometry types (compound curves) would require special consideration in a simple geometries specification that need not be added to the current WKB spec. Ellipses would need to be added somehow.

There seem to be 3 things that currently need to be decided – more or less:

  1. Modify draft spec / reference implementation to use WKB-like encodings.
  2. If 1, do we even write a reference implementation, or do we provide a bridge between netCDF storage and already available WKB software packages?
  3. If 2 and use currently available software, will this be accepted by the CF oversight committee?

I can see a number of advantages to 1. It’s overkill for very simple geometries, but it does provide room to grow for more complex cases. Where are others on this approach? @twhiteaker are you for it?

For me, the better solution is to add support for vlen arrays and compression. Then you can store WKT strings of different lengths and have compression reduce most of the verbosity of WKT vs WKB.

This would be very important for WKT (ASCII strings) as opposed to an encoded WKB string (usually hexadecimal). For the applications I work on, WKT would be very inefficient because of the complex polygons (probably fine when compressed but bad when loaded into memory). WKB would be fine. Contiguous ragged arrays with index variables have supported the complex polygon storage with little issue (UGRID, ESMF unstructured).

My biggest issue with storing WKT and WKB directly is that we have essentially done little to reduce the work of geometry decoding. The coordinates are still locked in non-array things that get transformed into a geometry object by an external dependency. I still have to test for ordering, rings, etc. using a separate client or my own coordinate extraction utility. If we go this route, I don't see how we've improved on providing a GUID link in a netCDF file to an external shapefile / spatial database / GeoJSON / etc.

@BobSimons it sounds like you are :-1: on what we've done so far? That's perfectly okay. I'm just trying to get a sense of where everyone is so we can start identifying how to move forward.

twhiteaker commented 8 years ago

I apologize for my second example. I accidentally left x and y values in there even though I removed the variables. I have now edited the comment to remove the values.

If the WKT/WKB approach would be acceptable to the CF community and software developers, I prefer it since it uses an existing standard. I just don't know if what we're proposing would be viewed as clever or ludicrous. Also, but ellipses.

If we don't go with WKT/WKB, then I think we should limit our scope to [multi]point|line|polygon (and maybe circles and ellipses since there seems to be some interest in that). When we start thinking about curves, it seems like with the addition of more flags we are coming around to reinventing WKB anyway. So if we're not going with WKT/WKB from the get-go, then I think we should strive to keep it simple.

Does anyone have suggestions on how we can get a sanity check on the WKB idea from Unidata and big CF players with enough time before AGU to actually provide an example implementation? From the May meeting, @hrajagers expressed interest in this topic; can we get his input? And we were supposed to recruit Denis Nadeau.

My biggest issue with storing WKT and WKB directly is that we have essentially done little to reduce the work of geometry decoding.

Would Unidata or other netCDF developers have a problem folding in existing WKT libraries in C, Java and Python? I don't know how good the Java and C libraries are. Shapely (Python) can only handle the simple geometry types I think.

bekozi commented 8 years ago

If the WKT/WKB approach would be acceptable to the CF community and software developers, I prefer it since it uses an existing standard. I just don't know if what we're proposing would be viewed as clever or ludicrous. Also, but ellipses.

Would Unidata or other netCDF developers have a problem folding in existing WKT libraries in C, Java and Python? I don't know how good the Java and C libraries are. Shapely (Python) can only handle the simple geometry types I think.

There is a straight WKT/WKB solution that is available in netCDF4. @dblodgett-usgs (I think @BobSimons was also involved) and I spoke with @lesserwhirls (Sean Arms) about using the BLOB data type to store WKT/WKB directly. @lesserwhirls went as far as suggesting we could dynamically link GDAL or GEOS to add some metadata about the geometries when using utilities like ncdump. Again this is only for netCDF4. NetCDF3 would require a different solution.

The BLOB data type would need to be exposed in the respective Python and R libraries. Keep in mind that this approach will not support Fortran applications - a strong consideration for CF datasets.

Does anyone have suggestions on how we can get a sanity check on the WKB idea from Unidata and big CF players with enough time before AGU to actually provide an example implementation?

I would suggest drafting a description of the approach and sending to the CF Metadata email list.

ethanrd commented 8 years ago

@twhiteaker I know I objected at the meeting to encoding CRS information as WKT because there is an existing, though limited, CF method for encoding this information. I'm not quite sure what I said about WKT/WKB for geometry information. But my current thinking (perhaps influenced by a conversation with @BobSimons at the recent ESIP meeting) is that we should definitely use the existing standard rather than develop a partial re-creation. However, keep in mind that CF's founding principles include simplicity, self-description (e.g., no numeric codes), and being human readable/understandable (see [1], page 2). Given that, here are some thoughts on some of the options mentioned:

[1] http://cfconventions.org/Data/cf-documents/overview/article.pdf

bekozi commented 8 years ago

Many thanks for the input @ethanrd. If you don't mind another question, we have developed a somewhat controversial method to mimic WKT in something that is CF-like (WKT in and out for testing). There are negative integer flags in coordinate index arrays (UGRID-like) to indicate breaks in multi-geometries and starts of coordinates for holes/interiors in polygons. Would these be considered okay? Would they need to be replaced by start/stop/counts? An example is below for a multi-polygon with holes/interiors. If you want to see other examples of this approach, they are here: https://github.com/bekozi/netCDF-CF-simple-geometry/wiki/Examples---VLen-Ragged-Arrays.

Well-Known Text (WKT): MULTIPOLYGON(((0 0, 20 0, 20 20, 0 20, 0 0), (1 1, 10 5, 19 1, 1 1), (5 15, 7 19, 9 15, 5 15), (11 15, 13 19, 15 15, 11 15)), ((5 25, 9 25, 7 29, 5 25)), ((11 25, 15 25, 13 29, 11 25)))

Common Data Language (CDL):

netcdf _ncsg_describe_ {
types:
  int64(*) geom_VLType ;
dimensions:
  node = 25 ;
  geom = 1 ;
variables:
  geom_VLType coordinate_index(geom) ;
    string coordinate_index:geom_type = "multipolygon" ;
    string coordinate_index:coordinates = "x y" ;
    coordinate_index:multipart_break_value = -1LL ;
    coordinate_index:hole_break_value = -2LL ;
    string coordinate_index:outer_ring_order = "anticlockwise" ;
    string coordinate_index:closure_convention = "last_node_equals_first" ;
  double x(node) ;
  double y(node) ;
data:

 coordinate_index = 
    {0, 1, 2, 3, 4, -2, 5, 6, 7, 8, -2, 9, 10, 11, 12, -2, 13, 14, 15, 16, -1, 17, 18, 19, 20, -1, 21, 22, 23, 24} ;

 x = 0, 20, 20, 0, 0, 1, 10, 19, 1, 5, 7, 9, 5, 11, 13, 15, 11, 5, 9, 7, 5, 
    11, 15, 13, 11 ;

 y = 0, 0, 20, 20, 0, 1, 5, 1, 1, 15, 19, 15, 15, 15, 19, 15, 15, 25, 25, 29, 
    25, 25, 25, 29, 25 ;
}
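
To illustrate how little client code the break-value layout needs, here is a minimal Python sketch that decodes the example above. The helper names are hypothetical; the break values are passed in as parameters because the CDL declares them as attributes rather than fixed constants.

```python
def split_on(values, marker):
    """Split a list into sublists at every occurrence of marker."""
    parts, current = [], []
    for v in values:
        if v == marker:
            parts.append(current)
            current = []
        else:
            current.append(v)
    parts.append(current)
    return parts


def decode_multipolygon(coordinate_index, x, y, multipart_break=-1, hole_break=-2):
    """Return a list of polygons; each polygon is [exterior, hole, hole, ...]."""
    polygons = []
    for part in split_on(coordinate_index, multipart_break):
        rings = split_on(part, hole_break)  # first ring is the exterior
        polygons.append([[(x[i], y[i]) for i in ring] for ring in rings])
    return polygons


coordinate_index = [0, 1, 2, 3, 4, -2, 5, 6, 7, 8, -2, 9, 10, 11, 12, -2,
                    13, 14, 15, 16, -1, 17, 18, 19, 20, -1, 21, 22, 23, 24]
x = [0, 20, 20, 0, 0, 1, 10, 19, 1, 5, 7, 9, 5, 11, 13, 15, 11, 5, 9, 7, 5,
     11, 15, 13, 11]
y = [0, 0, 20, 20, 0, 1, 5, 1, 1, 15, 19, 15, 15, 15, 19, 15, 15, 25, 25, 29,
     25, 25, 25, 29, 25]

polygons = decode_multipolygon(coordinate_index, x, y)
```

This yields three polygons, the first with one exterior ring and three holes, matching the WKT. Two break levels are enough here; the nested curve types discussed earlier in the thread would need more than this.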

ethanrd commented 8 years ago

@bekozi I like this approach. The multipart_break_value and related attributes give it something of a CF feel, sidestepping to some degree the whole "no numeric codes" issue.

(Is this approach controversial because it is encoded into netCDF-CF data structures rather than a String/Blob that can be directly fed to WKT/WKB reading software?)

I agree with your suggestion of writing up some of the suggested methods for encoding WKT/WKB information and asking for input from the CF metadata email list.

bekozi commented 8 years ago

The multipart_break_value and related attributes give it something of a CF feel, sidestepping to some degree the whole "no numeric codes" issue.

@ethanrd, good to hear this is in line with what CF might accept.

(Is this approach controversial because it is encoded into netCDF-CF data structures rather than a String/Blob that can be directly fed to WKT/WKB reading software?)

I think there are a couple things controversial and/or concerning about an approach like this. I'll try to summarize. @twhiteaker / @BobSimons please expand.

I personally have reservations about all three approaches. I prefer the method with the negative integer codes because it (1) is CF-like and easy to interpret for the simplest cases (having no multi-geometries or holes/interiors), (2) has a near equivalent representation in WKT, and (3) can be "nested" to create the complex geometries described above. The main drawback to the negative integer approach is creating another standard requiring maintenance and new software APIs.

I agree with your suggestion of writing up some of the suggested methods for encoding WKT/WKB information and asking for input from the CF metadata email list.

Yeah, I think it makes sense to get feedback on our internal discussion so far.

twhiteaker commented 8 years ago

I personally have reservations about all three approaches

Me too. I really like the idea of using an existing standard, but I understand the desire to keep CF simple.

I propose we create a wiki page to serve as a space for drafting our letter to CF. I'm happy to get the page going, but I'd like to nail down what we're proposing a little better first.

As @bekozi mentioned, storing blobs is problematic for Fortran. You'll also have netCDF 3 issues. So I suggest we do not propose storing WKB as blob. That leaves us with Ben's approach, the WKB-like approach with coordinates in the index array, and straight WKT. All three approaches rely on our VLEN proposal for netCDF-3. (btw, I took out stop_encoding from the VLEN write-up)

The main questions I would have for the CF list are:

  1. Is our VLEN netCDF-3 approach acceptable?
  2. Is a scope of [multi]point|line|polygon + ellipses OK (Ben's approach), or should we design so that we can accommodate all WKT types (WKB-like or WKT approach)?
  3. If we plan to support WKT types, do you prefer WKB-like or WKT approach?

Let me know if that sounds like a good plan, and I'll start writing.

bekozi commented 8 years ago

@twhiteaker, my apologies for the slow response. Your question list looks good, and thanks for offering to start the initial draft. Ping again when it is ready to be edited. I opened a new ticket (#29) for this mail as we are strongly deviating from the original curve discussion.