twhiteaker / CFGeom

CF Convention for Representing Simple Geometry Types
MIT License
9 stars 4 forks source link

how to indicate polygon holes in CF #16

Closed twhiteaker closed 8 years ago

twhiteaker commented 8 years ago

The CDL example in #13 stores indicators of polygon holes within the same array that stores polygon node lats or lons. Let's call this the PR13 way. I'm a little uncomfortable with this because we're mixing coordinates and flags into a single variable.

Another way to do this could be to use CF ancillary variables, and specifically, a flag variable, to indicate holes, leaving the node array to store only node coordinates.

PR13 pros:

PR13 cons:

Ancillary variable pros:

Ancillary variable cons:

To compare each of these approaches, I've created CDL examples below. I'm representing two geometries; here's the WKT: POLYGON ((0 0, 6 0, 3 6, 0 0), (2 1, 3 2, 4 1, 2 1)) POLYGON ((4 6, 7 0, 10 6, 4 6))

You can feed this website the following multipolygon and select WKT as the format if you want to see those geometries. MULTIPOLYGON (((0 0, 6 0, 3 6, 0 0), (2 1, 3 2, 4 1, 2 1)), ((4 6, 7 0, 10 6, 4 6)))

For the examples, I started with the CDL in #13 and added the contiguous_ragged_dimension attribute from https://github.com/bekozi/netCDF-CF-simple-geometry/issues/11#issuecomment-224442244.

PR13 example

Note how the hole flag, -2, is found in both the polyLat and polyLon data.

netcdf example {
  dimensions:
    polyNodes = 13 ;  // number of nodes plus flags to indicate holes
    polygons = 2 ; // number of features
  variables:
    int crs() ;
    int polyLat(polyNodes) ;
        polyLat:standard_name = "polygon y node" ;
        polyLat:grid_mapping = "crs"
    int polyLon(polyNodes) ;
        polyLon:standard_name = "polygon x node" ;
        polyLon:grid_mapping = "crs"
    int polyNodeStop(polygons) ;
        polyNodeStop:contiguous_ragged_dimension = "polyNodes" ;
    // global attributes:
    :Conventions = "CF-1.8" ;
  data:
polyLat = 0, 0, 6, 0, -2, 1, 2, 1, 1, 6, 0, 6, 6 ;
polyLon = 0, 6, 3, 0, -2, 2, 3, 4, 2, 4, 7, 10, 4 ;
polyNodeStop = 9, 13 ;
crs = _ ;
}

Ancillary variable example

In this example, I use flags to indicate the start of each polygon, which isn't necessary for this simple case but may be useful later when we play with multipolygons, geometry collections, etc. The flag value for the start of each polygon comes from Table 7: Integer codes for geometric types from the OGC simple feature access spec. I also have a hole flag.

netcdf example {
  dimensions:
    polyNodes = 12 ;  // number of nodes
    polygons = 2 ; // number of features
  variables:
    int crs() ;
    int polyLat(polyNodes) ;
        polyLat:standard_name = "polygon y node" ;
        polyLat:grid_mapping = "crs"
        polyLat:ancillary_variables = "polyNodeFlag" ;
    int polyLon(polyNodes) ;
        polyLon:standard_name = "polygon x node" ;
        polyLon:grid_mapping = "crs"
        polyLon:ancillary_variables = "polyNodeFlag" ;
    int polyNodeStop(polygons) ;
        polyNodeStop:contiguous_ragged_dimension = "polyNodes" ;
    int polyNodeFlag(polyNodes) ;
        polyNodeFlag:long_name = "Well-known Text Flag" ;
        polyNodeFlag:standard_name = "wkt_flag" ;
        polyNodeFlag:_FillValue = -9999 ;
        polyNodeFlag:valid_range = -2, 3 ;
        polyNodeFlag:flag_values = -2, 3 ; // positive values from OGC 06-103r4
        polyNodeFlag:flag_meanings = "hole Polygon" ;
    // global attributes:
    :Conventions = "CF-1.8" ;
  data:
polyLat = 0, 0, 6, 0, 1, 2, 1, 1, 6, 0, 6, 6 ;
polyLon = 0, 6, 3, 0, 2, 3, 4, 2, 4, 7, 10, 4 ;
polyNodeStop = 8, 12 ;
polyNodeFlag = 3, _, _, _, -2, _, _, _, 3, _, _, _ ;
crs = _ ;
}

Questions for the team:

  1. Am I missing use cases that kill the ancillary variable approach?
  2. Do you have a sense of how well each approach would perform?
  3. Other thoughts?
twhiteaker commented 8 years ago

btw I used int for lat and lon in the examples just to avoid decimal clutter.

bekozi commented 8 years ago

This is looking good. Let me try and work up an example in the near-term of the data structure cf.loads would expect. The major difference from this is that an index variable into polyLat and polyLon would contain the flags and the poly* variables would contain coordinates only. No risk of coordinates and flags having the same values. The grid_mapping attribute would only exist on the index variable and not on the coordinate variables. This is consistent with UGRID.

Following from your case here, the proposed polygonCoordinateIndex variable would have dimension (polygons, polyNodes).

I definitely like the CF approach to defining flags though unsure if this would give us unnecessary overhead.

bekozi commented 8 years ago

As an example, the data structure used for cf.loads looks like:

netcdf _ncsg_test_output_ {
types:
  int64(*) geom_VLType ;
dimensions:
    geom = 1 ;
    node = 12 ;
variables:
    geom_VLType coordinateIndex(geom) ;
        string coordinateIndex:geom_type = "polygon" ;
        string coordinateIndex:coordinates = "x y" ;
        coordinateIndex:multipart_break_value = -1L ;
        coordinateIndex:hole_break_value = -2L ;
    double x(node) ;
    double y(node) ;
data:

 coordinateIndex = {0, 1, 2, 3, -2, 4, 5, 6, 7, -1, 8, 9, 10, 11} ;

 x = 0, 6, 3, 0, 2, 3, 4, 2, 4, 7, 10, 4 ;

 y = 0, 0, 6, 0, 1, 2, 1, 1, 6, 0, 6, 6 ;
}

This is for the WKT: MULTIPOLYGON (((0 0, 6 0, 3 6, 0 0), (2 1, 3 2, 4 1, 2 1)), ((4 6, 7 0, 10 6, 4 6))). The code used to generate this is here: https://github.com/bekozi/netCDF-CF-simple-geometry/blob/i7-dumps/src/python/ncsg/test/test_ncsg/test_cf.py#L108.

This is by no means intended to be complete with accurate names, etc. This is essentially the format used for the NHDPlus catchment polygon conversion so I can vouch that its scalable. The mesh format we were using collapsed the ragged array into 1D and added stops similar to polyNodeStop. I believe that format would fall into the continuous ragged array category.

I am :+1: on a format like this, but that kinda goes without saying...

twhiteaker commented 8 years ago

Would you mind showing an example where geom = 2 please?

bekozi commented 8 years ago

@twhiteaker Yes, that would have been useful. See below.

netcdf _ncsg_test_output_ {
types:
  int64(*) geom_VLType ;
dimensions:
    geom = 3 ;
    node = 24 ;
variables:
    geom_VLType coordinateIndex(geom) ;
        string coordinateIndex:geom_type = "polygon" ;
        string coordinateIndex:coordinates = "x y" ;
        coordinateIndex:multipart_break_value = -1L ;
        coordinateIndex:hole_break_value = -2L ;
    double x(node) ;
    double y(node) ;
data:

 coordinateIndex = {0, 1, 2, 3, -2, 4, 5, 6, 7, -1, 8, 9, 10, 11}, 
    {12, 13, 14, 15, -2, 16, 17, 18, 19}, {20, 21, 22, 23} ;

 x = 0, 6, 3, 0, 2, 3, 4, 2, 4, 7, 10, 4, 0, 6, 3, 0, 2, 3, 4, 2, 4, 7, 10, 4 ;

 y = 0, 0, 6, 0, 1, 2, 1, 1, 6, 0, 6, 6, 0, 0, 6, 0, 1, 2, 1, 1, 6, 0, 6, 6 ;
}

This is for WKT:

MULTIPOLYGON (((0 0, 6 0, 3 6, 0 0), (2 1, 3 2, 4 1, 2 1)), ((4 6, 7 0, 10 6, 4 6)))
POLYGON ((0 0, 6 0, 3 6, 0 0), (2 1, 3 2, 4 1, 2 1))
POLYGON ((4 6, 7 0, 10 6, 4 6))
twhiteaker commented 8 years ago

I reworked the nc3 ancillary variable example from above to include the three WKT geometries in https://github.com/bekozi/netCDF-CF-simple-geometry/issues/16#issuecomment-225245800. This scheme DOES NOT support compound geometries, e.g., GeometryCollection, MultiCurve, CompoundCurve, and CurvePolygon. For these types, you would need to know when the current geometry is ending (could be several if nested) and the type of the new geometry.

netcdf nc3_and_flags {
  dimensions:
    node = 24 ;  // number of nodes
    geom = 3 ; // number of features
  variables:
    int crs() ;
    double y(node) ;
        y:standard_name = "polygon y node" ;
        y:ancillary_variables = "geom_node_flag" ;
    double x(node) ;
        x:standard_name = "polygon x node" ;
        x:ancillary_variables = "geom_node_flag" ;
    int geom_stop(geom) ;
        geom_stop:contiguous_ragged_dimension = "node" ;
        geom_stop:grid_mapping = "crs" ;
        geom_stop:outer_ring_order = "anticlockwise" ;
        geom_stop:closure_convention = "last_node_equals_first" ;
    int geom_node_flag(node) ;
        geom_node_flag:long_name = "Well-known Text Flag" ;
        geom_node_flag:standard_name = "wkt_flag" ;
        geom_node_flag:_FillValue = -9999 ;
        geom_node_flag:valid_range = -2, 6 ;
        geom_node_flag:flag_values = -2, 3, 6 ; // positive values from OGC 06-103r4
        geom_node_flag:flag_meanings = "hole geometry_end Polygon MultiPolygon" ;
    // global attributes:
    :Conventions = "CF-1.8" ;
  data:
y = 0, 0, 6, 0, 1, 2, 1, 1, 6, 0, 6, 6, 0, 0, 6, 0, 1, 2, 1, 1, 6, 0, 6, 6 ;
x = 0, 6, 3, 0, 2, 3, 4, 2, 4, 7, 10, 4, 0, 6, 3, 0, 2, 3, 4, 2, 4, 7, 10, 4 ;
geom_stop = 12, 20, 24 ;
geom_node_flag = 6, _, _, _, -2, _, _, _, 3, _, _, _, 3, _, _, _, -2, _, _, _, 3, _, _, _ ;
crs = _ ;
}

Same example data, but this time as nc3 CRA which more closely matches Ben's example with the addition of type of the geometry starting at the current index value. This could support compound geometries if we add flags for when a geometry is ending.

netcdf nc3_index_cra {
  dimensions:
    geom = 3 ; // number of features
    node = 24 ;  // number of nodes
    nodes_and_flags = 31; // number of nodes plus indicators of geometry types, holes, etc.
  variables:
    int crs() ;
    double y(polyNodes) ;
        y:standard_name = "polygon y node" ;
    double x(polyNodes) ;
        x:standard_name = "polygon x node" ;
    int coordinate_index(nodes_and_flags) ;
        coordinateIndex:geom_types = "MultiPolygon Polygon" ;  // two geom types here
        coordinate_index:coordinates = "x y" ;
        coordinate_index:grid_mapping = "crs" ;
        coordinate_index:hole_break_value = -2 ;
        coordinate_index:multipart_break_value = -1 ;
        coordinate_index:MultiPolygon_break_value = -6 ;
        coordinate_index:Polygon_break_value = -3 ;
        coordinate_index:outer_ring_order = "anticlockwise" ;
        coordinate_index:closure_convention = "last_node_equals_first" ;
    int geom_stop(geom) ;
        geom_stop:contiguous_ragged_dimension = "nodes_and_flags" ;
    // global attributes:
    :Conventions = "CF-1.8" ;
  data:
coordinate_index = -6, 0, 1, 2, 3, -2, 4, 5, 6, 7, -1, 8, 9, 10, 11, -3, 12, 13, 14, 15, -2, 16, 17, 18, 19, -3, 20, 21, 22, 23 ;
y = 0, 0, 6, 0, 1, 2, 1, 1, 6, 0, 6, 6, 0, 0, 6, 0, 1, 2, 1, 1, 6, 0, 6, 6 ;
x = 0, 6, 3, 0, 2, 3, 4, 2, 4, 7, 10, 4, 0, 6, 3, 0, 2, 3, 4, 2, 4, 7, 10, 4 ;
geom_stop = 16, 26, 31 ;
crs = _ ;
}
bekozi commented 8 years ago

Sorry for not responding sooner. I got a summer cold early this week... Thanks @twhiteaker for the additional examples!

I reworked the nc3 ancillary variable example from above to include the three WKT geometries in https://github.com/bekozi/netCDF-CF-simple-geometry/issues/16#issuecomment-225245800. This scheme DOES NOT support compound geometries, e.g., GeometryCollection, MultiCurve, CompoundCurve, and CurvePolygon. For these types, you would need to know when the current geometry is ending (could be several if nested) and the type of the new geometry.

My turn to be dense. What is the argument for the auxiliary variable approach? It looks more cumbersome than our other proposed approaches and requires a mostly empty masked variable in geom_node_flag. The coordinate index approach also allows coordinate data to be "packed" (i.e. de-duplication of nodes by sharing indices).

coordinate_index:multipart_break_value = -1 ; coordinate_index:MultiPolygon_break_value = -6 ; coordinate_index:Polygon_break_value = -3 ;

This brings up a question about how we want to indicate geometry types. Basically, do we want to have "Point" and "MultiPoint"? "Polygon" and "MultiPolygon"? Right now, the reference implementation only uses "point", "line", and "polygon". A multi-geometry is flagged using the multi-part break value. Setting the break value to None tells the converter not to search for or create multi-geometries. Adding "MultiPoint", etc. is not really necessary but does bring us more in compliance with OGC. Thoughts?

I am also inclined to go with a single geometry type per coordinate index variable. This is similar to how a shapefile works. Basically, if there is a single multi-polygon in a collection then the entire collection is multi-polygon even though a number of the geometries will only have a single polygon. Doing this should remove the need for MultiPolygon_break_value and Polygon_break_value and only require the geom_type attribute. Again - thoughts?

twhiteaker commented 8 years ago

With ancillary variables, I was attempting to leverage an existing CF+nc3 pattern. I needed to look at this example to really make an assessment. My conclusion is that, while it would support our simple examples (point, line, polygon), it breaks for more complex geometries, and I could see people wanting to use multipolygons like the shape of Hawaii. Plus all the reasons @bekozi stated. Plus my second example more intuitively matches the nc4 approach. So, I'm finished with the ancillary exercise.

Adding "MultiPoint", etc. is not really necessary but does bring us more in compliance with OGC.

I would keep the focus on our anticipated use cases, which I think means line, polygon, and maaayyybe multipolygon, while being mindful of the other geometries for possible future extensions. With our current scheme, I think supporting other geometries would be as simple as defining some new break values and how to interpret them, so I'm comfortable with where we are.

I am also inclined to go with a single geometry type per coordinate index variable.

Agreed. This seems in line with the CF community's thinking on mixing geometry types since CF allows only a single DSG type in an entire file.

Doing this should remove the need for MultiPolygon_break_value and Polygon_break_value and only require the geom_type attribute.

Is that true? If we just have multipart_break_value then how does software know whether the next part in a multipolygon is a hole in the first polygon or the start of a second polygon? Or does software even care? It could be that software just stores sequential rings and figures out what to do with them on display, in which case @bekozi is correct. I'll see if I can tease out what ArcGIS does under the hood for an example. Wow this is down in the weeds.

twhiteaker commented 8 years ago

I made a JSON file with a multipolygon with two polys and a hole, and there was no indication of when something was a hole vs a new polygon, and ArcGIS didn't care. I even shuffled those three rings around, one time with the hole first, and ArcGIS didn't care. Maybe it just uses ring order (CCW vs CW) or does some other magic, but it appears that @bekozi is right about not needing the polygon break values for this case. Does anyone know if other software cares?

I think we're close to closing this one.

bekozi commented 8 years ago

My conclusion is that, while it would support our simple examples (point, line, polygon), it breaks for more complex geometries, and I could see people wanting to use multipolygons like the shape of Hawaii.

I agree. It was good to investigate and discuss regardless. I was not aware of the approach until this exchange.

I would keep the focus on our anticipated use cases, which I think means line, polygon, and maaayyybe multipolygon, while being mindful of the other geometries for possible future extensions.

Yeah, eyes on the prize! I think adding multipoint, etc. makes sense and eliminates the "multipart_break_value means multi-geometries only if its present and found in the index variable". I added a ticket #21. No reason to deviate from established geometry types to avoid some additional string flags.

Agreed. This seems in line with the CF community's thinking on mixing geometry types since CF allows only a single DSG type in an entire file.

:+1:

Is that true? If we just have multipart_break_value then how does software know whether the next part in a multipolygon is a hole in the first polygon or the start of a second polygon? Or does software even care?

In the reference implementation, ordering is important. All holes must be preceded by the definition of the containing polygon's exterior: https://github.com/bekozi/netCDF-CF-simple-geometry/blob/caf66845e8964a7637c450228674bdb84d2b57e3/src/python/ncsg/cf.py#L31. For multi-polygons the parsing sequence is:

  1. Split coordinate indices by the multipart_break_value. Each split is a polygon element of the multi-geometry and still includes any holes/interiors.
  2. Split the polygon indices into subplits by the hole_break_value. The first subsplit is the polygon exterior. The remaining subsplits are the holes.

You end up with some funky recursion-like stuff, but it works: https://github.com/bekozi/netCDF-CF-simple-geometry/blob/caf66845e8964a7637c450228674bdb84d2b57e3/src/python/ncsg/cf.py#L57-68. See also the ragged array example: https://github.com/bekozi/netCDF-CF-simple-geometry/issues/16#issuecomment-225245800.

I even shuffled those three rings around, one time with the hole first, and ArcGIS didn't care.

Interesting! Must be the ordering in that case? Like you suggested.

Does anyone know if other software cares?

I don't have enough experience with low-level GIS software implementations to know. I think as long as our specification produces reliable WKT-like representations of holey polygons we should be okay. Just looking to hit the lowest common denominator.

I think we're close to closing this one.

I think so!

twhiteaker commented 8 years ago

I finally read all the way through cf.py and poked around shapely docs. I see that, unlike ArcGIS, shapely DOES need to differentiate holes and multiparts upon polygon instantiation. I think we should support shapely in this regard. So, I believe the answer to this issue is:

For coordinate index variables with attribute geom_type = "polygon", if holes are present in one or more polygon features, the coordinate index values shall include a negative integer flagging the start of each hole. In this case, the variable shall include a 'hole_break_value' to indicate the value of the flag.

Also, not answering this question but useful for wiki:

For multipart geometries, the coordinate index variable shall include a negative integer flagging the start of each new part to the current feature with the exception of the first part. The variable shall include an attribute named multipart_break_value to indicate the value of the flag.

Of course, this is the conclusion that @bekozi had already reached and I'm just catching up!

If this looks good, let's close the issue and add to wiki.

bekozi commented 8 years ago

I think we're good! I inserted @twhiteaker's text snippets in the wiki more or less as they were above. We may want to consider using the CF flag specification like @twhiteaker used in some examples. I find it cumbersome having to split the string and convert the values, so I'm not really a fan. Anyway, just good to keep in mind.