twhiteaker / CFGeom

CF Convention for Representing Simple Geometry Types
MIT License

Finalize Initial Proposal #48

Closed dblodgett-usgs closed 7 years ago

dblodgett-usgs commented 7 years ago

At AGU, the feedback was generally good. Use of an indexed array to relate nodes to coordinates was well accepted because of its consistency with UGRID and support for shared coordinates. The way that Well Known Text break values are used in conjunction with the indexed array seemed odd to some people though. Without prior knowledge of WKT, the break values are foreign and the dual role of the coordinate index seems complicated.

While it requires the addition of another variable, I think it does clean up the data model to move the break value information into its own variable, akin to what Jonathan suggested. Assuming we have a coordinate index, we would need two other variables: one to indicate the type of each part and another to indicate the node start (or stop or count) of each part.

A structure that follows this logic, as I suggested in a previous thread, would look like this:

netcdf sample_poly {
dimensions:
    geom = 2 ;
    part = 3 ;
    part_nodes = 15 ;
    coordinates = 14 ;
variables:
    int geom_name(geom) ;
        geom_name:cf_role = "geom_id" ;
    char part_type(part) ;
        part_type:long_name = "g for new geometry, p for new part, h for hole." ;
    int part_start(part) ;
        part_start:long_name = "start position of each part" ;
    int node_index(part_nodes) ;
        node_index:long_name = "index into x / y coordinate data" ;
        node_index:geom_coordinates = "x y" ;
        node_index:geom_dimension = "geom" ;
        node_index:part_dimension = "part" ;
        node_index:stop_encoding = "cra" ;
        node_index:outer_ring_order = "clockwise" ;
        node_index:closure_convention = "last_node_equals_first" ;
        node_index:geom_type = "multipolygon" ;
    double x(coordinates) ;
    double y(coordinates) ;

// global attributes:
        :Conventions = "CF-1.8" ;
data:

 geom_name =
  1, 2 ;

 part_type = "g", "p", "g" ; // each g is a new feature.

 part_start = 0, 5, 9 ;

 node_index = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 6, 13 ; // one entry per geom part node.

 x = 35, 26, 25, 30, 35, 22, 10, 15, 22, 22, 30, 10, 30, 30 ; // one entry per unique coordinate pair

 y = 25, 23, 28, 30, 25, 22, 20, 25, 27, 22, 10, 15, 20, 10 ;
}
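
As a reading aid, here is a minimal Python sketch of how a client might resolve one part to its coordinate pairs under this layout. It assumes that `part_start` holds 0-based positions into `node_index` and that a part runs until the next part's start (or the end of `node_index`); neither assumption is settled in this thread. `part_coordinates` is a hypothetical helper, not part of any proposal or library.

```python
# Sample arrays copied from the CDL listing (plain lists, no netCDF library needed).
part_start = [0, 5, 9]
node_index = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 6, 13]
x = [35, 26, 25, 30, 35, 22, 10, 15, 22, 22, 30, 10, 30, 30]
y = [25, 23, 28, 30, 25, 22, 20, 25, 27, 22, 10, 15, 20, 10]


def part_coordinates(part, part_start, node_index, x, y):
    """Return the (x, y) pairs of one part (hypothetical helper)."""
    begin = part_start[part]
    # Assumption: a part ends where the next part starts,
    # or at the end of node_index for the last part.
    if part + 1 < len(part_start):
        end = part_start[part + 1]
    else:
        end = len(node_index)
    # node_index is an indirection into x / y, so coordinates can be shared.
    return [(x[i], y[i]) for i in node_index[begin:end]]


for p in range(len(part_start)):
    print(p, part_coordinates(p, part_start, node_index, x, y))
```

Note that coordinate index 6 appears in both the second and third parts, which is the shared-coordinate case the indexed array is meant to support.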

Once this issue is put to rest, we can move on to #27

twhiteaker commented 7 years ago

To me this is easier to read than the version with break values inserted into the coordinate index. The disadvantage of this approach that I see is that it doesn't support single geometry access efficiently. You have to count the "g"s until you're at the instance you want. With a kludgy test on a file with 2.7 million features (National Water Model), it took my PC 3.8 seconds to find where nodes for the 2.7 millionth feature begin. If using counts instead of start indices (motivation would be aligning with existing CF CRA practice of using count, plus Jonathan's email), it takes 5.0 seconds. This could be alleviated with the addition of a geom_start_index(geom) variable that tells me where each geometry instance (i.e., "g") is within part_type.
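
To make the access-cost point concrete, here is a rough Python sketch (not the test code used for the timings above). It contrasts scanning `part_type` for the n-th "g" with a one-time index of the kind a `geom_start_index(geom)` variable would store, and shows why counts need a running sum before a start position is known. All three function names are hypothetical, and 0-based indexing is an assumption.

```python
from itertools import accumulate

# Sample values from the proposed structure.
part_type = ["g", "p", "g"]
part_start = [0, 5, 9]


def find_geom_part_by_scan(part_type, geom):
    """Linear scan: count 'g' entries until the requested geometry is reached."""
    seen = -1
    for part, kind in enumerate(part_type):
        if kind == "g":
            seen += 1
            if seen == geom:
                return part
    raise IndexError(f"geometry {geom} not found")


def build_geom_start_index(part_type):
    """Position of each 'g' within part_type, i.e. one entry per geometry."""
    return [part for part, kind in enumerate(part_type) if kind == "g"]


def starts_from_counts(part_node_count):
    """Convert per-part node counts to start positions with a running sum."""
    return [0] + list(accumulate(part_node_count))[:-1]


# With a stored index, finding the second geometry is a direct lookup.
geom_start_index = build_geom_start_index(part_type)
first_node_of_geom_1 = part_start[geom_start_index[1]]
print(geom_start_index, first_node_of_geom_1)
```

The scan is O(number of parts) per lookup, while the stored index makes each lookup constant time at the cost of one extra variable on the geom dimension.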

I still am not aware of how important single instance access is. It seems like something you'd want to support to me, but maybe CF doesn't agree since their CRA structure doesn't support it.

dblodgett-usgs commented 7 years ago

It would be good to have a pointer to the start of each geometry. Oh the joy of shoehorning a hierarchical data model into a flat data structure. An additional variable on the geom dimension wouldn't be the worst thing.

In the case of non-multipart geoms, the part_start and part_type are superfluous and you would just have a geom_start_index(geom). I just put a geom_start:cf_role = "geom_start" variable below. Need to figure out how all these variables would be related to each other.

netcdf sample_poly {
dimensions:
    geom = 2 ;
    part = 3 ;
    part_nodes = 15 ;
    coordinates = 14 ;
variables:
    int geom_name(geom) ;
        geom_name:cf_role = "geom_id" ;
    int geom_start(geom) ;
        geom_start:cf_role = "geom_start" ;
    char part_type(part) ;
        part_type:long_name = "g for new geometry, p for new part, h for hole." ;
    int part_start(part) ;
        part_start:long_name = "start position of each part" ;
    int node_index(part_nodes) ;
        node_index:long_name = "index into x / y coordinate data" ;
        node_index:geom_coordinates = "x y" ;
        node_index:geom_dimension = "geom" ;
        node_index:part_dimension = "part" ;
        node_index:stop_encoding = "cra" ;
        node_index:outer_ring_order = "clockwise" ;
        node_index:closure_convention = "last_node_equals_first" ;
        node_index:geom_type = "multipolygon" ;
    double x(coordinates) ;
    double y(coordinates) ;

// global attributes:
        :Conventions = "CF-1.8" ;
data:

 geom_name =
  1, 2 ;

 part_type = "g", "p", "g" ; // each g is a new feature.

 part_start = 0, 5, 9 ;

 node_index = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 6, 13 ; // one entry per geom part node.

 x = 35, 26, 25, 30, 35, 22, 10, 15, 22, 22, 30, 10, 30, 30 ; // one entry per unique coordinate pair

 y = 25, 23, 28, 30, 25, 22, 20, 25, 27, 22, 10, 15, 20, 10 ;
}

bekozi commented 7 years ago

> I still am not aware of how important single instance access is. It seems like something you'd want to support to me, but maybe CF doesn't agree since their CRA structure doesn't support it.

I don't know either. I think single instance / random access is important especially with large datasets. Building in a limitation is never a good idea.

I'll need to think more about this approach. It adds a lot of instrumentation for multi-geometries which will generally be the exception. I personally don't find this more readable or easier to interpret...but that is in the eye of the beholder. I also like how break values keep geometries relatively atomic similar to WKT. This approach will also increase the variable count when multiple geometry types are stored in a single group. I also find character arrays burdensome in lower-level programming languages.

Mostly negative so far but, like I said, I need to mull it over.

One question for now: What is the difference between a "geometry" and a "part"? A multi-polygon may have two "parts". Is the first "part" the "geometry"?

dblodgett-usgs commented 7 years ago

Yeah, the g would signify the first part of a multipart or the only part of a singlepart geometry.
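
A small Python sketch of that reading, under the assumption that "p" and "h" entries always attach to the most recent "g". `group_parts` is a hypothetical helper used only for illustration.

```python
def group_parts(part_type):
    """Group part positions by geometry: each 'g' opens a new geometry,
    and 'p' (part) or 'h' (hole) entries join the geometry opened last."""
    geoms = []
    for part, kind in enumerate(part_type):
        if kind == "g":
            geoms.append([part])
        elif kind in ("p", "h"):
            if not geoms:
                raise ValueError("part_type must begin with 'g'")
            geoms[-1].append(part)
        else:
            raise ValueError(f"unknown part type: {kind!r}")
    return geoms


# Sample from the structure above: a two-part geometry, then a single-part one.
print(group_parts(["g", "p", "g"]))
```

For the sample `part_type`, the first geometry owns parts 0 and 1, and the second geometry owns part 2.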

I see it both ways. All valid arguments. I think we are getting to the bottom of the trade-offs. The fact that break values 'keep geometries relatively atomic' is important. The 'instance' (geom) dimension is the dominant one, and with the break values approach we keep things from getting too sprawling by embedding the parts into break values.

Maybe we roll out a design that breaks everything down like this, point out all the issues with it, and show how the coordinate index / break values approach solves a bunch of problems while maintaining the DSG 'instance' dimension hook into time series?

bekozi commented 7 years ago

> Yeah, the g would signify the first part of a multipart or the only part of a singlepart geometry.

Okay. Thanks.

> I see it both ways. All valid arguments.

Yes. We need to decide what the right way is and how we'd like to adapt for CF.

> Maybe we roll out a design that breaks everything down like this, point out all the issues with it, and show how the coordinate index / break values approach solves a bunch of problems while maintaining the DSG 'instance' dimension hook into time series?

That would definitely be beneficial. It's good to show we've considered other options. And, we may convince ourselves it is the way to go in the end.

dblodgett-usgs commented 7 years ago

Alright. I'm comfortable pushing forward with the coordinate index / break value approach and plan on explaining it in the context of two other options: 1) encoding WKT/WKB in NetCDF directly, and 2) encoding all the parts and geometries explicitly, using multiple variables / dimensions in place of break values.

Will close this issue and focus on #27 - I got a start on the modifications to the spec for discussion over there.