CRA stop index - Githubissues

dblodgett-usgs commented 7 years ago

From conversation in #40 @twhiteaker said:

The CRA examples follow Dave's convention (one based, stop on last index), whereas the readme seems to be zero based and stopping one past the last index. I don't care which way we do it, but we had better be clear and consistent about it. ...Well, maybe I do care. Stopping one past the last index is more Python friendly, but I like stopping at the last index for human readability which I think is more in line with CF.

I really don't care and would be willing to switch my implementation. Don't remember where I got the impression that we were pointing to the last index rather than the first index of the next geometry but I thought that was what we were doing.

... I hate 0 vs 1 indexing by the way. With a passion.

twhiteaker commented 7 years ago

@BobSimons I think you were the one to propose storing the first index of the next geometry as the values of the stop index. Would you like to comment on your reasoning for doing so?

I like "first index of the next geometry" for Python array slicing, but I think "last index of current geometry" is more human readable and thus I suspect more acceptable to CF.

dblodgett-usgs commented 7 years ago

So what's our disposition on this? I think we should just go with one or the other and move out with the proposal. If CF disagrees one way or the other, that's fine. Someone make a decision and let's close this issue!

twhiteaker commented 7 years ago

Suppose instead of a cra_stop variable, we have a cra_start variable which stores the start index? The issue of zero- or one-based indexing goes away (kind of) because you just see what the first cra_start value is; it'll be zero or one. This also resolves the issue of "last index of current geom" vs "start index of next geom" because we're explicitly saying we're storing start index of current geom. And it is Python friendly (my bias) because you would slice for a single geom like this: cra_start[current_geom_index:next_geom_index]

bekozi commented 7 years ago

@twhiteaker We should already have a start_index attribute on the geometry variable. It may not have made its way into the draft yet. I don't think there is any reason to have different start index values for the coordinate index and the ragged stop variable. Won't this be sufficient? I don't think we should add another variable even though it would be more convenient.

I'm going to argue for 1 + element_stop_index + start_index. As @dblodgett-usgs suggests, lets decide and move forward. It's simple to change.

dblodgett-usgs commented 7 years ago

NOTE I Edited this so geometry is used consistently and part is used instead in some places.

I wonder if we are going about this wrong... (I'm spitballing here, I honestly think we should move forward with what we have on the poster for AGU, but we need to continue to consider all these options)

There are two reasons to point to specific indices in the coordinate variables.

To allow polygons to share nodes
To make per-feature data access implementation easier

Reason two isn't a good one because it's so easy to construct that index from a start/stop position and a hole/multi-part/new-geometry indicator for each start/stop position. I'm imagining two variables, one with start or stop positions and another with hole/multi/new indicators.

Reason 1 is fairly boutique and probably shouldn't govern here. Unless I'm missing something, there are very few transport formats that support shared nodes at the encoding level. I could see adding some kind of 'shared node' indicator that would point to nodes in the coordinate array that are the same so it's you wouldn't have to do some similarity search.

So what if we rethought this just a bit? Looking at the actual dimensionality of our multi-geometry data,

we have a 'geometry dimension' (instance in DSG) which is the number of rows in an attribute table
we have a 'part dimension' which is the number of unique linestrings.
and we have a coordinate dimension which is the number of total coordinates in all the linestrings. Let's call these g, p, and c.

Note: p and c are different with VLen. They wouldn't be NetCDF dimensions, but they are still dimensions of the underlying information. Each g has some number of ps and some larger number of cs.

g is more or less the shared dimension between multi-geometries and (potentially multi-dimensional) attributes.

p = g for singlepart geometries, but is otherwise only needed to carry information about which member of g an instance along p is and whether the instance on p is a hole.

c carries the coordinate data its self and shouldn't HAVE to have the same length as the number of nodes in all the features, but would normalize to that length.

Thinking first principles here, what if we only had variables on these dimensions?

We would have some g feature attributes / time series data.
We would have some p variable(s) that indicate how the geometries are broken apart and/or map onto the g dimension.
We would have some c variables that hold all the actual coordinate information.

g I don't think we need anything special on the g dimension in addition to existing DSG stuff.

p There are a number of ways of doing p variables.
If it's ordered like a WKT string, we could use it to hold the break values, including breaks between geometries. Those could be letters? Say... 'g' for geometry, 'p' for part, and 'h' for hole? A second p variable needs to map onto the c dimension. That could be 'count', 'start', or 'stop'.
Holding to this thought exercise, of having variables on these three dimensions, this should be all we need on this dimension.
If things were not strictly ordered according to the g dimension and we want to use letters to indicate geometry type, we would need an additional variable that says which member of g each member of p belongs to. Indices along g for each geometry and -1, -2s for each part would cover this.

c For c variables, it's pretty clear that the actual coordinates for all the geometries go in here.
To support shared coordinates, we could take an approach like we are now, where there's a 'collapsed' c dimension, call it cc and a 'full' c dimension that maps into the collapsed dimension. Alternatively, you could have a topology variable that would be (# of shared coordinates) X 2 and would just indicate which members of c are meant to be the same. I like this approach because it allows non-topology aware clients (most?) to work and is just an add on for those who want to track topology.

If there is a proposal (or central idea) here that might nudge what we have now, I think it is this:

Introduce a character variable like Jonathan's OOIOOIIIOOOO that is the length of p and ordered along g. I like the character route with shared order, but could be convinced that integers pointing to specific members of g is a better idea.
Adopt the 'count' DSG paradigm directly, with a count variable along the p dimension.
Require all node data be encoded along c regardless of topology.
Introduce an optional topology variable that indicates coordinate pairs that need to be interpreted as the same by clients that can.

This would get rid of our CRA that reverse engineers WKT. I increasingly think it's complicating things unnecessarily.
For Vlen, the same approach could be taken but the character variable would be VLen, the count would go away, and the topology variable would be (# of paired coordinates) X 4 to say g, c maps onto g, c.

Hopefully this is half way helpful for others. It was for my thought process. I wanted to walk through a thought experiment using only g, p, and c dimensions for the sake of doing it.

dblodgett-usgs commented 7 years ago

@bekozi - By saying

1 + element_stop_index + start_index

You are saying the start position of a given feature is 1 + (the value of the element_stop_index for that feature/instance) + (the value of the start_index attribute) ?

I agree that is how we should move forward for now.

twhiteaker commented 7 years ago

@bekozi Rather than a start_index attribute, I was suggesting we rename the stop_of_geom variable to start_of_geom. I converted our CF response CDL to illustrate this. I think this avoids the ambiguities we've encountered with stop_of_geom.

geom=3;
index=44;
node=36;
int start_of_geom(geom);
  start_of_geom:contiguous_ragged_dimension = "index"
int coordinate_index(index);
  coordinate_index:geom_type="multipolygon";
  coordinate_index:geom_coordinates="x y";
  coordinate_index:geom_dimension="geom";
  coordinate_index:stop_encoding="cra";
  coordinate_index:multipart_break_value=-1;
  coordinate_index:hole_break_value=-2;
  coordinate_index:outer_ring_order="anticlockwise";
  coordinate_index:closure_convention="last_node_equals_first";

float x(node);
float y(node);
start_of_geom=1, 25, 38;
coordinate_index=1,2,3,4,-2,5,6,7,-2,8,9,10,-2,11,12,13,-1,14,15,16,-1,17,18,19,
20,21,22,-1,23,24,25,26,27,-2,28,29,30,
31,32,33,-1,34,35,36;
x=0, 20, 20, 0, 1, 10, 19, 5, 7, 9, 11, 13, 15, 5, 9, 7, 11, 15, 13, 
-40, -20, -45, -20, -10, -10, -30, -45, -30, -20, -20, 
30, 45, 10, 25, 50, 30;
y = 0, 0, 20, 20, 1, 5, 1, 15, 19, 15, 15, 19, 15, 25, 25, 29, 25, 25, 29,
-40, -45, -30, -35, -30, -10, -5, -20, -20, -15, -25, 
20, 40, 40, 5, 10, 15;

@dblodgett-usgs Can you work up a simple CDL? I really want to understand your igc approach but am having trouble wrapping my mind around it.

I'm fine with going with whatever ya'll want for AGU. It's more urgent that we finish the poster (and lightning talk!), and sometimes presenting a method that has some drawbacks results in insightful feedback.

I'll be out for the rest of today and all of Thursday, so please don't let me hold up the process.

bekozi commented 7 years ago

Rather than a start_index attribute, I was suggesting we rename the stop_of_geom variable to start_of_geom. I converted our CF response CDL to illustrate this. I think this avoids the ambiguities we've encountered with stop_of_geom.

I see. I like change from stop to start. It removes the stop or stop+1 issue. However, the start_index relates directly to the coordinate index values (i.e. does 1 in the coordinate index mean the first or second element in the coordinates array). This also applies to the geometry start index. I just think start_index is here to stay as it will affect the indexing. Unless we enforce zero- or one-based indexing at the spec level.

The basic access paradigm for a Python client is:

coordinate_index = coordinate_index - start_index
start_of_geom = start_of_geom - start_index

@dblodgett-usgs: Are you okay switching from stop to start? I don't think we need to do this for AGU.

Can you work up a simple CDL? I really want to understand your igc approach but am having trouble wrapping my mind around it.

I would like to see CDL or something of the like. I am in the same boat as @twhiteaker.

dblodgett-usgs commented 7 years ago

I'm fine switching to start index. I can make the change in my reference implementation and update things tonight.

dblodgett-usgs commented 7 years ago

NOTE: edited to clarify usage of geom and part.

Here's an example of what I'm talking about:

netcdf sample_poly {
dimensions:
    pair = 2 // for pairing up coordinate indices
    topo = 1 / only one pair in the example below
    geom = 2 ;
    part_index = 3 ;
    coordinates = 15 ;
variables:
    int geom_name(geom) ;
        geom_name:cf_role = "geom_id" ;
    char part_type(part_index) ;
        part_type:long_name = "g for new geometry, p for new part, h for hole." ;
    int part_count(part_index) ;
        part_count:long_name = "number of cordinate pair nodes in each part" ;
    double x(coordinates) ;
    double y(coordinates) ;
    int topo_pairs(pair, topo) ;

// global attributes:
        :Conventions = "CF-1.8" ;
data:

 instance_name =
  1, 2 ;

 part_type = "g", "p", "g" ; // each g is a new feature.

 part_count = 5, 5, 5 ;

 x = 35, 26, 25, 30, 35, 22, 10, 15, 22, 22, 30, 10, 22, 30, 30 ;

 y = 25, 23, 28, 30, 25, 22, 20, 25, 27, 22, 10, 15, 22, 20, 10 ;

 topo_pairs = 6, 13 ;

Here's the WKT to go with it. GEOMETRYCOLLECTION(MULTIPOLYGON (((35 25, 30 30, 25 28, 26 23, 35 25)), ((22 22, 22 27, 15 25, 10 20, 22 22))), POLYGON ((30 10, 30 22, 22 20, 10 15, 30 10)))", id = c(1,2))

bekozi commented 7 years ago

Thanks @dblodgett-usgs. Where is the geometry type stored (point, multipoint, polygon, etc.)? Is it intended to be stored in geom_type? This appears to be working from the CF response letter in regards to geometry and feature collections - expanding the geometry type static attribute to a vector of geometry types. I see complexity going up here I admit (I don't view @BobSimons's CRA as horrifically complex). I know of few practical applications for geometry collections. Do you have some good ones?

Reason 1 is fairly boutique and probably shouldn't govern here. Unless I'm missing something, there are very few transport formats that support shared nodes at the encoding level. I could see adding some kind of 'shared node' indicator that would point to nodes in the coordinate array that are the same so it's you wouldn't have to do some similarity search.

I don't know of any scientific mesh formats that do not allow for node sharing. Using existing GIS transport formats as examples is a weak basis for discounting node sharing as most break down with large data quantities. Thinking in terms of large-scale should be a first consideration.

Including topology as a first-class citizen in the format will help if there is future interest in other forms for topology such as neighbors, etc.

Reason two isn't a good one because it's so easy to construct that index from a start/stop position and a hole/multi-geom/new-geom indicator for each start/stop position. I'm imagining two variables, one with start or stop positions and another with hole/multi/new indicators.

This approach makes accessing the second geometry a bit interesting. One would need to have an independent counter looping through geom_type to find the second occurrence of g. The main loop counter would then be used to find out how many geom_counts to sum up to get the indexing for the coordinates. Am I missing something? Very possible.

I like this approach because it allows non-topology aware clients (most?) to work and is just an add on for those who want to track topology.

As far as I know, we are the only clients using this! Hence, we can be topology-aware. I'm starting to be at a loss why there is significant push-back against an index variable into a coordinates array. Were I to generate archive files for large geospatial datasets in NetCDF, I would absolutely want node de-duplication. I would return to a modded UGRID without this capability.

Some other thoughts/questions:

There is a strong argument for an approach like this if we want to support geometry collections.
I see repetition for statically typed collection - needing to repeat "multilinestring" for each g or new geometry.
I find geom_count and geom_type misleading in this CDL as well. Other names should be considered.
Would you allow for multiple geom_type variables in this spec?
This approach feels more verbose than what we have so far.
A strength of the approach we have is the ability to also easily store bounding boxes for any geometry variable (i.e. as another geometry variable). How would this work in this framework?

This is not meant to be entirely negative. I have seen UGRID-like mesh formats work well for high-performance GIS and regridding so am biased in their favor (readily interpretable in low-level Fortran/C with multi-geometries to boot). Hence, this is why I am a strong proponent of approaches mimicking those as much as possible for storing a broader set of geospatial data. I need to convince myself @dblodgett-usgs's proposed approach is superior in this regard!

dblodgett-usgs commented 7 years ago

NOTE I Edited this so the use of geometry is cleaned up. Sorry for the confusion.

I left a few things out in order to get the gist accross.

Where is the geometry type stored... ?

What I have as part_type would be a logical place. If we require an ID variable, that would be a good place to store all the information about a set of geometries.

Including topology as a first-class citizen in the format will help if there is future interest in other forms for topology such as neighbors, etc.

I don't disagree and I think this will end up being THE reason to use an approach with explicit node indexing as we have pursued.

This approach makes accessing the second geometry a bit interesting.

Let's be clear about some lingo here. I use part for lack of a better term for an independent linestring or polygon. I use feature to refer to multilines or polygons. Sorry for the confusion... we need better words. R's sp object uses Polygon and polygon <- horrible!
To your point, you would have to read in and interpret the count and type vectors. Presumably you'd construct an implementation specific index that let you random access the individual features and/or geometries. This thought experiment was prompted by an observation that our approach is getting pretty implementation specific and it might benefit from a more data-centric approach.

I'm starting to be at a loss why there is significant push-back against an index variable into a coordinates array.

Me too, to be honest. That's why I think we should run what we brung and see if the push back is real or not. I think the real issue I'm still having is interspersing break values so the coordinate index is longer than the coordinates. It just feels better to have variables of length features/instances, lines/polygons, and nodes.

Again, just spitballing. What about something like:

netcdf sample_poly {
dimensions:
    geom = 2 ;
    part_index = 3 ;
    part_nodes = 15 ;
    coordinates = 14 ;
variables:
    int geom_name(geom) ;
        geom_name:cf_role = "geom_id" ;
    char part_type(part_index) ;
        part_type:long_name = "g for new geometry, p for new part, h for hole." ;
    int part_start(part_index) ;
        part_start:long_name = "start position of each part" ;
    int node_index(part_nodes) ;
        part_index:long_name = "index into x / y coordinate data" ;
    double x(coordinates) ;
    double y(coordinates) ;

// global attributes:
        :Conventions = "CF-1.8" ;
data:

 instance_name =
  1, 2 ;

 geom_type = "g", "p", "g" ; // each g is a new feature.

 part_start = 0, 5, 9 ;

 node_index = 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 6, 13 ; // one entry per geom part node.

 x = 35, 26, 25, 30, 35, 22, 10, 15, 22, 22, 30, 10, 30, 30 ; // one entry per unique coordinate pair

 y = 25, 23, 28, 30, 25, 22, 20, 25, 27, 22, 10, 15, 20, 10 ;

dblodgett-usgs commented 7 years ago

Having slept on this and read through this, let's go with the C/Python style index-as-offset. OPeNDAP and NetCDF-C are both 0 indexed so we should probably use that.

I also think it's a really good idea to switch to start index instead of stop.

dblodgett-usgs commented 7 years ago

Final comment on this one for now. I updated a bunch of the language above. I totally messed up my usage of geometry. Everything now uses 'part' to indicate a multi-geometry part. I actually really like the last CDL I wrote up that uses a node index and a 'collapsed' coordinate dimension. I've updated that CDL to use 0 indexing and renamed things a bit from last night. Still think we should move forward with what we have for AGU, but this idea is growing on me since it sticks more faithfully to the dimensionality of the data and doesn't mix 'break values' with index values.

twhiteaker commented 7 years ago

Thanks for the CDL. With the addition of some attributes to tie the variables together, I think this approach could be as viable as our current approach. The best choice depends on priorities. Our current approach immediately enables single geometry access whereas this one does not. I keep thinking of the national water model whose output includes 2.7 million river reaches. I have not done performance testing to see how long it would take to build my own index on 2.7 million features so I can grab an individual geometry. Maybe it's insignificant. Or maybe the NWS would split results by HUC since 2.7 million river geometries would require lots of the GBs anyway.

On the other hand, the CDL above supports multiple geometry types. I don't see a reason to support that, and it sure makes clients and conversions for many popular geospatial formats and software easier if we don't support that, but there it is in case anyone wants it. This approach is also closer to Jonathan's CDL, with the main difference being that he used counts and we're using indexing. Indexing lets us reuse nodes; I wonder if CF community cares about that.

bekozi commented 7 years ago

Let's hash this out in some calls early next year. I'll schedule them in early January. There are a lot of good ideas and things we need to consider. I feel we should try and reach consensus in the next few months!

@dblodgett-usgs / @twhiteaker: Think we need a poster reorg? The flow does not tell a great story. I'm 50/50. All the content is there and is definitely sufficient for talking points. One reorg option:

Justification & Motivation moved to left column.
Storing geometries in NetCDF moved to middle.
Roadmap moved to right.
Contiguous... and Proposed... moved to central column.
Hydrologic Use Case... moved to right.

@twhiteaker: If you're interested (and @dblodgett-usgs for that matter), here is the CDL for NWM catchments in ESMF Unstructured Format. The encoding uses multi-geometry break values as many of the catchments are in fact multi-polygons. Converting into this format from a File Geodatabase feature class was interesting. It required a parallel implementation and node-sharing was scrapped in pursuit of lower conversion times. Working with these data structures in parallel should be a consideration. How easy is it to break the data into chunks? Distributed v. shared memory? More index/stop variables means more things to track across tasks and locally normalize, etc., etc.

netcdf ESMF_Unstructured_Spherical_NHDPlusV21_National_Seamless_20160825_1505 {
dimensions:
    nodeCount = 482710357 ;
    elementCount = 2647454 ;
    coordDim = 2 ;
    connectionCount = 483079766 ;
variables:
    double nodeCoords(nodeCount, coordDim) ;
        nodeCoords:units = "degrees" ;
    int elementConn(connectionCount) ;
        elementConn:long_name = "Node indices that define the element connectivity." ;
        elementConn:polygon_break_value = -8L ;
        elementConn:start_index = 0L ;
    int numElementConn(elementCount) ;
        numElementConn:long_name = "Number of nodes per element." ;
    double centerCoords(elementCount, coordDim) ;
        centerCoords:units = "degrees" ;
    int GRIDCODE(elementCount) ;
        GRIDCODE:long_name = "Element unique identifier." ;
    double elementArea(elementCount) ;
        elementArea:units = "degrees" ;
        elementArea:long_name = "Element area in native units." ;

// global attributes:
        :_NCProperties = "version=1|netcdflibversion=4.4.1|hdf5libversion=1.8.17" ;
        :gridType = "unstructured" ;
        :version = "0.9" ;
        :coordDim = "longitude latitude" ;
}

twhiteaker commented 7 years ago

More index/stop variables means more things to track across tasks and locally normalize, etc., etc.

For that case, it seems like counts would be easier to manage than indices.

Think we need a poster reorg?

I suggest finishing lightning talk slides and then turning back to poster reorg if time permits. A reorg would help but I think the poster as it is would suffice.

dblodgett-usgs commented 7 years ago

All sounds good. I'm going to close this issue for now and we should revisit some of these themes after AGU.

twhiteaker / CFGeom

CRA stop index #46

NOTE I Edited this so geometry is used consistently and part is used instead in some places.

NOTE: edited to clarify usage of geom and part.

NOTE I Edited this so the use of geometry is cleaned up. Sorry for the confusion.