Document indexing system for geometry coordinates

bekozi commented 8 years ago

Describe the indexing system used to map coordinates, breaks, and holes.

bekozi commented 8 years ago

Comment from @BobSimons originally here: https://github.com/bekozi/netCDF-CF-simple-geometry/issues/8#issuecomment-223698709.

I've been thinking about this project this week. Your comments are a good lead in to what I've been thinking about.

It seems that our solution here is to use an indexed array as a way of adding support for a ragged array so that we can efficiently store Simple Geometries in our own encoding of WKT in .nc3 files, since .nc3 files don't support ragged arrays. That's fine. And the problem goes away in .nc4 because ragged arrays are available.

Similarly, it seems like Section 9 of CF (Discrete Sampling Geometries) is all about how to add support for ragged arrays in .nc3 files so that we can efficiently store ragged array DSG data in .nc3 files, since .nc3 files don't support ragged arrays. That's fine. And the problem goes away in .nc4 because ragged arrays are available.

It seems like there is a general problem: people want to use ragged arrays for various data structures. .nc4 has support for ragged arrays. .nc3 doesn't. We seem to be adding support for ragged arrays on a case-by-case basis, and using slightly different systems.

Shouldn't we break the Simple Geometry problem/proposal into two parts:

- Make a proposal for a general purpose system for ragged arrays in .nc3 files. Use cases: in addition to DSG and Simple Geometries, the most obvious use case is Strings. It would allow CF to add support for variable-length Strings. This would be a huge bridge between .nc3 and .nc4, allowing .nc3 to gain one of the big features of .nc4. My first pass proposal is: add support for two alternative systems:
  1. A generalized version of attributes and indexes as currently used by DSG's Contiguous Ragged Array system.
  2. A generalized version of attributes and indexes as currently used by DSG's Indexed Array system.
- Make a proposal for storing our encoding of WKT in one of these new, standardized ragged arrays (or WKT/B itself as a variable length String). I have more to say about this, but you get the basic idea.
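For readers less familiar with the two DSG layouts named above, here is a minimal sketch of how each one maps a flat array back to features. It assumes the usual DSG reading: the contiguous layout stores a per-feature element count, and the indexed layout stores, for every element, the index of the feature it belongs to. The function names and sample values are made up for illustration and are not part of any proposal.

```python
from itertools import accumulate


def counts_to_feature_index(counts):
    """Expand per-feature counts into a per-element feature index."""
    index = []
    for feature, n in enumerate(counts):
        index.extend([feature] * n)
    return index


def split_by_counts(values, counts):
    """Contiguous layout: each feature is one unbroken run of the flat array."""
    stops = list(accumulate(counts))
    starts = [0] + stops[:-1]
    return [values[a:b] for a, b in zip(starts, stops)]


def group_by_feature_index(values, feature_index, n_features):
    """Indexed layout: elements of a feature may appear anywhere in the flat array."""
    groups = [[] for _ in range(n_features)]
    for value, feature in zip(values, feature_index):
        groups[feature].append(value)
    return groups


# Illustrative data only: two features with 2 and 3 elements.
values = [1.0, 2.0, 3.0, 4.0, 5.0]
counts = [2, 3]

feature_index = counts_to_feature_index(counts)
contiguous_groups = split_by_counts(values, counts)
indexed_groups = group_by_feature_index(values, feature_index, len(counts))
```

The contiguous layout needs one integer per feature, while the indexed layout needs one integer per element but tolerates elements arriving out of order.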

Thoughts? Love it? Hate it?

bekozi commented 8 years ago

Responses below. Only opinions of course.

> I've been thinking about this project this week. Your comments are a good lead in to what I've been thinking about.

A dangerous pastime.

> It seems that our solution here is to use an indexed array as a way of adding support for a ragged array so that we can efficiently store Simple Geometries in our own encoding of WKT in .nc3 files, since .nc3 files don't support ragged arrays. That's fine. And the problem goes away in .nc4 because ragged arrays are available.

Yes, that is the idea. The mapping used in the reference implementation follows UGRID: http://ugrid-conventions.github.io/ugrid-conventions/#2d-flexible-mesh-mixed-triangles-quadrilaterals-etc-topology. The conversions I’ve used for many-noded, variable-length polygons use ragged arrays first, always (at least a ragged array-like data structure). If the client is requesting an .nc3 file, then the ragged edges are padded with a fill value (like current UGRID). This is very inefficient for polygons of varying length. This inefficiency more or less disappears with file compression. However, it remains a very inefficient rectangular array when loaded from file. Nc3 support is kind of baked-in for node indexing in this regard and could be added to the reference implementation once reading/writing netCDF files is added.
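The padding cost described above can be sketched in a few lines. This is only an illustration, assuming a hypothetical fill value of -1 and made-up node index rows; it is not taken from the reference implementation.

```python
FILL = -1  # hypothetical fill value, chosen only for this sketch


def pad_ragged(rows, fill=FILL):
    """Pad every row to the length of the longest row."""
    width = max(len(r) for r in rows)
    return [list(r) + [fill] * (width - len(r)) for r in rows]


def padding_fraction(rows):
    """Fraction of the rectangular array occupied by fill values."""
    width = max(len(r) for r in rows)
    total = width * len(rows)
    used = sum(len(r) for r in rows)
    return (total - used) / total


# Three polygons with 3, 8 and 4 node indices (illustrative values).
node_index = [[0, 1, 2], [3, 4, 5, 6, 7, 8, 9, 10], [11, 12, 13, 14]]

padded = pad_ragged(node_index)
wasted = padding_fraction(node_index)
```

In this small case 9 of the 24 cells are fill values. The fraction grows as polygon lengths diverge, which is the in-memory cost that file compression does not remove.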

> It seems like there is a general problem: people want to use ragged arrays for various data structures. .nc4 has support for ragged arrays. .nc3 doesn't. We seem to be adding support for ragged arrays on a case-by-case basis, and using slightly different systems.

I think our approach supports ragged and rectangular. Storing efficiently with rectangular arrays with variable-length geometries is another story. Is the Discrete Sampling Geometry approach efficient in this regard? I know you were building to the nc3-nc4 ragged array bridge here, but I just wanted to repeat.

> Make a proposal for a general purpose system for ragged arrays in .nc3 files. Use cases: in addition to DSG and Simple Geometries, the most obvious use case is Strings. It would allow CF to add support for variable-length Strings. This would be a huge bridge between .nc3 and .nc4, allowing .nc3 to gain one of the big features of .nc4.

Definitely a cool idea. I need to understand DSG better. This sounds a bit independent from simple geometries but something it could definitely use for .nc3 support.

> Make a proposal for storing our encoding of WKT in one of these new, standardized ragged arrays (or WKT/B itself as a variable length String). I have more to say about this, but you get the basic idea.

Storing WKT as a string would be an option with the ragged array API. Storing only WKT limits the format’s applicability and also balloons storage, since coordinates are stored in ASCII-like encodings. Some linestrings and polygons can get quite large. I think WKT/WKB would make for great extensions, but we should focus on storing coordinate arrays directly at first. Does that make sense? WKB is not human-readable, which is a problem…
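A rough way to see the storage point is to compare the byte length of a WKT string with the same coordinates packed as 8-byte doubles. The sketch below uses made-up coordinates with nine decimal places. The helper names are hypothetical, and the outcome depends on coordinate precision: short, rounded coordinates can be smaller as text.

```python
import struct


def wkt_linestring(coords):
    """Encode (x, y) pairs as a WKT LINESTRING using shortest round-trip reprs."""
    body = ", ".join(f"{x!r} {y!r}" for x, y in coords)
    return f"LINESTRING ({body})"


def packed_doubles(coords):
    """Encode the same coordinates as little-endian 8-byte doubles."""
    flat = [v for xy in coords for v in xy]
    return struct.pack(f"<{len(flat)}d", *flat)


# Illustrative coordinates only.
coords = [
    (-70.123456789, 45.987654321),
    (-71.234567891, 45.512345678),
    (-62.345678912, 52.123456789),
]

text_bytes = len(wkt_linestring(coords).encode("ascii"))
binary_bytes = len(packed_doubles(coords))
```

Here the binary form is a fixed 16 bytes per node, while the text form grows with the number of digits written.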

> Love it? Hate it?

I’m somewhere on the interpolated (thin) line between. Again, I need to get a better appreciation for DSG.

dblodgett-usgs commented 8 years ago

I'll mostly echo what @bekozi said.

Near term, we are copying the already established DSG patterns for ragged arrays (rectangular and long styles) warts and all. The understanding is that when CF 2.0 supports NetCDF-4 data types (vlen) we'll have a more elegant encoding.

The idea of using strings for WKT/B representation came up and was shot down pretty quickly. It would more-or-less imply an implementation technology-space of the geospatial data domain, leaving many domain scientists to do some really heavy lifting to parse and work with the data.

Think we are all on the same page here for the most part. I need to spend more time with the readme content that I started drafting up. Will try and more clearly link it to the DSG spec for timeSeries.

Dave

BobSimons commented 8 years ago

Here is a slightly revised and more complete proposal with a complete example at the bottom:

A Proposal for Variable Length (VLen, AKA Ragged) Dimensions and Arrays in NetCDF 3.0.

This is a general system for vlen dimensions and arrays that is very similar to the Discrete Sampling Geometry (DSG) Contiguous Ragged Array (CRA) system.

Use cases:
- The ugrid proposal currently being developed includes a custom system for vlen arrays.
- The Simple Geometries proposal currently being developed includes a custom system for vlen arrays. A given geometry may have a few nodes or 10's of 1000's.
- Discrete Sampling Geometry. This would be an alternative to the current system for Contiguous Ragged Arrays.
Rationale: In all of the use cases, a standard NetCDF 3.0 system for vlen dimensions and arrays would offer a much more compact, appropriate, and standard data structure for the vlen data. It would be a general solution to the use cases above and a wide variety of other use cases. It would allow the CF 1.0 standard to avoid having different systems for ragged arrays in different sections of the standard.

Since vlen dimensions/arrays and vlen Strings are already supported by the NetCDF-C and -Java API's, it should be possible to enhance the library to use the same methods/interface to support reading vlen dimensions/arrays and Strings from an enhanced NetCDF-3 file.

Since vlen dimensions/arrays are in NetCDF-4 files and will presumably be in CF 2.0, this proposal offers a bridge between NetCDF-3 and NetCDF-4 files. This extends the life and usefulness of NetCDF-3 files.

Sample CDL showing 2 variables (polyLat and polyLon) that share the same vlen dimension:

    dimension
      timeseries=?;
      obs=?;
    int timeseries_stop(timeseries);
      :contiguous_ragged_dimension = "obs" ;
    double polyLat(obs);
    double polyLon(obs);

The variable names are not relevant to this proposal.
The attribute "contiguous_ragged_dimension" and its value (the name of a dimension) are the key to this proposal. A variable with this attribute MUST be an integer type (byte, short, or int) and may be a scalar or multidimensional.
Any variable that uses that named dimension is to be interpreted using the information encoded in timeseries_stop (in this case, the stop index for each timeseries feature), that is, in the variable that has the "contiguous_ragged_dimension" attribute for that dimension. [The DSG CRA system encodes the length of each feature. By storing the stop index of each feature instead, we can make each vlen chunk randomly accessible.]
More than one variable in a file can have a "contiguous_ragged_dimension" attribute, but each such attribute must have a different value.
If a variable with "contiguous_ragged_dimension" is a multidimensional variable, then you treat it as a 1-dimensional array by looking at all the dimension permutations in row-major order. (See the example below.)
This gives us a system in CF 1.0/netcdf-3 to parallel vlen dimensions/arrays in CF 2.0/netcdf-4.
The approach described here works when the vlen dimension is the rightmost dimension. It is possible to chain these structures to have the vlen dimension in a different location or have a variable with multiple vlen dimensions.
A complete multidimensional example:

If you wanted to store a 3-D array with [dimA=2][dimB=3][vlen=*] doubles, with
    [0,0]={12.0, 17.0}
    [0,1]={14.0, 9.0, 3.0}
    [0,2]={ 5.0}
    [1,0]={ 8.0}
    [1,1]={ 2.0, 16.0, 15.0}
    [1,2]={ 7.0, 1.0}
or stated another way:
    [0,0,0]=12.0 [0,0,1]=17.0
    [0,1,0]=14.0 [0,1,1]=9.0 [0,1,2]=3.0
    [0,2,0]=5.0
    [1,0,0]=8.0
    [1,1,0]=2.0 [1,1,1]=16.0 [1,1,2]=15.0
    [1,2,0]=7.0 [1,2,1]=1.0
then

The CDL for the file is:
    dimensions:
      dimA = 2;
      dimB = 3;
      dimC = 12;
    int var1(dimA, dimB);
      :contiguous_ragged_dimension = "dimC" ;
    double var2(dimC);
The size of each of the vlen sub arrays is {{2, 3, 1},{ 1, 3, 2}}
var1 would have {{2, 5, 6},{ 7, 10, 12}}
which is the stop index (1 past the last index) for each vlen chunk.
var2 would have {12.0, 17.0, 14.0, 9.0, 3.0, 5.0, 8.0, 2.0, 16.0, 15.0, 7.0, 1.0}
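The multidimensional example can be checked with a short sketch. It assumes the reading given above: var1 is flattened in row-major order and holds cumulative stop indices into var2. The helper name vlen_chunk is invented for this sketch.

```python
from itertools import accumulate

dim_a, dim_b = 2, 3

# Sizes of the vlen sub arrays and the flat data, copied from the example.
sizes = [[2, 3, 1], [1, 3, 2]]
var2 = [12.0, 17.0, 14.0, 9.0, 3.0, 5.0, 8.0, 2.0, 16.0, 15.0, 7.0, 1.0]

# Flatten the sizes in row-major order and accumulate them into stop indices.
flat_sizes = [n for row in sizes for n in row]
flat_stops = list(accumulate(flat_sizes))

# Reshape the stops back to (dimA, dimB); this is what var1 would hold.
var1 = [flat_stops[i * dim_b:(i + 1) * dim_b] for i in range(dim_a)]


def vlen_chunk(i, j):
    """Return the vlen chunk at [i, j] using only stop indices."""
    k = i * dim_b + j                       # row-major position in var1
    stop = flat_stops[k]
    start = flat_stops[k - 1] if k > 0 else 0
    return var2[start:stop]
```

For instance, vlen_chunk(0, 1) looks at flat positions 0 and 1 of var1 (values 2 and 5) and slices var2[2:5].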

bekozi commented 8 years ago

@BobSimons Nice. I don't follow it entirely to be sure. :confounded: I think this should be shifted to maybe a wiki page so it doesn't get lost in the tickets. Is that the best spot for this?

BobSimons commented 8 years ago

I'll think about where it should go. I'm sorry you didn't follow. So let me restate it specifically for this group:

To store 2 timeseries with
    polyLon={{-70, -71},{-62, -63, -64}}
    polyLat={{45, 45.5},{52, 52.1, 52.2}}
Do this:

Use this CDL:
    dimensions:
      timeseries = 2;
      nNodes = 5;
    int timeseries_stop(timeseries);
      :contiguous_ragged_dimension = "nNodes" ;
    double polyLon(nNodes);
    double polyLat(nNodes);
The size of the vlen sub arrays is {2, 3}
so timeseries_stop would have {2, 5} which is the stop index (1 past the last index) for each vlen group. Stated another way: this array has the cumulative size of the vlen groups.
    polyLon would have {-70, -71, -62, -63, -64}
    polyLat would have {45, 45.5, 52, 52.1, 52.2}
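From the writer's side, the same example can be sketched as a round trip: flatten the ragged groups into one vector plus stop indices, then split them again. The function names are hypothetical; the values are the ones given above.

```python
def flatten_with_stops(groups):
    """Flatten ragged groups into one vector plus cumulative stop indices."""
    flat, stops = [], []
    for group in groups:
        flat.extend(group)
        stops.append(len(flat))   # 1 past the last index of this group
    return flat, stops


def split_by_stops(values, stops):
    """Rebuild the ragged groups from the flat vector and the stop indices."""
    groups, start = [], 0
    for stop in stops:
        groups.append(values[start:stop])
        start = stop
    return groups


lon_groups = [[-70, -71], [-62, -63, -64]]
lat_groups = [[45, 45.5], [52, 52.1, 52.2]]

polyLon, timeseries_stop = flatten_with_stops(lon_groups)
polyLat, lat_stop = flatten_with_stops(lat_groups)

lon_round_trip = split_by_stops(polyLon, timeseries_stop)
lat_round_trip = split_by_stops(polyLat, timeseries_stop)
```

Because polyLon and polyLat share the nNodes dimension, one timeseries_stop variable is enough to split both.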

Does that make sense?

Sincerely,

Bob Simons IT Specialist Environmental Research Division NOAA Southwest Fisheries Science Center 99 Pacific St., Suite 255A (New!) Monterey, CA 93940 (New!) Phone: (831)333-9878 (New!) Fax: (831)648-8440 Email: bob.simons@noaa.gov

The contents of this message are mine personally and do not necessarily reflect any position of the Government or the National Oceanic and Atmospheric Administration. <>< <>< <>< <>< <>< <>< <>< <>< <><

twhiteaker commented 8 years ago

@BobSimons I like this approach and appreciate you keeping the bigger picture of a general nc3 vlen solution within our perspective. I also find the CF CRA approach of indicating counts rather than stop index values cumbersome. I'd be interested in hearing what folks who actually use CRAs think about your approach.

BobSimons commented 8 years ago

I am guessing that from the file writer standpoint, 'count' is easier to understand.

I'm one of those people who actually reads these files (in ERDDAP). The only difference for the reader is:

- With counts, to get one vlen group or many, you have to read the entire array, convert it to the cumulative stop indices, then read the data.
- With stops, you wouldn't have to read the entire array to get one specific vlen group. That's the advantage of 'stop'. When the array is small, it doesn't matter much. When the array is huge, then it will matter some.
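The difference in read cost can be sketched with a small stand-in for an on-disk variable that records how many elements are read. The CountingArray class and both helper functions are invented for this illustration. The sketch reads counts only up to the requested group, which becomes the entire array for the last group.

```python
class CountingArray:
    """Stand-in for an on-disk variable that records how many elements are read."""

    def __init__(self, data):
        self._data = list(data)
        self.elements_read = 0

    def read(self, start, stop):
        self.elements_read += max(0, stop - start)
        return self._data[start:stop]


def bounds_from_counts(count_var, k):
    """With counts, every count up to and including group k must be read and summed."""
    counts = count_var.read(0, k + 1)
    start = sum(counts[:-1])
    return start, start + counts[-1]


def bounds_from_stops(stop_var, k):
    """With stops, at most two adjacent entries are needed."""
    if k == 0:
        return 0, stop_var.read(0, 1)[0]
    start, stop = stop_var.read(k - 1, k + 1)
    return start, stop


count_var = CountingArray([2, 3, 1, 1, 3, 2])
stop_var = CountingArray([2, 5, 6, 7, 10, 12])

from_counts = bounds_from_counts(count_var, 4)
from_stops = bounds_from_stops(stop_var, 4)
```

Both approaches return the same bounds for group 4, but the count-based lookup reads five index values where the stop-based lookup reads two.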

I'm not adamant about count vs stop. It is a little cumbersome either way, but both of these options are file-size-efficient. If the CF group prefers one or the other, that's fine with me.

bekozi commented 8 years ago

I think I am following more closely now. I would definitely advocate for "stops" over "counts". Stops allow using the indices directly and with the addition of multi-part breaks and hole breaks, the "counts" do not make a lot of sense.

@BobSimons and others, why not store a start and stop? Assuming I understand their usage correctly.

BobSimons commented 8 years ago

The downside to storing just "counts" (or "size"), is that the reader has to read the whole array, convert it to cumulative values, then figure out the start and stop indices.

The downside to storing just stop is that the reader has to do some extra work to figure out the start index.

The downside to storing start+stop is the extra space needed in the file. Some people are always concerned about space and object when it is wasted.

Pick your poison. I prefer storing just "stop".
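The "extra work" for a stop-only index is small: each start is the previous stop, with 0 for the first group. A sketch, with a hypothetical function name and the stop values from the multidimensional example earlier in the thread:

```python
def starts_from_stops(stops):
    """Derive start indices: the first group starts at 0, the rest at the previous stop."""
    return [0] + list(stops[:-1])


stops = [2, 5, 6, 7, 10, 12]
starts = starts_from_stops(stops)

# Counts can be recovered too, so nothing is lost by storing only stops.
counts = [b - a for a, b in zip(starts, stops)]

# Number of index values stored under each option.
index_values_stop_only = len(stops)
index_values_start_stop = len(starts) + len(stops)
```

Storing start and stop doubles the index values (12 instead of 6 here) without adding information.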

bekozi commented 8 years ago

Makes sense. To be clear, the "stop" is only needed for the non-vLen, contiguous ragged array case where node indices are stored as a single vector. With vLen ragged arrays, the "stop" does not really have a use, as the index into the ragged array dimension serves the same purpose.

I am fine with using only "stop" for the CRA case. Having a "start" is superfluous.

I am coming around to the idea of supporting both cases (nc4 ragged array and nc3 contiguous ragged array). I added a ticket for this to be supported as an output option: #18.

twhiteaker commented 8 years ago

I also vote for "only stop" and supporting nc4 ragged and nc3 CRA.

bekozi commented 8 years ago

Moved CRA proposal to: https://github.com/bekozi/netCDF-CF-simple-geometry/wiki/ContinousRaggedArrays. Going to close this for now. I believe we got the "stops" figured out and moved to a ticket. We can edit the proposal in the wiki as needed.

twhiteaker / CFGeom

Document indexing system for geometry coordinates #11