opengeospatial / geoparquet

Specification for storing geospatial vector data (point, line, polygon) in Parquet
https://geoparquet.org
Apache License 2.0

Introduce bounding box column definition #191

Closed jwass closed 8 months ago

jwass commented 1 year ago

Introduces the per-row bounding box definition following the discussion from https://github.com/opengeospatial/geoparquet/discussions/188.

This initial proposal goes with a definition that looks like:

"geometry": {
  "encoding": "WKB",
  "geometry_types": [],
  "bbox": [-71.1, 42.3, -70.9, 42.5],
  "geometry_bbox": {"column": "bbox"}
}
jwass commented 1 year ago

How do people feel about the bbox column field names xmin, xmax, ymin, ymax?

When putting this together I noticed that geopandas gdf.bounds uses minx, maxx, miny, maxy. @paleolimbot made a good argument that we use the xmin style in other places in GeoParquet so should stick with that, but is worth calling out here.

jorisvandenbossche commented 1 year ago

A bounding box column MUST be a Parquet struct

It's worth checking the Parquet specification about what they call a "struct". I believe there's something about "definition levels" involved and that struct might be the Arrow term for the (far more user-friendly) way to conceptualize it.

Parquet indeed does not speak about "struct" columns, and this is unfortunately one of those areas that is completely underdocumented. I think the relevant term is "group", but in https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types that is only used to explain LIST and MAP logical type.

When translating an Arrow struct column, I get the following Parquet schema:

import pyarrow as pa
struct_arr = pa.StructArray.from_arrays([[1, 2]]*4, names=["xmin", "xmax", "ymin", "ymax"])
table = pa.table({"bbox": struct_arr})

import pyarrow.parquet as pq
pq.write_table(table, "test_bbox_schema.parquet")
pq.read_metadata("test_bbox_schema.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x7f02f01dac40>
required group field_id=-1 schema {
  optional group field_id=-1 bbox {
    optional int64 field_id=-1 xmin;
    optional int64 field_id=-1 xmax;
    optional int64 field_id=-1 ymin;
    optional int64 field_id=-1 ymax;
  }
}

So something like "a group field with 4 child fields with names ..."?

jwass commented 1 year ago

I do think that nesting the "covering" or "simplified_geometry" or "auxiliary_columns" or "something_better" is a more future-proof option: this proposal is for a single struct column that represents the minimum bounding rectangle, but we might need other encodings (or multiple columns for one encoding) and I think it will better communicate that those concepts are related if they are grouped in the same object.

Sounds good. I just made the updates to go to that original proposal with the "covering" section. I admit to not having a better word so let's stick with it for now.

A bounding box column MUST be a Parquet struct

It's worth checking the Parquet specification about what they call a "struct". I believe there's something about "definition levels" involved and that struct might be the Arrow term for the (far more user-friendly) way to conceptualize it.

Good call. I went with @jorisvandenbossche's language here. And I agree, the Parquet docs/website are really lacking on this.

I think a struct is the best way to go here, but it is worth noting that another option would be to refer to top-level columns. The only reason to do this is that some Parquet scanners can't leverage column statistics for struct columns (although this might not matter for Parquet scanners that will be using this column). This should get revealed in our testing of the bbox column concept!

Agreed. We can change as necessary depending on the outcome of that testing.

For three dimensions the additional fields zmin and zmax MUST be present.

In GeoPackage the dimensions of the envelope are independent of the geometry...while some indexes care about 3D, most indexes only care about 2D and it might be an attractive option to only ever write an XY bounding box here.

Thanks. I changed the language to read zmin and zmax MAY be present but not required.

jwass commented 1 year ago

So something like "a group field with 4 child fields with names ..."?

@jorisvandenbossche Thanks! I made some updates and went with this wording.

rouault commented 1 year ago

I'm just thinking that the use of a struct/group with child fields could potentially be a complication for some implementations. I mean the "flat" model is quite a common one among geospatial formats. I struggled a bit to deal with Parquet nested construct when mapping that to the OGR data model. I've bitten that bullet now, so the current proposal is perfectly fine to me. Just wanted to raise that potential issue for other "simple" hypothetical implementations. But perhaps dealing with Parquet means that dealing with nested constructs is an implicit requirement.

kylebarron commented 1 year ago

What about specifically listing the names of the paths in the schema:

"covering": {
  "box": {
    "xmin": "bbox.minx",
    "ymin": "bbox.miny",
    "xmax": "bbox.maxx",
    "ymax": "bbox.maxy",
  }
}

Then it can be straightforward to either use dot-access notation as above, or to refer to columns at the top-level:

"covering": {
  "box": {
    "xmin": "minx",
    "ymin": "miny",
    "xmax": "maxx",
    "ymax": "maxy",
  }
}

This also means that a struct point column could be defined using something like

"covering": {
  "box": {
    "xmin": "geometry.x",
    "ymin": "geometry.y",
    "xmax": "geometry.x",
    "ymax": "geometry.y",
  }
}

And a geoarrow linestring column could be defined using something like

"covering": {
  "box": {
    "xmin": "geometry.list.element.list.element.x",
    "ymin": "geometry.list.element.list.element.y",
    "xmax": "geometry.list.element.list.element.x",
    "ymax": "geometry.list.element.list.element.y",
  }
}

(The exact syntax is up for future discussion; that's what pyarrow.parquet shows for the value of path_in_schema for parquet metadata of a geoarrow linestring file, tested with this file)
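
For reference, a minimal pyarrow sketch that would print the per-column-chunk metadata shown below (the local filename here is hypothetical; any GeoArrow-encoded linestring file should look similar):

import pyarrow.parquet as pq

# Hypothetical local copy of the linestring test file referenced above.
meta = pq.read_metadata("geoarrow_linestring.parquet")
rg = meta.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    # Only the coordinate leaves carry the x/y statistics of interest here.
    if col.path_in_schema.endswith((".x", ".y")):
        print(col)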

<pyarrow._parquet.ColumnChunkMetaData object at 0x114d2a1b0>
  file_offset: 509677
  file_path:
  physical_type: DOUBLE
  num_values: 49494
  path_in_schema: geometry.list.element.list.element.x
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x106814270>
      has_min_max: True
      min: 215869.21669693943
      max: 781793.0855915633
      null_count: 0
      distinct_count: None
      num_values: 49494
      physical_type: DOUBLE
      logical_type: None
      converted_type (legacy): NONE
  compression: SNAPPY
  encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
  has_dictionary_page: True
  dictionary_page_offset: 16382
  data_page_offset: 409875
  total_compressed_size: 493295
  total_uncompressed_size: 493604
<pyarrow._parquet.ColumnChunkMetaData object at 0x114cfe7a0>
  file_offset: 1002855
  file_path:
  physical_type: DOUBLE
  num_values: 49494
  path_in_schema: geometry.list.element.list.element.y
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x114d2b1f0>
      has_min_max: True
      min: 4790520.432107779
      max: 5237313.334385211
      null_count: 0
      distinct_count: None
      num_values: 49494
      physical_type: DOUBLE
      logical_type: None
      converted_type (legacy): NONE
  compression: SNAPPY
  encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
  has_dictionary_page: True
  dictionary_page_offset: 509812
  data_page_offset: 903053
  total_compressed_size: 493043
  total_uncompressed_size: 493596
jorisvandenbossche commented 1 year ago

And a geoarrow linestring column could be defined using something like

Do we want to enable this? Because that gives different abilities. I would expect this covering to be actual min/max values, while for geoarrow that would be a list of values. While for row group statistics this will work the same, that is not the case for plain row filtering.

kylebarron commented 1 year ago

And a geoarrow linestring column could be defined using something like

Do we want to enable this? Because that gives different abilities. I would expect this covering to be actual min/max values, while for geoarrow that would be a list of values. While for row group statistics this will work the same, that is not the case for plain row filtering.

I think the question is whether covering: box is intended to be used for more than just predicate pushdown. I had assumed that covering: box was only intended to enable statistics in the row group metadata, but that readers should not assume how to use the actual data within the column.

Given that creating bounding boxes from a geometry column (once in memory) should be very fast, I think the main goal should be solidifying a flexible way to describe the MBR of the chunk in the row group metadata, without defining what the values of that column must be. I like the above because it's flexible between struct and top-level columns while also being future-proof to a geoarrow encoding.

(Though to be clear, I'm not arguing for this PR to explicitly allow geoarrow; for now we can only document top-level columns and a struct column)

jwass commented 1 year ago

This also means that a struct point column could be defined using something like

"covering": {
  "box": {
    "xmin": "geometry.x",
    "ymin": "geometry.y",
    "xmax": "geometry.x",
    "ymax": "geometry.y",
  }
}

One nice outcome of this approach is it avoids the duplication (triplication?) of point coordinates that @rouault raised in https://github.com/opengeospatial/geoparquet/discussions/188#discussioncomment-7569062.

Although I think I share in @jorisvandenbossche's concern that allowing it to point to non-primitive float columns makes simple row filtering more complicated.

rouault commented 1 year ago

Although I think I share in @jorisvandenbossche's concern that allowing it to point to non-primitive float columns makes simple row filtering more complicated.

agreed on that: I would expect the content of the xmin/... field to be a scalar to KISS

Maxxen commented 1 year ago

Just chiming in with my 2c. DuckDB can't do predicate pushdown into struct columns yet (hopefully someday!), and I imagine other simpler readers may also struggle to fully utilize nested types, so I would think the "flat" representation is the most practical/has the lowest barrier to entry. Although from a conceptual standpoint it's kind of a bummer because I feel like this is the perfect use case for structs :/.
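
For illustration, a sketch of what the flat layout buys such readers (the file and column names here are hypothetical, and whether row groups are actually skipped depends on the engine's Parquet reader): the bbox columns can be filtered directly in SQL, so ordinary min/max statistics pruning can apply.

import duckdb

# Hypothetical file with flat xmin/xmax/ymin/ymax columns alongside the geometry.
con = duckdb.connect()
result = con.execute(
    """
    SELECT *
    FROM read_parquet('buildings.geoparquet')
    WHERE xmin <= -70.9 AND xmax >= -71.1
      AND ymin <= 42.5 AND ymax >= 42.3  -- query window intersection test
    """
).fetch_arrow_table()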

migurski commented 1 year ago

Just popping in with a 👍 to this proposal. GDAL’s new support for column stats in 3.8.0 combined with spatially-clustered rows and relatively-small row groups has shown excellent performance in my tests with country-scale OSM datasets.

I hadn’t seen the work on structs here so I was experimenting with separate xmin/ymin/xmax/ymax columns. They seem to be equivalent for read performance.

cholmes commented 1 year ago

What about specifically listing the names of the paths in the schema

I like the upside of being able to handle both structs and top level columns, and handling points without duplication would definitely be a win.

But I worry about the complexity - a reader would need to at least know about all the options, and if it's not capable of predicate pushdown it would have to know that some fields wouldn't work. And ideally a reader would notify the user in some way that it's not working / implemented. And writer implementations would have to make choices about how to do it, and either push that decision onto the user or else we recommend a 'default'. So if we just allow any field to be specified, it feels like we punt on the question of which approach to take and push the complexity out to the ecosystem. And it'd be harder to just look at a parquet file and know from the data alone whether it has a bbox - if we have one set approach with fixed field names, then you can be pretty sure that a dataset with those fields is implementing the spatial optimization.

So I think I lean towards first picking one way and seeing if it will work for everyone. But it could be interesting to use this way to specify that one way if we want to future proof it a bit - like we list the path names, but we align on just having one locked in value for the first release, like how we did the geometry encoding. Maybe that's what you were proposing? I wasn't sure if the idea was people could specify any field they wanted to serve as the box, or if it'd just be a way to specify a set of options.

cholmes commented 1 year ago

One thing that I do think should accompany this is some recommendations on how to actually use this construct to accelerate spatial querying. Like talk through different indexing options, explain a bit about how you need to order the file and how the row group stats work. I think this would likely fit best in some sort of 'best practice' document that sits alongside the spec. And I don't think it needs to be part of this PR, but would be good to have in place for a first 1.1-beta release. I'd be up to take a first crack at it, but would be good for others who are deeper into spatial indexing and parquet to help out.

paleolimbot commented 1 year ago

It seems like top-level columns are the only way for this proposal to fulfill its goal (allow GeoParquet files to be written such that querying a large file for a small area does not have awful performance) for at least one engine we care about (DuckDB)? The other engine I know can't do this is cudf (less of a priority perhaps). It is a bummer because a struct column is definitely the right concept for this.

That would mean:

"covering": {
  "box": {
    "xmin": "name_of_top_level_xmin",
    "ymin": "name_of_top_level_ymin",
    "xmax": "name_of_top_level_xmax",
    "ymax": "name_of_top_level_ymax"
  }
}

A future version of the spec could allow nesting with "xmin": ["name_of_top_level_column", "xmin"] (there's no guarantee that a column name does not contain a period). I don't think we should start by allowing both (or maybe we will never get to a place where we allow both).

jwass commented 1 year ago

It seems like top-level columns are the only way for this proposal to fulfill its goal (allow GeoParquet files to be written such that querying a large file for a small area does not have awful performance) for at least one engine we care about (DuckDB)? The other engine I know can't do this is cudf (less of a priority perhaps).

Agreed. We had discussed surveying existing Parquet readers prior to making a decision, but it feels like we already know the answer without having to do more investigation...?

As @paleolimbot said, we could go with @kylebarron's suggestion of having the column specified per box element, but the spec could enforce that they MUST point to fields in the root of the schema. We always have the option later to relax that constraint without breaking backward compatibility.

Separately, I like the idea of the increased flexibility but wonder if it allows too many personal preferences to exist in the ecosystem... Someone could write:

"covering": {
  "box": {
    "xmin": "minx",
    "ymin": "miny",
    "xmax": "maxx",
    "ymax": "maxy"
  }
}

because they prefer minx over xmin. I think I'd find the competing opinions annoying if many different ones are out there. I guess we could just have a best practice recommendation of how to define these. If we want to recommend plain xmin/... we'd need a suggestion for multiple geometries too.

paleolimbot commented 1 year ago

If we want to recommend plain xmin/... we'd need a suggestion for multiple geometries too.

Perhaps a prefix? <geometry column name>_xmin, <geometry column name>_ymin, etc?

kylebarron commented 1 year ago

Someone could write ... because they prefer minx over xmin

Is that a problem? I don't think readers should be assuming anything about the column names except what is defined in the metadata. If someone wants minx, that seems totally fine as long as the metadata is correctly specified.

jwass commented 12 months ago

Is that a problem? I don't think readers should be assuming anything about the column names except what is defined in the metadata. If someone wants minx, that seems totally fine as long as the metadata is correctly specified.

As someone that spends a lot of time writing SQL, the real interface of a Parquet file for me is the table schema's columns and data types because most SQL query engines can't read the GeoParquet metadata - at least not for a while. If I want to reuse queries across different GeoParquet datasets, they'll have to be dynamic over column names which is a pain especially if determining the columns requires reading the data before I know how to query it.

I guess in my opinion, the more standardized the Parquet schema is - column names and types - and the less flexible the metadata allows it to be, the easier GeoParquet will be to work with and the more successful it will be. This is a very similar discussion to https://github.com/opengeospatial/geoparquet/discussions/169 as well. The broader difference of philosophy is beyond the scope of this PR but a fun one that I think will keep coming up :)

Either way, I think Chris's proposal to constrain the allowed values could make this work and future-proof to make it more open-ended later.

jwass commented 12 months ago

Thanks everyone for the feedback and good discussion. Here's my read on what would be a good path forward:

Specify every bounding box column

We can adopt @kylebarron's idea to define the bounding box like this:

"covering": {
  "box": {
    "xmin": "xmin",
    "ymin": "ymin",
    "xmax": "xmax",
    "ymax": "ymax"
  }
}

Constraints on the initial release that can be relaxed later

Top-Level Columns

Since duckdb and others don't do predicate pushdown on nested columns, the initial definition should require the box columns to be at the schema root. Otherwise we'll be negating most performance benefits for some systems.

Restricting Column Names

Per @cholmes's comment, we can start by restricting the column names and pinning them to be xmin, xmax, etc. initially. This will keep GeoParquet datasets consistent in their schemas for some time, but give us some future-proofing to allow nested columns (bbox.xmin) or arbitrary column names ({"xmin": "minx", ...}") in the future without having to redesign the box spec presumably.

What to do about files with multiple geometries?

We can't require the same names for a different geometry column. I like @paleolimbot's idea of constraining the names to be a prefix that is the geometry column name + "_". So the bbox columns for my_other_geom are my_other_geom_xmin, my_other_geom_ymin, etc. But I also like the idea of the primary geometry's bbox just being called xmin rather than e.g. geometry_xmin. In other words, we could say the primary geometry column's prefix is the empty string, but all others are column_name_.

Thoughts? I can make the PR changes soon if there's agreement on the direction here.

tschaub commented 12 months ago

@jwass - I'm not sure if this is what you are suggesting as well, but what about making no changes to the geo metadata (yet) and only saying that if a GeoParquet file has float columns named like <geometry-column>_xmin (etc), those may be treated as values representing the geometry bounds. This would allow us to later add geo metadata that pointed to other columns serving the same purpose (and to potentially have other covering types).

jwass commented 12 months ago

I'm not sure if this is what you are suggesting as well, but what about making no changes to the geo metadata (yet) and only saying that if a GeoParquet file has float columns name like <geometry-column>_xmin (etc), those may be treated as values representing the geometry bounds.

@tschaub That's not quite what I was suggesting since I am proposing adding this to the metadata. That way the systems that can interact with the metadata can still find it in a definitive way. We could define the bounding box more as convention for now without touching the geo metadata yet, and/or add it to the compatible parquet spec. But it seemed like most folks wanted it in there.

tschaub commented 12 months ago

If we don't make a change to the geo metadata, it would give us a chance to actually demonstrate that the min/max columns have some value in real use cases - without needing a breaking release if we find out that the initial proposal wasn't quite right.

migurski commented 12 months ago

The min/max columns create a single-level spatial index and are absolutely useful in selection use cases. This example shows a 44MB Geoparquet file for the country of Ghana with only data for the town of Kumasi selected:

https://nbviewer.org/urls/gist.githubusercontent.com/migurski/e88e6d34d8686ed54c8a3ed158061409/raw/3ba8640c38f4a87574e976000450d5e17ade4c17/index.ipynb

Preview, showing in-Kumasi row groups in blue vs. non-Kumasi row groups in gray:

[screenshot]

With row groups of 100 features and everything sorted by quadkey of one feature corner we use the column stats alone to identify 106 row groups out of 1,001 total that must be read. This saves almost 90% of required reads. I did some tests with other space-filling curves and they don’t make a substantial difference. It’s the min/max column stats and the clustering alone that do it.

For larger source files this would be a requirement to use the network efficiently. Worldwide OSM road and building data can be dozens or hundreds of GB in size and I wouldn’t be able to afford a table scan to efficiently query a region.
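
As a rough sketch of that kind of row-group pruning with pyarrow (assuming flat xmin/xmax/ymin/ymax columns; the file name and query window below are hypothetical), the candidate row groups can be chosen from the Parquet statistics alone before reading any data:

import pyarrow.parquet as pq

# Hypothetical query window and file name.
qxmin, qymin, qxmax, qymax = -1.70, 6.60, -1.50, 6.75
pf = pq.ParquetFile("ghana.geoparquet")

keep = []
for rg in range(pf.metadata.num_row_groups):
    row_group = pf.metadata.row_group(rg)
    # Map each column chunk's path to its statistics for this row group.
    stats = {
        row_group.column(i).path_in_schema: row_group.column(i).statistics
        for i in range(row_group.num_columns)
    }
    # A row group can contain intersecting features only if its bbox stats overlap the window.
    if (stats["xmin"].min <= qxmax and stats["xmax"].max >= qxmin
            and stats["ymin"].min <= qymax and stats["ymax"].max >= qymin):
        keep.append(rg)

table = pf.read_row_groups(keep)  # read only the candidate row groups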

paleolimbot commented 12 months ago

In other words, we could say the primary geometry column's prefix is the empty string, but all others are column_name_.

I think it may be more straightforward if the suggestion is that all top-level column names are prefixed (but I'm not opposed to dropping the prefix for the primary geometry column, and because it would just be a suggestion, it could be updated in either direction if we find good reason to do so). My ever so slight preference is because I think an automatic query rewriter (e.g., Sedona, GeoAlchemy) might have an easier time transforming st_intersects(geometry, <some geometry literal>) into geometry_xmin >= <some literal> AND ... AND st_intersects(geometry, <some geometry literal>) in the absence of metadata.

but what about making no changes to the geo metadata (yet) and only saying that if a GeoParquet file has float columns named like <geometry-column>_xmin (etc), those may be treated as values representing the geometry bounds.

I think that adding this to the metadata is the right way to do this. Even if we didn't touch the metadata but added the column name convention to the specification, it would still be a breaking change to update that (implementations would have to change how or if they are able to accelerate a partial read of a GeoParquet file). I think we very much want adopters of GeoParquet to leverage what is in this proposal with confidence.

it would give us a chance to actually demonstrate that the min/max columns have some value in real use cases

I think we are all on board that we should demonstrate at least a few examples of accelerated partial reads before merging this! It will take me a little while to get there but I can attempt a POC in geoarrow-python's reader (leveraging pyarrow's Parquet scanner with column statistics support).

jwass commented 12 months ago

If we don't make a change to the geo metadata, it would give us a chance to actually demonstrate that the min/max columns have some value in real use cases

I think the performance improvements from including bounding box columns have been demonstrated already. One of the earliest examples I saw is Eugene Cheipesh's presentation: https://www.youtube.com/watch?v=uNQrwMMn1jk&t=585s (this link takes you to the results section of his presentation), where he shows that a Parquet scan of his data takes 4 minutes, but spatially ordering the data and including bounding box columns brings the query down to 1.5 seconds.

And just also saw @migurski's comment above.

To @cholmes's earlier point, I think this could come along with some more formal profiling, best practices, tooling, etc.

cholmes commented 12 months ago

it would give us a chance to actually demonstrate that the min/max columns have some value in real use cases

I think we are all on board that we should demonstrate at least a few examples of accelerated partial reads before merging this!

This may be better for a live discussion at the next meeting, but I think we need to figure out our release tactics. I see two options:

1) Merge this soon (like within 1-2 weeks), but call it 1.1-beta.1, and put clear caveats that this may change, it's early adopter time, so give us feedback but we reserve the right to change things.

2) Leave this as a PR and verify on like 5+ implementations that this works, it accelerates things, everything can deal with it. And then I think when we merge we could call it version 1.1

I had been imagining #1, and with that route I do like Tim's suggestion of not actually updating the geo metadata yet - like for the beta.1 release. But if we're just leaving it as a PR in testing then I think having the geo metadata in could make sense, as I agree that the metadata really is the spec (but that the format should 'degrade' nicely and be easy to interpret without the metadata). I definitely don't feel strongly between the two, but thought that #1 might get a bit more attention from the cloud data warehouse vendors who don't show up at the meetings / engage on GitHub.

It does sound like I'm a bit less ready to give in to the 4 top level columns vs. a struct. I'd like to at least figure out what the level of effort is for DuckDB to support the pushdown, and to see if we can get sufficient funding (I sent an email to Max and Hannes to ask). If it's a no go for one or two cloud vendors I'm also totally fine to give it up. But it just feels so much cleaner to just add one new column vs 4, and everything will need the predicate pushdown functionality to be able to support GeoArrow encoding well.

tschaub commented 12 months ago

To be clear, I don't doubt that statistics on bounds will be useful. I was just suggesting that we could avoid breaking changes by specifying as little as possible and letting real world best practices inform the spec. (Bounds stats are clearly useful, but we don't necessarily know if people will want control over nesting etc.)

Maxxen commented 12 months ago

I'm also inclined to maybe wait a bit, I don't want the standard to be damned just because DuckDB has some issues (DuckDB does not even implement the 1.0 spec yet).

But honestly, at this time I'm just not familiar enough with the in-depth details of the Parquet format to really say whether the lack of struct predicate pushdown is a temporary embarrassment in DuckDB's implementation or something that is significantly harder to implement compared to the top-level columns case, to the point that it would hamper future adoption. I need to find the time to look into this in our implementation, but I guess what I'd really like to see is some sort of survey/input from additional vendors.

From this thread, my understanding of the struct-predicate-pushdown situation currently is:

But what other Parquet reader implementations are there, and what capabilities do they have?

kylebarron commented 12 months ago

I asked on discord and DataFusion is not currently able to query into a nested struct column.

Maxxen commented 10 months ago

So a little bit of an update: I have been working on getting struct filter pushdown into DuckDB and have a PR up here. This implementation is very limited but should be sufficient to support the struct-column proposal discussed here. E.g. filters are only pushed down "one level", and only for constant comparisons. It might require a couple of rounds of back-and-forth before it gets merged, but should hopefully make it into the next DuckDB release, scheduled in a couple of weeks.

So to provide some additional input to the discussion: after getting more exposure to the internals of Parquet and some experience actually implementing this, I don't think that storing the bounding box as a struct is going to make it significantly more difficult for implementers. Since all "top level" columns are effectively children of a single root struct, if you can read/filter a top-level column it should be relatively straightforward to extend a column reader to handle nested structs by recursion.
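
Seen from the metadata side, the same point can be illustrated with pyarrow (a sketch reusing the test_bbox_schema.parquet file written earlier in this thread): the struct's leaves show up as ordinary column chunks whose paths are simply prefixed with the struct name.

import pyarrow.parquet as pq

meta = pq.read_metadata("test_bbox_schema.parquet")
rg = meta.row_group(0)
# Leaf columns under the bbox struct, identified by their dotted paths.
bbox_leaves = [
    rg.column(i).path_in_schema
    for i in range(rg.num_columns)
    if rg.column(i).path_in_schema.startswith("bbox.")
]
print(bbox_leaves)  # ['bbox.xmin', 'bbox.xmax', 'bbox.ymin', 'bbox.ymax']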

jwass commented 10 months ago

So a little bit of an update, I have been working on getting struct filter pushdown into DuckDB and have a PR up here.

This is great news! Thanks for the update!

jwass commented 10 months ago

In light of @Maxxen's note above and the initial results of our survey in https://github.com/opengeospatial/geoparquet/discussions/192 I'd like to resurrect this to push forward with the bounding box concept.

jwass commented 10 months ago

Curious to hear some specific thoughts on:

1 - agreeing to move forward with this bounding box "struct" definition with fields xmin, xmax, ymin, ymax
2 - the current PR spec definition of "covering": { "box": { "column": "bbox" }}

Assuming general alignment here, some potential objections: For 1 above, should we keep these field names or switch to minx, maxx, miny, maxy? For 2, should we refactor to @kylebarron's idea of

"covering": {
  "box": {
    "xmin": "bbox.xmin",
    "ymin": "bbox.ymin",
    "xmax": "bbox.xmax",
    "ymax": "bbox.ymax",
  }
}

and "pin" the values initially to these constants or at least something that ends in ".xmin", ... . This might enable more flexibility in the future.

Thoughts? @cholmes @paleolimbot @kylebarron @jorisvandenbossche

cholmes commented 10 months ago

+1 on moving forward with bounding box struct - thanks @jwass for leading the charge on the survey on #192, and thanks to everyone who contributed. Seems like most everyone can or will soon support it.

And I'm +1 to refactoring to Kyle's idea, but pinning to the definition now, for more flexibility in the future. I like establishing one clear way to do it, that establishes a convention where geo metadata isn't required, but once we get every reader understanding the 'geo' metadata then we'll have the ability to try to define other 'paths' without making it a breaking change.

jwass commented 10 months ago

PR is now updated to reflect @kylebarron's idea:

"covering": {
  "bbox": {
    "xmin": "bbox.xmin",
    "ymin": "bbox.ymin",
    "xmax": "bbox.xmax",
    "ymax": "bbox.ymax",
  }
}

Note - the JSON schema requires the fields to end in .xmin, .ymin, etc. However, I don't believe there's a way in JSON schema to require that they all have the same prefixed column name. Instead it's just written in the spec description and will have to be validated separately.
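
A sketch of that separate validation step (a hypothetical helper, assuming the dotted-path form used in this revision of the PR):

def check_covering_bbox(covering):
    """Check that each covering path ends in its key and that all share one column prefix."""
    bbox = covering["bbox"]
    prefixes = set()
    for key in ("xmin", "ymin", "xmax", "ymax"):
        path = bbox[key]
        prefix, _, leaf = path.rpartition(".")
        if leaf != key:
            raise ValueError(f"{key} must end in '.{key}', got {path!r}")
        prefixes.add(prefix)
    if len(prefixes) != 1:
        raise ValueError("all covering paths must point into the same column")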

Also - I renamed the field under "covering" from "box" to "bbox" since it now more closely resembles the bbox field.

jorisvandenbossche commented 10 months ago

PR is now updated to reflect @kylebarron's idea:

Do we still need this explicitness if we now decided to go with the struct columns (instead of allowing top-level columns)? Because I think @kylebarron's idea was mostly to allow the flexibility of specifying whatever column you wanted?

cholmes commented 10 months ago

Do we still need this explicitness if we now decided to go with the struct columns (instead of allowing top-level columns)? Because I think @kylebarron's idea was mostly to allow the flexibility of specifying whatever column you wanted?

We discussed this in the GeoParquet community meeting today. Agreed that we don't really need the flexibility now, but that there's little downside to being explicit here, and it opens up the flexibility in the future.

jwass commented 10 months ago

My only question is what happens when there is a period in column_name. I don't think it's hard to imagine that it would live with the lefthand side (i.e., not one of the struct fields), but also not particularly important to solve right now!

@paleolimbot This came up today and @jorisvandenbossche mentioned it might be common in R. My guess is in some SQL engines you'd refer to a column like that with quotes like:

SELECT
    "my.column.with.periods".xmin
FROM t

We could require something similar, such that periods denote nested groups but, when within quotes, are just part of the column name. The quotes would just have to be properly escaped in the JSON.

jorisvandenbossche commented 10 months ago

A more robust way to specify a "nested path" is to specify each part of the path as a separate string. In Python we would naturally use a tuple for that, like ("column_name", "xmin"). In JSON that translates to an array; I don't know if that would be strange to use here.

I was quickly looking into how I would implement this bbox filter with pyarrow, and there I will actually need to convert the dotted path we have here to a tuple or list of path parts, because to specify a filter when reading I can do something like filter=(pc.field(("bbox", "xmin")) > xx) & (..) & ... The current pc.field(..) helper function to specify a field doesn't support a dotted path, but instead requires you to pass the parts explicitly (which makes it possible to specify a top-level field name that contains dots). We could of course change this in pyarrow, but I also like the explicitness.

That means that I would need to split the dotted path. A simple bbox_metadata["xmin"].split(".") would work in most cases, but that would not work when there are quoted sub-paths inside the string. Supporting something like "\"my.column.with.periods\".xmin" would make this quite a bit more complicated.
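
To make that concrete, a sketch of building a pyarrow filter from covering metadata in the array-of-parts form suggested above (file name and query window are hypothetical, and it assumes a pyarrow version where read_table accepts a compute Expression for filters):

import pyarrow.compute as pc
import pyarrow.parquet as pq

# Hypothetical covering metadata using the array-of-parts form.
covering_bbox = {
    "xmin": ["bbox", "xmin"], "ymin": ["bbox", "ymin"],
    "xmax": ["bbox", "xmax"], "ymax": ["bbox", "ymax"],
}
qxmin, qymin, qxmax, qymax = -71.1, 42.3, -70.9, 42.5  # query window

# pc.field accepts the path parts as a tuple, so no dotted-string splitting is needed.
bbox_filter = (
    (pc.field(tuple(covering_bbox["xmin"])) <= qxmax)
    & (pc.field(tuple(covering_bbox["xmax"])) >= qxmin)
    & (pc.field(tuple(covering_bbox["ymin"])) <= qymax)
    & (pc.field(tuple(covering_bbox["ymax"])) >= qymin)
)
table = pq.read_table("data.geoparquet", filters=bbox_filter)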

rouault commented 10 months ago

In JSON that translates to an array, I don't know if that would be strange to use here.

I was also thinking of that.

Or, as we want all the xmin/ymin/xmax/ymax to be subfields of the same field:

"covering": {
  "bbox": {
    "struct_field": "bbox",
    "xmin_subfield": "xmin",
    "ymin_subfield": "ymin",
    "xmax_subfield": "xmax",
    "ymax_subfield": "ymax"
  }
}
jwass commented 10 months ago

If we want to be the most future-proof, maybe we should just go with the array:

"covering": {
  "bbox": {
    "xmin": ["bbox", "xmin"],
    "ymin": ["bbox", "ymin"],
    "xmax": ["bbox", "xmax"],
    "ymax": ["bbox", "ymax"],
  }
}

Kind of annoying, but it should make things much easier for readers that need to find the right field in the schema, as @jorisvandenbossche pointed out. We can constrain the JSON schema so that each array has length 2 and fix the final element, which gives us plenty of flexibility to relax those constraints later.

I'm good with it.

jwass commented 10 months ago

Updated to the new format as described above. @jorisvandenbossche I re-requested a review just to make sure this is in line with your thinking here.

rouault commented 10 months ago

FYI, I've a GDAL implementation of this in https://github.com/OSGeo/gdal/pull/9185

cholmes commented 9 months ago

Can we merge this in? Any remaining work to do? If I don't hear in a day or two I'm going to hit the green button.