opengeospatial / geoparquet

Specification for storing geospatial vector data (point, line, polygon) in Parquet
https://geoparquet.org
Apache License 2.0

Restrictions on column names? #155

Closed m-mohr closed 1 year ago

m-mohr commented 1 year ago

Are there any restrictions on column names in Parquet/Arrow that we could check for in the schema? Are these for example all valid?

tschaub commented 1 year ago

Column names correspond to field identifiers in the Thrift IDL. See Identifier here: https://github.com/apache/thrift/blob/v0.17.0/doc/specs/idl.md#identifier

So we could require some pattern like ^([A-Z]|[a-z]|_)([A-Z]|[a-z]|[0-9]|\.|_)*$, but I'm not sure that is a good idea. There may be implementations that accept more than these characters (I know there are implementations that accept fewer). And maybe there will be some future version that accepts a wider range of identifiers.
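For illustration only, here is a minimal sketch of what checking names against that pattern could look like. The function name `non_identifier_columns` and the sample names are made up for this example; nothing here is part of the GeoParquet spec or its JSON schema.

```python
import re

# The pattern quoted in the comment above, mirroring the Thrift IDL Identifier rule.
THRIFT_IDENTIFIER = re.compile(r"^([A-Z]|[a-z]|_)([A-Z]|[a-z]|[0-9]|\.|_)*$")


def non_identifier_columns(column_names):
    """Return the names that do not match the Thrift identifier pattern.

    Hypothetical helper: real Parquet implementations may accept more
    (or fewer) characters than this pattern allows.
    """
    # fullmatch avoids accepting a name that ends in a stray newline.
    return [name for name in column_names if not THRIFT_IDENTIFIER.fullmatch(name)]


# Made-up sample names, not taken from the issue.
print(non_identifier_columns(["geometry", "_id", "a.b", "1st", "my col", "név"]))
# prints ['1st', 'my col', 'név']
```

Names with a leading digit, a space, or a non-ASCII letter are flagged, which shows how restrictive the pattern would be in practice.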

If anything, I think a good validator would assert that each geometry column name matches an existing top-level field name, but I think it might be more trouble than it is worth to add JSON schema validation around the identifiers.
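A rough sketch of that validator check, assuming the "geo" file metadata has already been parsed into a dict with the "primary_column" and "columns" keys, and that the top-level field names are available as a list. The function name `missing_geometry_columns` is hypothetical and the sample metadata is invented for the example.

```python
import json


def missing_geometry_columns(geo_metadata, field_names):
    """Return geometry column names that have no matching top-level field.

    Hypothetical helper: geo_metadata is the parsed "geo" JSON object,
    field_names is the list of top-level field names in the file schema.
    """
    declared = set(geo_metadata.get("columns", {}))
    primary = geo_metadata.get("primary_column")
    if primary is not None:
        declared.add(primary)
    # Anything declared in the metadata but absent from the schema is a problem.
    return sorted(declared - set(field_names))


# Invented metadata for illustration; "centroid" is declared but not in the schema.
geo = json.loads(
    '{"primary_column": "geometry", "columns": {"geometry": {}, "centroid": {}}}'
)
print(missing_geometry_columns(geo, ["id", "geometry"]))
# prints ['centroid']
```

The check is an exact string comparison against the schema, so it does not depend on any character restrictions for the names themselves.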