planetlabs / gpq

Utility for working with GeoParquet
https://planetlabs.github.io/gpq/
Apache License 2.0

Error when doing describe / validate with non-geo parquet file #86

Closed cholmes closed 9 months ago

cholmes commented 9 months ago

I'm trying to figure out whether the parquet file at https://open.quiltdata.com/b/spatial-ucr/tree/census/administrative/counties.parquet is valid GeoParquet. Running 'describe' and 'validate' both fail with:

gpq: error: command.ValidateCmd.Run(): unable to parse geo metadata: json: cannot unmarshal string into Go struct field GeometryColumn.columns.crs of type geoparquet.Proj

Note that 'convert' works fine, and the converted file can then be described and validated.

tschaub commented 9 months ago

I updated a few things to try to avoid bad error messages like that. The issue is that the file's geo metadata has a string crs instead of an object (see below for more detail). Previously, the crs rule did not treat this as a "fatal" error, so additional rules would still try to run. Those later rules require valid metadata, so the validate command would fail in an ugly way when it reached them. I've updated the crs rule (and a couple of other rules) to treat these unexpected types as fatal errors. Ideally, we run as many rules as possible to give a better idea of what can be fixed in a file, but in cases like this we have to stop running additional rules, since they depend on being able to parse a complete geo metadata struct.
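For readers who want to picture the fatal/non-fatal distinction, here is a minimal sketch in Go. It is not gpq's actual code: the `rule`, `result`, `fatalError`, `runRules`, and `demoRules` names are all hypothetical, the metadata is flattened into a plain map, and the real validator is more selective about which rules it skips. The sketch only shows the general idea that a fatal failure marks every later rule as not run instead of letting it crash.

```go
package main

import (
	"errors"
	"fmt"
)

// fatalError marks a failure after which later rules cannot safely run.
// Hypothetical type for illustration only.
type fatalError struct{ msg string }

func (e *fatalError) Error() string { return e.msg }

// rule is a hypothetical stand-in for one validation check.
type rule struct {
	title string
	check func(meta map[string]any) error
}

// result records the outcome of one rule.
type result struct {
	title  string
	status string // "passed", "failed", or "not run"
	detail string
}

// runRules runs rules in order and reports every rule after a fatal
// failure as "not run" instead of executing it.
func runRules(rules []rule, meta map[string]any) []result {
	results := make([]result, 0, len(rules))
	stopped := false
	for _, r := range rules {
		if stopped {
			results = append(results, result{r.title, "not run", "not checked"})
			continue
		}
		err := r.check(meta)
		if err == nil {
			results = append(results, result{r.title, "passed", ""})
			continue
		}
		results = append(results, result{r.title, "failed", err.Error()})
		var fatal *fatalError
		if errors.As(err, &fatal) {
			stopped = true
		}
	}
	return results
}

// demoRules returns three illustrative rules (not gpq's real rule set).
func demoRules() []rule {
	return []rule{
		{`metadata must include a "version" string`, func(m map[string]any) error {
			if _, ok := m["version"].(string); !ok {
				// Non-fatal: later rules can still run.
				return errors.New(`missing "version" in metadata`)
			}
			return nil
		}},
		{`optional "crs" must be null or a PROJJSON object`, func(m map[string]any) error {
			switch m["crs"].(type) {
			case nil, map[string]any:
				return nil
			case string:
				return &fatalError{`expected "crs" to be an object, got a string`}
			default:
				return &fatalError{`unexpected type for "crs"`}
			}
		}},
		{`optional "bbox" must be an array of 4 or 6 numbers`, func(m map[string]any) error {
			return nil // always passes in this sketch
		}},
	}
}

func main() {
	// "GEOGCRS[...]" is a placeholder string, not the file's real crs value.
	meta := map[string]any{"crs": "GEOGCRS[...]"}
	for _, r := range runRules(demoRules(), meta) {
		fmt.Printf("%-8s %s %s\n", r.status, r.title, r.detail)
	}
}
```

With a string crs, the first two rules fail and the third is reported as not run, which is the same shape as the validate summary: failures are listed, and dependent checks are skipped rather than crashing.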

In #89, I added some additional handling to still output the rest of the column info even if the geo metadata cannot be parsed. So now, the describe output for the file above looks like this:

# gpq describe counties.parquet
╭──────────┬────────┬────────────┬────────────┬─────────────╮
│ COLUMN   │ TYPE   │ ANNOTATION │ REPETITION │ COMPRESSION │
├──────────┼────────┼────────────┼────────────┼─────────────┤
│ geoid    │ binary │ string     │ 0..1       │ snappy      │
│ geometry │ binary │            │ 0..1       │ snappy      │
├──────────┼────────┴────────────┴────────────┴─────────────┤
│ Rows     │ 3233                                           │
╰──────────┴────────────────────────────────────────────────╯
 ⚠️  Metadata parsing failed, try running describe with the --metadata-only flag.  Error message: unable to parse geo metadata: json: cannot unmarshal string into Go struct field GeometryColumn.columns.crs of type geoparquet.Proj

The message still contains some ugly detail (cannot unmarshal string into Go struct field GeometryColumn.columns.crs of type geoparquet.Proj), but at least this is prefixed with a hint (try running describe with the --metadata-only flag).

When you do that, you see this output:

# gpq describe counties.parquet --metadata-only
{"primary_column": "geometry", "columns": {"geometry": {"crs": "GEOGCRS[\"WGS 84\",DATUM[\"World Geodetic System 1984\",ELLIPSOID[\"WGS 84\",6378137,298.257223563,LENGTHUNIT[\"metre\",1]]],PRIMEM[\"Greenwich\",0,ANGLEUNIT[\"degree\",0.0174532925199433]],CS[ellipsoidal,2],AXIS[\"geodetic latitude (Lat)\",north,ORDER[1],ANGLEUNIT[\"degree\",0.0174532925199433]],AXIS[\"geodetic longitude (Lon)\",east,ORDER[2],ANGLEUNIT[\"degree\",0.0174532925199433]],USAGE[SCOPE[\"unknown\"],AREA[\"World\"],BBOX[-90,-180,90,180]],ID[\"EPSG\",4326]]", "encoding": "WKB", "bbox": [-179.231086, -14.601813, 179.859681, 71.441059]}}, "schema_version": "0.1.0", "creator": {"library": "geopandas", "version": "0.8.1"}}

That is a straight dump of the geo metadata key value. The problem is that the "crs" member is a string instead of an object. It looks like geopandas is encoding the crs as a WKT string instead of a PROJJSON object.
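To illustrate why a string crs trips up a typed decoder, here is a small Go sketch. The `strictColumn`, `lenientColumn`, `crsKind`, and `inspectCRS` names are assumptions made for this example and do not reflect gpq's actual types; the WKT string is shortened to a stand-in. The strict struct only accepts an object for "crs", so decoding a string returns an unmarshal error, while deferring the field with `json.RawMessage` lets the code look at the JSON type first and report it in friendlier terms.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// strictColumn only accepts an object for "crs". The field layout is an
// assumption for illustration, not gpq's actual struct.
type strictColumn struct {
	CRS *struct {
		Name string `json:"name"`
	} `json:"crs"`
}

// lenientColumn defers decoding of "crs" so its JSON type can be inspected.
type lenientColumn struct {
	CRS json.RawMessage `json:"crs"`
}

// crsKind reports the JSON type of a raw "crs" value by looking at its
// first non-space byte.
func crsKind(raw json.RawMessage) string {
	trimmed := bytes.TrimSpace(raw)
	if len(trimmed) == 0 {
		return "missing"
	}
	switch trimmed[0] {
	case '{':
		return "object"
	case '"':
		return "string"
	case 'n':
		return "null"
	default:
		return "other"
	}
}

// inspectCRS returns the JSON type of "crs" in one column's metadata.
func inspectCRS(column []byte) (string, error) {
	var col lenientColumn
	if err := json.Unmarshal(column, &col); err != nil {
		return "", err
	}
	return crsKind(col.CRS), nil
}

func main() {
	// Shortened stand-in for the column metadata; the real WKT is much longer.
	column := []byte(`{"crs": "GEOGCRS[\"WGS 84\"]", "encoding": "WKB"}`)

	// Decoding into the typed struct fails because "crs" is a string.
	var strict strictColumn
	if err := json.Unmarshal(column, &strict); err != nil {
		fmt.Println("strict decode failed:", err)
	}

	// Decoding lazily succeeds and lets us name the unexpected type.
	kind, err := inspectCRS(column)
	if err != nil {
		fmt.Println("lenient decode failed:", err)
		return
	}
	fmt.Printf("crs is encoded as: %s\n", kind)
}
```

Checking the type before decoding into the full struct is one way a validator can produce a message such as "expected an object, got a string" rather than surfacing the raw decoder error.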

The output from validate now includes a bit more detail on what is wrong:

# gpq validate counties.parquet

Summary: Passed 5 checks, failed 3 checks, 12 checks not run.

 ✓ file must include a "geo" metadata key
 ✓ metadata must be a JSON object
 ✗ metadata must include a "version" string
   ↳ missing "version" in metadata
 ✓ metadata must include a "primary_column" string
 ✓ metadata must include a "columns" object
 ! column metadata must include the "primary_column" name
   ↳ not checked
 ✓ column metadata must include a valid "encoding" string
 ✗ column metadata must include a "geometry_types" list
   ↳ missing "geometry_types" for column "geometry"
 ✗ optional "crs" must be null or a PROJJSON object
   ↳ expected "crs" for column "geometry" to be an object, got a string: "GEOGCRS[\"WGS 84\",DATUM[\"World Geodetic System 1984\",ELLIPSOID[\"WGS 84\",6378137,298.257223563,LENGTHUNIT[\"metre\",1]]],PRIMEM[\"Greenwich\",0,ANGLEUNIT[\"degree\",0.0174532925199433]],CS[ellipsoidal,2],AXIS[\"geodetic latitude (Lat)\",north,ORDER[1],ANGLEUNIT[\"degree\",0.0174532925199433]],AXIS[\"geodetic longitude (Lon)\",east,ORDER[2],ANGLEUNIT[\"degree\",0.0174532925199433]],USAGE[SCOPE[\"unknown\"],AREA[\"World\"],BBOX[-90,-180,90,180]],ID[\"EPSG\",4326]]"
 ! optional "orientation" must be a valid string
   ↳ not checked
 ! optional "edges" must be a valid string
   ↳ not checked
 ! optional "bbox" must be an array of 4 or 6 numbers
   ↳ not checked
 ! optional "epoch" must be a number
   ↳ not checked
 ! geometry columns must not be grouped
   ↳ not checked
 ! geometry columns must be stored using the BYTE_ARRAY parquet type
   ↳ not checked
 ! geometry columns must be required or optional, not repeated
   ↳ not checked
 ! all geometry values match the "encoding" metadata
   ↳ not checked
 ! all geometry types must be included in the "geometry_types" metadata (if not empty)
   ↳ not checked
 ! all polygon geometries must follow the "orientation" metadata (if present)
   ↳ not checked
 ! all geometries must fall within the "bbox" metadata (if present)
   ↳ not checked

The new part there is the detail line: expected "crs" for column "geometry" to be an object, got a string.

cholmes commented 9 months ago

Awesome, looks great - thanks for the fast response. And yeah, I figure we're going to see more and more non-compliant parquet / geoparquet, so it'll be a continual process to get the errors to be really informative.