planetlabs / gpq

Utility for working with GeoParquet
https://planetlabs.github.io/gpq/
Apache License 2.0
138 stars 8 forks source link

Better warnings / info in `describe` on non-compliant GeoParquet #87

Closed cholmes closed 9 months ago

cholmes commented 9 months ago

The new describe is awesome, but if I put in non-compliant geoparquet there's little messaging that I have a file that's not quite right:

 gpq describe taxi.parquet 
╭────────────┬────────┬─────────────────────────────────┬────────────┬─────────────╮
│ COLUMN     │ TYPE   │ ANNOTATION                      │ REPETITION │ COMPRESSION │
├────────────┼────────┼─────────────────────────────────┼────────────┼─────────────┤
│ OBJECTID   │ int32  │ int(bitwidth=32, issigned=true) │ 0..1       │ snappy      │
│ Shape_Leng │ double │                                 │ 0..1       │ snappy      │
│ Shape_Area │ double │                                 │ 0..1       │ snappy      │
│ zone       │ binary │ string                          │ 0..1       │ snappy      │
│ LocationID │ int32  │ int(bitwidth=32, issigned=true) │ 0..1       │ snappy      │
│ borough    │ binary │ string                          │ 0..1       │ snappy      │
│ geom       │ binary │                                 │ 0..1       │ snappy      │
├────────────┼────────┼─────────────────────────────────┴────────────┴─────────────┤
│ ROWS       │ 262    │                                                            │
╰────────────┴────────┴────────────────────────────────────────────────────────────╯

If I convert it then I get an additional row:

├──────────┼────────┼────────────┴────────────┴─────────────┴──────────┴────────────────┴────────┴────────┤
│ ROWS     │ 3233   │                                                                                     │
│ VERSION  │ 1.0.0  │                                                                                     │
╰──────────┴────────┴─────────────────────────────────────────────────────────────────────────────────────╯

I think it'd be nice to try to always put something about the compliance. Like maybe always have VERSION, but if it's not compliant than say non-compliant (might also be nice to call it 'geoparquet version' or something). It could also be nice to say if it's a 'compatible parquet' file, like it's geom and data looks like 4326, and recommend people use gpq convert.

tschaub commented 9 months ago

Thanks for reporting these issues. If you can link to the invalid files, I can see about improving the output.

I did update the table output so it says "GeoParquet Version" now instead of "VERSION". But this only shows up if the geo metadata is valid. The describe output also now includes some error messages after the table if the geo metadata is invalid. See details here https://github.com/planetlabs/gpq/issues/86#issuecomment-1747406665.

The describe command doesn't repeat all of the logic in the validate command, and I'd like to avoid munging the two things together. I'll make another change to describe that suggests running validate if it looks like you are trying to describe GeoParquet (if there is a "geo" key in the file key/value metadata) but it is not valid.

tschaub commented 9 months ago

I added some additional detail to the describe output in #87. This covers two cases where the input file is not valid GeoParquet: 1) if it is missing the "geo" metadata key and 2) if the command fails to parse the "geo" metadata value.

This doesn't cover all cases of invalid GeoParquet. For example, the "geo" metadata can match the JSON schema, but the data might be missing the geometry column. Or the geometry column may not be of the correct physical type. Or the geometry column could be a group or repeated. Or the metadata might have a list of geometry types but one of the geometry values might not be in that list. Or the metadata might have a bbox but one of the geometry values might fall outside that bbox. These are examples of things that the validate command checks. The describe command doesn't validate, so it isn't going to catch all these and other cases of invalid GeoParquet. Hope this makes sense.

cholmes commented 9 months ago

Cool, that file I used is: taxi.parquet.zip (just zipped it so I could upload to github).

And yeah, makes a lot of sense for describe to not repeat the logic as validate. Will try out the new version - I think the one thing I was thinking that I don't immediately see is for describe to just check if there's just not a geo key in the metadata.... And as I was editing this I think you edited your comment to make that clear - pretty sure we're on the same page, will try it out.

tschaub commented 9 months ago

With the latest changes, here is the describe output on that file:

# gpq describe taxi.parquet
╭────────────┬────────┬─────────────────────────────────┬────────────┬─────────────╮
│ COLUMN     │ TYPE   │ ANNOTATION                      │ REPETITION │ COMPRESSION │
├────────────┼────────┼─────────────────────────────────┼────────────┼─────────────┤
│ OBJECTID   │ int32  │ int(bitwidth=32, issigned=true) │ 0..1       │ snappy      │
│ Shape_Leng │ double │                                 │ 0..1       │ snappy      │
│ Shape_Area │ double │                                 │ 0..1       │ snappy      │
│ zone       │ binary │ string                          │ 0..1       │ snappy      │
│ LocationID │ int32  │ int(bitwidth=32, issigned=true) │ 0..1       │ snappy      │
│ borough    │ binary │ string                          │ 0..1       │ snappy      │
│ geom       │ binary │                                 │ 0..1       │ snappy      │
├────────────┼────────┴─────────────────────────────────┴────────────┴─────────────┤
│ Rows       │ 262                                                                 │
│ Row Groups │ 1                                                                   │
╰────────────┴─────────────────────────────────────────────────────────────────────╯
 ⚠️  Not a valid GeoParquet file (missing the "geo" metadata key). Run convert to try to convert it to GeoParquet.

Then, if you try to run convert on it (but forget the --input-primary-column flag):

# gpq convert taxi.parquet taxi-geo.parquet   
gpq: error: expected a geometry column named "geometry", use the --input-primary-column to supply a different primary geometry

When you do include detail about the primary column (convert taxi.parquet taxi-geo.parquet --input-primary-column geom), it appears to succeed.

However, if you then try to validate it:

# gpq validate taxi-geo.parquet

Summary: Passed 16 checks, failed 1 check, 3 checks not run.

 ✓ file must include a "geo" metadata key
 ✓ metadata must be a JSON object
 ✓ metadata must include a "version" string
 ✓ metadata must include a "primary_column" string
 ✓ metadata must include a "columns" object
 ✓ column metadata must include the "primary_column" name
 ✓ column metadata must include a valid "encoding" string
 ✓ column metadata must include a "geometry_types" list
 ✓ optional "crs" must be null or a PROJJSON object
 ✓ optional "orientation" must be a valid string
 ✓ optional "edges" must be a valid string
 ✓ optional "bbox" must be an array of 4 or 6 numbers
 ✓ optional "epoch" must be a number
 ✓ geometry columns must not be grouped
 ✓ geometry columns must be stored using the BYTE_ARRAY parquet type
 ✓ geometry columns must be required or optional, not repeated
 ✗ all geometry values match the "encoding" metadata
   ↳ invalid geometry in column "geom": wkb: invalid data
 ! all geometry types must be included in the "geometry_types" metadata (if not empty)
   ↳ not checked
 ! all polygon geometries must follow the "orientation" metadata (if present)
   ↳ not checked
 ! all geometries must fall within the "bbox" metadata (if present)
   ↳ not checked

The geom column is a byte array, but it looks like it might not be valid WKB.

DuckDB reports this using the original taxi.parquet:

select ST_GeomFromWKB(geom) from 'taxi.parquet';
Error: Invalid Error: Not implemented Error: Geometry type not supported

So convert succeeds because the column is of the expected type (a byte array). The command tries to be efficient in copying the input column to the output if it decides it doesn't need to do any transform (like reading WKT and writing WKB). So it doesn't validate the geometry bytes on the way through.

Still stuff to work on, but I'm going to cut a release with the current changes because I think they improve things.

cholmes commented 9 months ago

The geom column is a byte array, but it looks like it might not be valid WKB. DuckDB reports this using the original taxi.parquet:

Ah, interesting. So I made this one with a duckdb call: copy (select * from 'taxi_zones/taxi.fgb') TO 'taxi.parquet' with (format parquet); So I think the binary it wrote out was actually the DuckDB geometry directly into parquet. I think that's new for 0.9.0 in DuckDB - I believe in 0.8.1 it'd put everything in WKB when you read it in, so if you just wrote it out then it'd be WKB. So I was expecting that behavior - was used to writing out geometries 'with (format parquet)' and getting 'compatible' parquets. I also missed that the fgb named it 'geom' instead of 'geometry'.

So this particular wonky file should be 'solved' when DuckDB gets actual GeoParquet output directly.