planetlabs / gpq

Utility for working with GeoParquet
https://planetlabs.github.io/gpq/
Apache License 2.0

Support Overture parquet conversion to GeoParquet #57

Closed cholmes closed 1 year ago

cholmes commented 1 year ago

The new Overture Maps data is Parquet with WKB geometries, but when I try to convert it I get:

% gpq convert 20230725_211237_00132_5p54t_25816df1-b864-49c0-a9a3-a13da4f37a90 out2.parquet --from=parquet --to=geoparquet
gpq: error: encoding parquet data page: encoding not supported for type BYTE_ARRAY

Sample data is at https://storage.googleapis.com/open-geodata/ch/20230725_211237_00132_5p54t_3b7d7eb3-dd9c-442a-a9b9-404dc936c5d9
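
For readers unfamiliar with the format, here is a minimal sketch of what "parquet in WKB" means in practice. It assumes the geometry column holds standard well-known binary values stored as Parquet BYTE_ARRAY, and it handles only the simplest case, a 2D Point. The function name and the sample coordinates are hypothetical and are not taken from the Overture data.

```python
import struct


def decode_wkb_point(wkb: bytes) -> tuple[float, float]:
    """Decode a 2D WKB Point into an (x, y) tuple."""
    # Byte 0 is the byte-order flag: 1 = little-endian, 0 = big-endian.
    order = "<" if wkb[0] == 1 else ">"
    # Bytes 1-4 hold the geometry type code; 1 means Point.
    (geom_type,) = struct.unpack_from(order + "I", wkb, 1)
    if geom_type != 1:
        raise ValueError(f"expected WKB Point (type 1), got type {geom_type}")
    # Bytes 5-20 hold the x and y coordinates as two 8-byte doubles.
    x, y = struct.unpack_from(order + "dd", wkb, 5)
    return x, y


# Hypothetical sample value, built by hand for illustration only.
sample = struct.pack("<BIdd", 1, 1, -122.4, 37.8)
print(decode_wkb_point(sample))
```

A plain Parquet file can carry these bytes without saying what they are. GeoParquet adds file-level metadata identifying the geometry column and its encoding, which is what the conversion is meant to supply.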

mtravis commented 1 year ago

@cholmes

I've downloaded the admin data and passed it through DuckDB:

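import duckdb

# Assumed setup: an in-memory DuckDB connection
db = duckdb.connect()

# Rewrite the country-level admin rows to a new Parquet file
db.execute("""
COPY (
    SELECT *
    FROM '**/*.parquet'
    WHERE adminLevel = 2
      AND isocountrycodealpha2 IS NOT NULL
) TO 'admin-countries.parquet'
""")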

With this I can then convert to geoparquet using gpq.

I guess this should just work without the need to use DuckDB though?
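
The two-step workaround described above (rewrite with DuckDB, then convert with gpq) could be scripted roughly as follows. This is only a sketch: the helper names and the output file name admin-countries-geo.parquet are hypothetical, the paths are interpolated without escaping, and the gpq flags are the ones used in the original report. The block only builds and prints the query and command; the commented lines at the end show how they might be executed if the duckdb package and the gpq binary are installed.

```python
import shlex


def build_copy_query(source_glob, out_path, admin_level=2):
    """Return a DuckDB COPY statement that rewrites matching rows to Parquet."""
    # Note: paths are inserted as-is, so this sketch assumes they contain no quotes.
    return (
        "COPY (\n"
        "    SELECT *\n"
        f"    FROM '{source_glob}'\n"
        f"    WHERE adminLevel = {int(admin_level)}\n"
        "      AND isocountrycodealpha2 IS NOT NULL\n"
        f") TO '{out_path}' (FORMAT PARQUET)"
    )


def build_gpq_command(src, dst):
    """Return the gpq invocation from the original report as an argument list."""
    return ["gpq", "convert", src, dst, "--from=parquet", "--to=geoparquet"]


query = build_copy_query("**/*.parquet", "admin-countries.parquet")
command = build_gpq_command("admin-countries.parquet", "admin-countries-geo.parquet")
print(query)
print(shlex.join(command))

# To actually run both steps (requires the duckdb package and the gpq binary):
# import duckdb, subprocess
# duckdb.connect().execute(query)
# subprocess.run(command, check=True)
```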

cholmes commented 1 year ago

@mtravis - funny, I just came here to make the same comment, as I had noticed that too.

Yeah, running it through DuckDB in most any way seems to work fine, so it seems to not be anything fundamental with the structure of that data.

tschaub commented 1 year ago

I get an error trying to read this file using the Arrow libs directly. I've ticketed this as https://github.com/apache/arrow/issues/37968.

I'll work on trying to narrow it down.

tschaub commented 1 year ago

This now works in the latest release. If using brew, you can brew update && brew install planetlabs/tap/gpq to install the latest. And you can run gpq version to see what version you have installed.

# the file above is now converted to valid geoparquet
gpq convert overture.parquet --to geoparquet | gpq validate

tschaub commented 1 year ago

In case it is of interest to Overture users, I opened a discussion about the Parquet schema here: https://github.com/OvertureMaps/schema/discussions/55

Basically, the current Parquet schema for names and sources is not as specific as it could be. For example, it allows arbitrary properties for names instead of restricting them to the common, official, alternate, and short properties described in the JSON Schema. If you think a more specific schema would be harmful or helpful, please chime in.