planetlabs / gpq

Utility for working with GeoParquet
https://planetlabs.github.io/gpq/
Apache License 2.0

Support for convert to stdout #78

Closed: bdon closed 10 months ago

bdon commented 10 months ago

I'd like to do something like this:

gpq convert Cairo_Governorate.parquet --stdout --to=geojson | tippecanoe -o Cairo_Governorate.pmtiles --drop-densest-as-needed

Would this functionality be useful? It would require some changes in convert.go to allow for a blank positional output argument.

tschaub commented 10 months ago

Hey @bdon - nice idea. I put together #79 to make all the commands optionally work with stdin/stdout.

If you omit the output arg in the convert command, it writes to stdout. That's not as explicit as a --stdout arg, but hopefully it isn't too tricky.
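For anyone following along, here is a minimal Go sketch of the "empty output argument means stdout" idea. It is not taken from gpq or #79; `openOutput` and `nopCloser` are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// nopCloser wraps a writer so that Close does nothing. This keeps cleanup
// code from closing os.Stdout.
type nopCloser struct{ io.Writer }

func (nopCloser) Close() error { return nil }

// openOutput is a hypothetical helper (not gpq's actual code): an empty path
// selects stdout, anything else creates or truncates the named file.
func openOutput(path string) (io.WriteCloser, error) {
	if path == "" {
		return nopCloser{os.Stdout}, nil
	}
	return os.Create(path)
}

func main() {
	// The optional positional output argument; it stays empty when omitted.
	var output string
	if len(os.Args) > 1 {
		output = os.Args[1]
	}

	w, err := openOutput(output)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer w.Close()

	// Stand-in for the converted data.
	fmt.Fprintln(w, `{"type":"FeatureCollection","features":[]}`)
}
```

Run with no arguments, the sketch prints to stdout so it can be piped; with one argument, it writes to that file instead.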

bdon commented 10 months ago

All together!

curl https://data.source.coop/cholmes/google-open-buildings/geoparquet-admin1/country=EGY/Cairo_Governorate.parquet | ./gpq convert --from=geoparquet --to=geojson | tippecanoe -o buildings.pmtiles --force --drop-densest-as-needed

tschaub commented 10 months ago

Included in the v0.15.0 release (brew update && brew install planetlabs/tap/gpq or download from the release page).

tschaub commented 10 months ago

@bdon - you'll probably notice that this needs to buffer the whole file, since the Parquet metadata is in the footer. But that suggests another enhancement: accepting a URL for the input. Then, if ranged reads are supported, the metadata could be read first (and then maybe only one data page would need to be buffered at a time).
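To illustrate why ranged reads help: a Parquet file ends with a 4-byte little-endian metadata length followed by the magic bytes `PAR1`, so a reader with random access can locate the footer from the last 8 bytes alone. The sketch below assumes an `io.ReaderAt` and a known file size; `footerRange` and `fakeParquet` are hypothetical names, and this is not how gpq reads files.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// trailerSize is the 4-byte metadata length plus the 4-byte "PAR1" magic.
const trailerSize = 8

// footerRange returns the offset and length of the serialized file metadata,
// reading only the last 8 bytes of the file.
func footerRange(r io.ReaderAt, size int64) (int64, int64, error) {
	if size < trailerSize+4 {
		return 0, 0, errors.New("too small to be a Parquet file")
	}
	trailer := make([]byte, trailerSize)
	n, err := r.ReadAt(trailer, size-trailerSize)
	if n < trailerSize {
		if err == nil {
			err = io.ErrUnexpectedEOF
		}
		return 0, 0, err
	}
	if !bytes.Equal(trailer[4:], []byte("PAR1")) {
		return 0, 0, errors.New("missing PAR1 magic at end of file")
	}
	length := int64(binary.LittleEndian.Uint32(trailer[:4]))
	offset := size - trailerSize - length
	if offset < 4 { // the file also starts with a 4-byte magic
		return 0, 0, errors.New("metadata length exceeds file size")
	}
	return offset, length, nil
}

// fakeParquet builds an in-memory stand-in with the right framing but
// placeholder contents, for demonstration only.
func fakeParquet(body, meta string) *bytes.Reader {
	var buf bytes.Buffer
	buf.WriteString("PAR1")
	buf.WriteString(body)
	buf.WriteString(meta)
	var n [4]byte
	binary.LittleEndian.PutUint32(n[:], uint32(len(meta)))
	buf.Write(n[:])
	buf.WriteString("PAR1")
	return bytes.NewReader(buf.Bytes())
}

func main() {
	r := fakeParquet("rowgroups", "metadata")
	offset, length, err := footerRange(r, r.Size())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("metadata at offset %d, %d bytes\n", offset, length)
}
```

With a remote file, the same two reads (trailer, then metadata) would each be a ranged request, which is why stdin input has to be buffered in full first.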

bdon commented 10 months ago

@tschaub have you looked into using https://gocloud.dev for reading Parquet?

For https://github.com/protomaps/go-pmtiles/blob/main/pmtiles/extract.go#L276 I use only the blob functionality, but that means it supports GCP, Azure, and S3-compatible blob storage with credentials out of the box. I had to add a layer of abstraction to handle public unauthenticated HTTP URLs but it was otherwise simple.

tschaub commented 10 months ago

I've used similar libs, but not yet gocloud.dev, will check it out.

My ideal would be a multi-cloud blob reader that implemented io.ReadSeeker and io.ReaderAt (I know this isn't efficient for all providers, but it is possible - with lots of guessing to know how much to buffer for the seeker reads).
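A rough sketch of that shape, under assumptions: if a provider can serve "length bytes at offset", then an `io.ReaderAt` falls out directly, and the standard library's `io.NewSectionReader` turns any `io.ReaderAt` into an `io.ReadSeeker`. `RangeFunc`, `rangeReaderAt`, `open`, and `memoryFetcher` are hypothetical names; a real implementation would back `RangeFunc` with HTTP range requests or a blob library, and would likely add buffering so that small seeker reads do not each become a request.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// RangeFunc is a hypothetical stand-in for a provider call that returns
// length bytes starting at offset.
type RangeFunc func(offset, length int64) ([]byte, error)

// rangeReaderAt adapts a RangeFunc to io.ReaderAt. Each ReadAt is one fetch.
type rangeReaderAt struct {
	fetch RangeFunc
	size  int64
}

func (r *rangeReaderAt) ReadAt(p []byte, off int64) (int, error) {
	if off < 0 {
		return 0, errors.New("negative offset")
	}
	if len(p) == 0 {
		return 0, nil
	}
	if off >= r.size {
		return 0, io.EOF
	}
	want := int64(len(p))
	if off+want > r.size {
		want = r.size - off // clamp to the end of the blob
	}
	data, err := r.fetch(off, want)
	if err != nil {
		return 0, err
	}
	n := copy(p, data)
	if int64(n) < want {
		return n, io.ErrUnexpectedEOF
	}
	if n < len(p) {
		return n, io.EOF
	}
	return n, nil
}

// open returns both views over the same blob. The ReadSeeker comes for free
// from io.NewSectionReader.
func open(fetch RangeFunc, size int64) (io.ReaderAt, io.ReadSeeker) {
	ra := &rangeReaderAt{fetch: fetch, size: size}
	return ra, io.NewSectionReader(ra, 0, size)
}

// memoryFetcher serves ranges from a byte slice, for demonstration only.
func memoryFetcher(data []byte) RangeFunc {
	return func(offset, length int64) ([]byte, error) {
		if offset < 0 || length < 0 || offset+length > int64(len(data)) {
			return nil, errors.New("range out of bounds")
		}
		return data[offset : offset+length], nil
	}
}

func main() {
	ra, rs := open(memoryFetcher([]byte("0123456789")), 10)

	buf := make([]byte, 4)
	if _, err := ra.ReadAt(buf, 6); err != nil {
		fmt.Println("read error:", err)
		return
	}
	fmt.Printf("%s\n", buf)

	if _, err := rs.Seek(-3, io.SeekEnd); err != nil {
		fmt.Println("seek error:", err)
		return
	}
	tail, err := io.ReadAll(rs)
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	fmt.Printf("%s\n", tail)
}
```

The blob size has to be known up front here (for example from a HEAD request or object attributes), which is one of the places where providers differ.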

bdon commented 10 months ago

For PMTiles, go-pmtiles uses bucket.NewRangeReader without any guessing: it downloads the entire (compressed) relevant part of the index in advance, and then pre-merges request ranges to avoid thousands of small requests, before fetching any actual "features" (tiles).
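The pre-merging step can be sketched roughly like this. This is an illustration of the general technique, not the go-pmtiles code: `byteRange`, `mergeRanges`, and the `maxGap` threshold are hypothetical, and the right gap size depends on request overhead versus wasted bytes.

```go
package main

import (
	"fmt"
	"sort"
)

// byteRange describes one requested span of the remote file.
type byteRange struct {
	Offset int64
	Length int64
}

// mergeRanges sorts the ranges by offset and coalesces any that overlap or
// are separated by at most maxGap bytes, trading a few wasted bytes for
// fewer requests. The input slice is not modified.
func mergeRanges(ranges []byteRange, maxGap int64) []byteRange {
	if len(ranges) == 0 {
		return nil
	}
	sorted := append([]byteRange(nil), ranges...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].Offset < sorted[j].Offset
	})

	merged := []byteRange{sorted[0]}
	for _, r := range sorted[1:] {
		last := &merged[len(merged)-1]
		lastEnd := last.Offset + last.Length
		if r.Offset-lastEnd <= maxGap {
			// Close enough: extend the previous range if r reaches further.
			if end := r.Offset + r.Length; end > lastEnd {
				last.Length = end - last.Offset
			}
			continue
		}
		merged = append(merged, r)
	}
	return merged
}

func main() {
	requests := []byteRange{{Offset: 100, Length: 10}, {Offset: 0, Length: 10}, {Offset: 12, Length: 5}}
	for _, r := range mergeRanges(requests, 4) {
		fmt.Printf("fetch %d bytes at offset %d\n", r.Length, r.Offset)
	}
}
```

Each merged range would then become a single ranged read, with the individual pieces sliced out of the response.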

Is similar batching needed for GeoParquet reads to be effective? I haven't delved deeply into actual reader implementations yet.