planetlabs / gpq

Utility for working with GeoParquet
https://planetlabs.github.io/gpq/
Apache License 2.0

Support control over number of row groups as an option #65

Closed cholmes closed 1 year ago

cholmes commented 1 year ago

When converting to GeoParquet, it can be useful to write more (and therefore smaller) row groups, which makes querying large files more efficient. See https://github.com/opengeospatial/geoparquet/discussions/183

GDAL's equivalent option is 'ROW_GROUP_SIZE=<integer>: Defaults to 65536. Maximum number of rows per group.'

That seems reasonable, though I was using a default of around 20k rows for my experiments, so we could consider a smaller default. I didn't see negative effects, but something I read said that if you have lots of Parquet files, a smaller row group size can slow down gathering stats across the whole set. I have around 500 individual Parquet files, so perhaps it only comes into effect with thousands or tens of thousands?
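To make the trade-off concrete, here is a rough back-of-the-envelope sketch of how the row group size changes the number of row groups (and so the number of per-group stats entries a reader has to look at). The file count matches the roughly 500 files mentioned above, but the rows-per-file figure is purely an assumption for illustration, and `row_group_count` is a hypothetical helper, not part of gpq or GDAL.

```python
def row_group_count(num_rows: int, max_rows_per_group: int) -> int:
    """Row groups needed when each group holds at most max_rows_per_group rows."""
    if max_rows_per_group <= 0:
        raise ValueError("max_rows_per_group must be positive")
    # Integer ceiling division: a partially filled last group still counts.
    return (num_rows + max_rows_per_group - 1) // max_rows_per_group


# Assumed dataset shape (illustrative only): 500 files, 1,000,000 rows each.
FILES = 500
ROWS_PER_FILE = 1_000_000

for size in (65536, 20_000):
    per_file = row_group_count(ROWS_PER_FILE, size)
    total = per_file * FILES
    print(f"row group size {size}: {per_file} groups per file, {total} across the set")
```

With these assumed numbers the smaller size roughly triples the number of row groups to scan for set-wide stats, which is the cost to weigh against finer-grained filtering within each file.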

cholmes commented 1 year ago

Oh, the other thing that would be nice is to maintain the number of row groups in a Parquet to GeoParquet conversion. I tried this and the original row groups didn't seem to be preserved.
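One way to think about preserving the layout: record the row count of each row group in the input, then flush an output row group at the same boundaries. The sketch below only illustrates that regrouping logic on plain Python values; `regroup` is a hypothetical helper, and reading the sizes from the input metadata and writing real row groups is left to whatever Parquet library the converter uses.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
_END = object()  # sentinel so that rows equal to None are handled correctly


def regroup(rows: Iterable[T], group_sizes: Sequence[int]) -> Iterator[List[T]]:
    """Yield rows in batches whose lengths match the source row group sizes."""
    it = iter(rows)
    for size in group_sizes:
        group = list(islice(it, size))
        if len(group) != size:
            raise ValueError("input ended before all row groups were filled")
        yield group
    # Any leftover rows mean the recorded sizes did not cover the input.
    if next(it, _END) is not _END:
        raise ValueError("input has more rows than the row group sizes cover")


# Example: an input with three row groups of 3, 3 and 1 rows.
source_group_sizes = [3, 3, 1]
for batch in regroup(range(7), source_group_sizes):
    print(len(batch), batch)
```

Each yielded batch would correspond to one row group in the output file, so the output ends up with the same number of row groups as the input.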