planetlabs / gpq

Utility for working with GeoParquet
https://planetlabs.github.io/gpq/
Apache License 2.0

About compression: is it normal for it to be so low? #143

Open aborruso opened 9 months ago

aborruso commented 9 months ago

Hi, I'm testing gpq on the official administrative boundaries of Italy. The source file is this zip file: https://www.istat.it/storage/cartografia/confini_amministrativi/non_generalizzati/2023/Limiti01012023.zip

The archive contains a folder structure with shapefiles in it. I am running the tests on the Limiti01012023/Com01012023/Com01012023_WGS84.shp file:

They are almost equal in size. Some notes:

I know these outputs are not directly comparable. Still, the compression in the gpq output seems very limited to me. Is that normal, or am I doing something wrong?

Below is how I ran all the tests.

Thank you

wget -O file.zip "https://www.istat.it/storage/cartografia/confini_amministrativi/non_generalizzati/2023/Limiti01012023.zip"

unzip -o file.zip -d .

ogr2ogr -f GeoJSON -t_srs EPSG:4326 comuni.geojson Limiti01012023/Com01012023/Com01012023_WGS84.shp -lco "RFC7946=YES"

gpq convert --compression="gzip" --max 1000 --from="geojson" comuni.geojson comuni_compressed.parquet

gpq convert --compression="uncompressed" --max 1000 --from="geojson" comuni.geojson comuni_uncompressed.parquet

ogr2ogr -t_srs EPSG:4326 Com01012023_WGS84.shp.zip Limiti01012023/Com01012023/Com01012023_WGS84.shp
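
To put numbers on the comparison, the output sizes can be listed side by side. The helper below is only a sketch and was not part of the original test: `compare_sizes` is a hypothetical name, and it relies on nothing beyond `wc` and `awk`. It prints each file's size in bytes and its size relative to the first file given.

```shell
# Hypothetical helper (not part of gpq or GDAL): print each file's size in
# bytes and its size relative to the first file passed on the command line.
compare_sizes() {
  base=$(wc -c < "$1")
  for f in "$@"; do
    size=$(wc -c < "$f")
    # awk does the floating-point division that POSIX sh lacks;
    # the ratio is reported as 0 if the reference file is empty.
    awk -v n="$f" -v s="$size" -v b="$base" \
      'BEGIN { printf "%s\t%d bytes\t%.2f\n", n, s, (b > 0 ? s / b : 0) }'
  done
}
```

Calling it as `compare_sizes comuni_uncompressed.parquet comuni_compressed.parquet` would use the uncompressed file as the reference, so a ratio close to 1.00 on the second line means gzip saved very little.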

aborruso commented 9 months ago

I have also tested Parquet gzip compression using GDAL, and I get a 49 MB output.