opengeospatial / geoparquet

Specification for storing geospatial vector data (point, line, polygon) in Parquet
https://geoparquet.org
Apache License 2.0
838 stars 57 forks source link

Read version number from the schema #159

Closed tschaub closed 1 year ago

tschaub commented 1 year ago

The goal of this branch is to reduce the number of places the version number is repeated. Here is a summary of the changes and motivation for each:

kylebarron commented 1 year ago

Looks like https://github.com/opengeospatial/geoparquet/pull/24 was what added environment.yml (cc @TomAugspurger). I personally prefer poetry and pip/PyPI for packages, but maybe on Windows Conda is still easier.

tschaub commented 1 year ago

One issue with the script that generates the example.parquet is that it depends on whatever geopandas.datasets.get_path("naturalearth_lowres") resolves to. I've seen that the resulting parquet dataset can vary. For example, the "crs" data may have a "$schema": "https://proj.org/schemas/v0.5/projjson.schema.json" or "$schema": "https://proj.org/schemas/v0.4/projjson.schema.json". I wonder if this depends on the specific version of geopandas. If so, can Poetry help us get something more repeatable here?

tschaub commented 1 year ago

It looks like the example.parquet file differs, but that the geo metadata is the same. Perhaps we don't expect a bitwise match for parquet files generated on different systems.

kylebarron commented 1 year ago

Perhaps we don't expect a bitwise match for parquet files generated on different systems.

Parquet writers usually include the name of the implementation in the metadata, so it won't be bitwise equal across implementations

import pandas as pd
import pyarrow.parquet as pq
df = pd.DataFrame({'a': [1, 2, 3, 4]})
df.to_parquet('tmp.parquet')
pq.read_metadata('tmp.parquet').created_by
# 'parquet-cpp-arrow version 10.0.1'
jorisvandenbossche commented 1 year ago

One issue with the script that generates the example.parquet is that it depends on whatever geopandas.datasets.get_path("naturalearth_lowres") resolves t

It's probably good to rely on this for the example data. For one, we don't guarantee it to be stable as we might update it with new naturalearth data releases (which is maybe not a problem for our use case), but the dataset also has some political problems (like Crimea belonging to Russia, which is a reason we would prefer to remove it from geopandas on the long term).

Maybe we can find some other online data source to use as example?

jorisvandenbossche commented 1 year ago

I personally prefer conda, and would rather not use poetry. But I also don't think it is worth supporting both since in the end it is just for running this script, which is not something one will have to do often yourself. So I am OK with continue to use poetry here.

But we should maybe avoid relying on GDAL (fiona) then to read in the example data to keep the installation of the env for the scripts simpler (eg if we read from a geojson file, GDAL is not necessarily needed).

tschaub commented 1 year ago

But we should maybe avoid relying on GDAL (fiona) then to read in the example data to keep the installation of the env for the scripts simpler (eg if we read from a geojson file, GDAL is not necessarily needed).

I agree that we could reduce dependencies and simplify creation of the example file. And I agree that something besides political boundaries would be good. Let's take this up separately from the changes here.

tschaub commented 1 year ago

@kylebarron or @jorisvandenbossche - Any more comments on these changes? We definitely have more to discuss and probably do related to CI, dependencies, example data, etc. But I'm wondering if we can get in the changes that reduce the duplicated version identifiers to simplify the release process (#146).

I updated the description of the PR to include more details on the changes and their motivation in case that helps.