planetlabs / gpq

Utility for working with GeoParquet
https://planetlabs.github.io/gpq/
Apache License 2.0
133 stars 7 forks source link

Specify the dst crs in convert? #160

Open FlorisCalkoen opened 3 months ago

FlorisCalkoen commented 3 months ago

I don't think gpq currently contains a method to specify the target crs. Also I see that by default you use "OGC:CRS84", what is your rationale for that? Why not, for example, use "EPSG:4326"?

I'll add a little bit of context on my use case. So I just used gpq to convert a 'big' collection of parquet files to geoparquet by simply doing gpq convert non-geo.parquet valid-geo.parquet in a for loop. Further in my processing chain I load these geoparquet files using GeoPandas, but I ran into an issue because when the crs == "OGC:CRS84" it cannot be converted to epgs. Although it's expected behaviour I'm mostly just curious why you use "OGC:CRS84" instead of "EPSG:4326".

gdf = gpd.read_parquet("valid-geo.parquet")
print(gdf.crs.to_epsg()) # None
print(gdf.to_crs(4326).to_epsg()) # 4326

I'll probably change my routines from gdf.crs.to_epsg() to gdf.crs.to_string(), but I guess that several others rely on to_epsg() as well when using GeoPandas, so I thought it's worth opening a discussion point here.

FlorisCalkoen commented 3 months ago

Following up, my guess is that dask_geopandas is also struggling to read GeoParquet files thave have been converted with gpq due to a similar issue/decision that has been made in the gpq crs conversion/specification. See example below:

storage_options = {"account_name": <account_name> "credential": <token>}
href = "<protocol>/<container>/<prefix>/valid-geo.parquet"
gdf = dask_geopandas.read_parquet(href, storage_options=storage_options)
Python traceback ```python-traceback --------------------------------------------------------------------------- KeyError Traceback (most recent call last) File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask/backends.py:136, in CreationDispatch.register_inplace..decorator..wrapper(*args, **kwargs) 135 try: --> 136 return func(*args, **kwargs) 137 except Exception as e: File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask/dataframe/io/parquet/core.py:538, in read_parquet(path, columns, filters, categories, index, storage_options, engine, use_nullable_dtypes, dtype_backend, calculate_divisions, ignore_metadata_file, metadata_task_size, split_row_groups, blocksize, aggregate_files, parquet_file_extension, filesystem, **kwargs) 536 blocksize = None --> 538 read_metadata_result = engine.read_metadata( 539 fs, 540 paths, 541 categories=categories, 542 index=index, 543 use_nullable_dtypes=use_nullable_dtypes, 544 dtype_backend=dtype_backend, 545 gather_statistics=calculate_divisions, 546 filters=filters, 547 split_row_groups=split_row_groups, 548 blocksize=blocksize, 549 aggregate_files=aggregate_files, 550 ignore_metadata_file=ignore_metadata_file, 551 metadata_task_size=metadata_task_size, 552 parquet_file_extension=parquet_file_extension, 553 dataset=dataset_options, 554 read=read_options, 555 **other_options, 556 ) 558 # In the future, we may want to give the engine the 559 # option to return a dedicated element for `common_kwargs`. 560 # However, to avoid breaking the API, we just embed this 561 # data in the first element of `parts` for now. 562 # The logic below is inteded to handle backward and forward 563 # compatibility with a user-defined engine. File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask_geopandas/io/parquet.py:57, in GeoArrowEngine.read_metadata(cls, fs, paths, **kwargs) 55 @classmethod 56 def read_metadata(cls, fs, paths, **kwargs): ---> 57 meta, stats, parts, index = super().read_metadata(fs, paths, **kwargs) 59 gather_spatial_partitions = kwargs.pop("gather_spatial_partitions", True) File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask/dataframe/io/parquet/arrow.py:549, in ArrowDatasetEngine.read_metadata(cls, fs, paths, categories, index, use_nullable_dtypes, dtype_backend, gather_statistics, filters, split_row_groups, blocksize, aggregate_files, ignore_metadata_file, metadata_task_size, parquet_file_extension, **kwargs) 548 # Stage 2: Generate output `meta` --> 549 meta = cls._create_dd_meta(dataset_info) 551 # Stage 3: Generate parts and stats File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask_geopandas/io/parquet.py:103, in GeoArrowEngine._create_dd_meta(cls, dataset_info, use_nullable_dtypes) 99 raise ValueError( 100 "No dataset parts discovered. Use dask.dataframe.read_parquet " 101 "to read it as an empty DataFrame" 102 ) --> 103 meta = cls._update_meta(meta, schema) 104 return meta File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask_geopandas/io/parquet.py:77, in GeoArrowEngine._update_meta(cls, meta, schema) 74 """ 75 Convert meta to a GeoDataFrame and update with potential GEO metadata 76 """ ---> 77 return _update_meta_to_geodataframe(meta, schema.metadata) File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask_geopandas/io/arrow.py:36, in _update_meta_to_geodataframe(meta, schema_metadata) 35 geometry_column_name = geo_meta["primary_column"] ---> 36 crs = geo_meta["columns"][geometry_column_name]["crs"] 37 geometry_columns = geo_meta["columns"] KeyError: 'crs' The above exception was the direct cause of the following exception: KeyError Traceback (most recent call last) Cell In[36], line 6 3 with fsspec.open(href, mode="rb", **storage_options) as f: 4 gpd.read_parquet(f) ----> 6 dask_geopandas.read_parquet(href, storage_options=storage_options) File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask_geopandas/io/parquet.py:112, in read_parquet(*args, **kwargs) 111 def read_parquet(*args, **kwargs): --> 112 result = dd.read_parquet(*args, engine=GeoArrowEngine, **kwargs) 113 # check if spatial partitioning information was stored 114 spatial_partitions = result._meta.attrs.get("spatial_partitions", None) File ~/mambaforge/envs/jl-full/lib/python3.11/site-packages/dask/backends.py:138, in CreationDispatch.register_inplace..decorator..wrapper(*args, **kwargs) 136 return func(*args, **kwargs) 137 except Exception as e: --> 138 raise type(e)( 139 f"An error occurred while calling the {funcname(func)} " 140 f"method registered to the {self.backend} backend.\n" 141 f"Original Message: {e}" 142 ) from e KeyError: "An error occurred while calling the read_parquet method registered to the pandas backend.\nOriginal Message: 'crs'" ```
tschaub commented 3 months ago

It looks like you are running into an issue with dask-geopandas. The crs is optional in a GeoParquet geometry column. It looks like dask-geopandas assumes it will be present here https://github.com/geopandas/dask-geopandas/blob/3489a1cbafbeda3c0d4493133112969268e58d66/dask_geopandas/io/arrow.py#L36

I think this is the same issue https://github.com/geopandas/dask-geopandas/issues/270