unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

How to properly extend pandas_engine.DataType to support geopandas #693

Open roshcagra opened 2 years ago

roshcagra commented 2 years ago

How to properly extend pandas_engine.DataType

Hi! I'm trying to use pandera with GeoPandas - I think all I should need to do to make it work is add support for the geometry column by registering BaseGeometry as a DataType. However, I'm struggling to get it to work - any suggestions?

import dataclasses

import shapely.geometry.base
from pandera import SchemaModel, dtypes, Field
from pandera.engines import pandas_engine
from pandera.engines.pandas_engine import DataType
from pandera.typing import Series, DataFrame
from shapely.geometry import box

@pandas_engine.Engine.register_dtype
@dtypes.immutable
class BaseGeometry(DataType):
    type: shapely.geometry.base.BaseGeometry = dataclasses.field(default=None, init=False)

class GeoDataFrameSchema(SchemaModel):
    geometry: Series[BaseGeometry] = Field()

df = DataFrame[GeoDataFrameSchema]({'geometry': [box(0, 0, 100, 100), box(200, 0, 400, 200)]})

  File "typed_dataframes.py", line 21, in <module>
    df = DataFrame[GeoDataFrameSchema]({'geometry': [box(0, 0, 100, 100), box(200, 0, 400, 200)]})
  File "/usr/local/Cellar/python@3.8/3.8.12_1/Frameworks/Python.framework/Versions/3.8/lib/python3.8/typing.py", line 731, in __call__
    result.__orig_class__ = self
  File "/venv/lib/python3.8/site-packages/pandera/typing/common.py", line 118, in __setattr__
    self = schema_model.validate(self)
  File "/venv/lib/python3.8/site-packages/pandera/model.py", line 261, in validate
    cls.to_schema().validate(
  File "/venv/lib/python3.8/site-packages/pandera/schemas.py", line 485, in validate
    return self._validate(
  File "/venv/lib/python3.8/site-packages/pandera/schemas.py", line 659, in _validate
    error_handler.collect_error("schema_component_check", err)
  File "/venv/lib/python3.8/site-packages/pandera/error_handlers.py", line 32, in collect_error
    raise schema_error from original_exc
  File "/venv/lib/python3.8/site-packages/pandera/schemas.py", line 651, in _validate
    result = schema_component(
  File "/venv/lib/python3.8/site-packages/pandera/schemas.py", line 1986, in __call__
    return self.validate(
  File "/venv/lib/python3.8/site-packages/pandera/schema_components.py", line 223, in validate
    validate_column(check_obj, column_name)
  File "/venv/lib/python3.8/site-packages/pandera/schema_components.py", line 196, in validate_column
    super(Column, copy(self).set_name(column_name)).validate(
  File "/venv/lib/python3.8/site-packages/pandera/schemas.py", line 1919, in validate
    error_handler.collect_error(
  File "/venv/lib/python3.8/site-packages/pandera/error_handlers.py", line 32, in collect_error
    raise schema_error from original_exc
pandera.errors.SchemaError: expected series 'geometry' to have type None, got object

jeffzi commented 2 years ago

Hi @roshcagra, interesting question!

pandera.errors.SchemaError: expected series 'geometry' to have type None, got object

It's a little confusing to trace back the source of the error but here is the gist of it. Your BaseGeometry.type is None by default and your SchemaModel does not provide a type. During validation, pandera will call BaseGeometry.check(), inherited from pandas_engine.DataType, which leverages the type argument. The type is supposed to be something understood by pandas. You can pass arguments to the dtype with the following syntax:

class GeoDataFrameSchema(SchemaModel):
    geometry: Series[BaseGeometry] = Field(dtype_kwargs={"type": APPROPRIATE_TYPE}) # APPROPRIATE_TYPE = ?
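
To make the "type None" part of the message easier to see, here is a toy stand-in that uses only the standard library. ToyDataType, ToyGeometry and the check method are hypothetical names that only mimic the idea of comparing a stored type against the observed dtype; this is a sketch of the mechanism, not pandera's actual implementation.

```python
import dataclasses
from typing import Any


@dataclasses.dataclass(frozen=True)
class ToyDataType:
    """Hypothetical stand-in for a dtype wrapper; not pandera's real class."""

    # Mirrors the snippet in the question: the field defaults to None and,
    # because init=False, it cannot be set through the constructor.
    type: Any = dataclasses.field(default=None, init=False)

    def check(self, observed: Any) -> bool:
        # The comparison is made against whatever `type` currently holds.
        return self.type == observed

    def __str__(self) -> str:
        return str(self.type)


@dataclasses.dataclass(frozen=True)
class ToyGeometry(ToyDataType):
    # Giving the field a concrete default lets the comparison succeed.
    type: Any = dataclasses.field(default="geometry", init=False)


broken = ToyDataType()
fixed = ToyGeometry()

print(f"expected type {broken}, got object -> {broken.check('object')}")
print(f"expected type {fixed}, got geometry -> {fixed.check('geometry')}")
```

With the default left at None, every comparison fails and the rendered name is "None", which matches the shape of the error in the traceback.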

I'm not familiar with geopandas or shapely, and I couldn't find any use of subclasses of shapely.geometry.base.BaseGeometry in the geopandas getting-started tutorial. Geopandas seems to have only a single dtype, GeometryDtype. I would use it to benefit from the work that has already been done by geopandas.

import geopandas
import pandera as pa
from pandera.engines import pandas_engine

@pandas_engine.Engine.register_dtype(
    equivalents=[  # Let pandera know how to translate this data type from other objects
        "geometry",
        geopandas.array.GeometryDtype,
        geopandas.array.GeometryDtype(),
    ]
)
@pa.dtypes.immutable
class Geometry(pandas_engine.DataType):
    type = geopandas.array.GeometryDtype()

class GeoDataFrameSchema(pa.SchemaModel):
    geometry: pa.typing.Series[Geometry]
    BoroCode: pa.typing.Series[Geometry]  # should fail (contains int)
    BoroName: pa.typing.Series[Geometry]  # should fail (contains object)

gdf = geopandas.read_file(geopandas.datasets.get_path("nybb"))
gdf.info()
#> <class 'geopandas.geodataframe.GeoDataFrame'>
#> RangeIndex: 5 entries, 0 to 4
#> Data columns (total 5 columns):
#>  #   Column      Non-Null Count  Dtype   
#> ---  ------      --------------  -----   
#>  0   BoroCode    5 non-null      int64   
#>  1   BoroName    5 non-null      object  
#>  2   Shape_Leng  5 non-null      float64 
#>  3   Shape_Area  5 non-null      float64 
#>  4   geometry    5 non-null      geometry
#> dtypes: float64(2), geometry(1), int64(1), object(1)
#> memory usage: 328.0+ bytes

# verify that pandera recognizes geometry
print(repr(pandas_engine.Engine.dtype(gdf["geometry"].dtype)))
#> DataType(geometry)

GeoDataFrameSchema.validate(gdf, lazy=True)
#> Traceback (most recent call last):
#> ...
#> SchemaErrors: A total of 2 schema errors were found.
#> Error Counts
#> - schema_component_check: 2
#> Schema Error Summary
#>                                           failure_cases  n_failure_cases
#> schema_context column   check                                           
#> Column         BoroCode dtype('geometry')       [int64]                1
#>                BoroName dtype('geometry')      [object]                1
#> Usage Tip
#> Directly inspect all errors by catching the exception:
#> ```
#> try:
#>     schema.validate(dataframe, lazy=True)
#> except SchemaErrors as err:
#>     err.failure_cases  # dataframe of schema errors
#>     err.data  # invalid dataframe
#> ```

^ The draft above is not thoroughly tested but works in this basic example.

@roshcagra Would you be interested in extending this snippet and contributing proper geopandas support? I'm sure other geopandas users would benefit from schema validation!

roshcagra commented 2 years ago

Thank you so much @jeffzi!

I would love to expand on this and contribute. What would be the best way to do that? Maybe build another package that depends on pandera and adds support for geopandas? I'm guessing we wouldn't want to add it directly here because you wouldn't want to take geopandas and shapely on as dependencies.

jeffzi commented 2 years ago

I would love to expand on this and contribute.

Awesome, thanks!

I think a module inside the core pandera repo should suffice. We have done this before by adding optional dependencies. See setup.py and the strategies module, which requires hypothesis to function.

Besides the GeometryDtype, are there GeoDataFrame specificities that are relevant to schema validation? If it's only the dtype, we could have the class in engines.pandas_engine and tests in tests/geopandas/test_geopandas.py (easier for CI to install appropriate dependencies).

Pinging @cosmicBboy to confirm the approach.

cosmicBboy commented 2 years ago

thanks for your help @roshcagra!

yes the approach described by @jeffzi is the way to go.

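Out of curiosity, I have a few questions (as someone who hasn't used geopandas before):

is there any meaningful way in which a GeometryDtype is coerced from some other raw format, for e.g. as "1" -> int("1") as "some raw value" -> GeometryDtype("some raw value")

does this operation happen geodataframe.astype({"geometry": GeometryDtype}) as a user of the library

I also see there are specific data types in shapely like Point and Polygon... would it make sense to have those as types as well?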

I think for now a GeometryDtype that doesn't do any type coercion and simply does a type check would be a good first pass for geopandas support. Then, any additional checks on the geometry dtype column would be done via custom checks.

Let us know if you have any other questions re: contributing!

m-richards commented 2 years ago

Hey @cosmicBboy, I'm a contributor to geopandas and coincidentally just started looking at using geopandas with pandera today. But I might be able to give some clarity on those questions (sorry this is a longer write up than I thought it would be).

is there any meaningful way in which a GeometryDtype is coerced from some other raw format, for e.g. as "1" -> int("1") as "some raw value" -> GeometryDtype("some raw value")

First, there are Well-Known Text (WKT) and Well-Known Binary (WKB), which I suppose are analogues of raw formats, but these are not coerced with astype or convert_dtypes. Instead they're covered by classmethods such as geopandas.GeoSeries.from_wkt, where GeoSeries is the geopandas subclass of pandas.Series for geometry data.
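
As a small illustration of that classmethod, the sketch below parses a couple of WKT strings into a GeoSeries. The sample strings and the EPSG:4326 CRS are made up for the example, and it assumes a geopandas version that provides GeoSeries.from_wkt.

```python
import geopandas as gpd
import pandas as pd

# Raw WKT strings, e.g. as they might arrive from a CSV column.
raw = pd.Series(["POINT (1 1)", "POINT (2 2)"])

# from_wkt parses each string into a shapely geometry and returns a GeoSeries.
parsed = gpd.GeoSeries.from_wkt(raw, crs="EPSG:4326")

print(type(parsed).__name__)
print(parsed.dtype)
print(parsed.geom_type.tolist())
```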

For the second point,

does this operation happen geodataframe.astype({"geometry": GeometryDtype}) as a user of the library

there is a case of casting with astype like this, with an array-like of shapely geometries (and potentially also pygeos geometries, but that's tangential):

In [1]: import geopandas as gpd
   ...: from shapely.geometry import Point
In [2]: gdf = gpd.GeoDataFrame({'foo':[1,2], 'bar':[Point(1,1), Point(2,2)]}, geometry=[Point(1,1), Point(2,1)])
In [3]: gdf
Out[3]:
   foo          bar                 geometry
0    1  POINT (1 1)  POINT (1.00000 1.00000)
1    2  POINT (2 2)  POINT (2.00000 1.00000)

In [4]: gdf.dtypes
Out[4]:
foo            int64
bar           object
geometry    geometry
dtype: object

'bar' has object dtype (geometry has been converted properly because it is the designated "geometry column", which has special casting checks applied to it), which can be fixed with astype:

In [5]: gdf.astype({'bar':'geometry'})
Out[5]:
   foo                      bar                 geometry
0    1  POINT (1.00000 1.00000)  POINT (1.00000 1.00000)
1    2  POINT (2.00000 2.00000)  POINT (2.00000 1.00000)

In [6]: gdf.astype({'bar':'geometry'}).dtypes
Out[6]:
foo            int64
bar         geometry
geometry    geometry
dtype: object
In [7]: gdf.astype({'bar':gpd.array.GeometryDtype()}).dtypes
Out[7]:
foo            int64
bar         geometry
geometry    geometry
dtype: object

The other important thing this does is convert 'bar' from being a Series to a GeoSeries which has properties like e.g. area.

In [8]: type(gdf['bar'])
Out[8]: pandas.core.series.Series
In [9]: type(gdf.astype({'bar':gpd.array.GeometryDtype()})['bar'])
Out[9]: geopandas.geoseries.GeoSeries

So this can happen as a user of the library; it is possible, but not exactly common. Usually one is better off doing something like this:

srs = gpd.GeoSeries([Point(1,1), Point(2,2)], crs='epsg:4326')
gdf = gpd.GeoDataFrame({'foo':[1,2], 'bar':srs}, geometry=[Point(1,1), Point(2,1)])

The advantage of this is that bar can be given a coordinate reference system (CRS), which encodes information about the projection of geometry on the earth's surface to a Cartesian plane, e.g. for area and distance calculations.

I also see there are specific data types in shapely like Point and Polygon... would it make sense to have those as types as well?

Geopandas can store Points, Polygons, MultiPoints, ... all in the same GeoSeries with the same extension array dtype, GeometryDtype. Perhaps there is value in validating those types explicitly for certain workloads. There is GeoSeries.geom_type, which returns a Series of strings giving the geometry type of each row (Point, Polygon, MultiPoint, ...), but I've never needed to do this - I didn't really know that attribute existed until writing this. Usually the limiting factor here is the underlying geometry data source: most GIS file formats only support geometry columns of homogeneous types, so validating this on the geopandas side doesn't tend to come up.
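
For anyone who does want to validate geometry types explicitly, a minimal sketch using geom_type could look like the following. The helper name has_only is hypothetical; a boolean function like this is the kind of thing that could later be wrapped in a custom check, but that wiring is not shown here.

```python
import geopandas as gpd
from shapely.geometry import LineString, Point


def has_only(series: gpd.GeoSeries, geom_type: str) -> bool:
    """Return True if every geometry in the series has the given type name."""
    return bool((series.geom_type == geom_type).all())


points = gpd.GeoSeries([Point(1, 1), Point(2, 2)])
mixed = gpd.GeoSeries([Point(1, 1), LineString([(0, 0), (1, 1)])])

print(has_only(points, "Point"))
print(has_only(mixed, "Point"))
```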

I'd be quite keen to see this in pandera, happy to help if I can - @roshcagra seems keen to get started so I won't duplicate effort there.

Also, just wanted to say that pandera has been a really useful tool, thanks for developing and improving it.

m-richards commented 2 years ago

Just adding on to the above question by @jeffzi

Besides the GeometryDtype, are there GeoDataFrame specificities that are relevant to schema validation?

There are two special NDFrame._metadata fields, _metadata = ["_crs", "_geometry_column_name"], which perhaps could warrant special handling (but to be honest I don't really know how that would work on the pandera side). I also don't feel it's essential: for my use case, I want to specify a geometry column in my schema and that's probably enough (although knowing the crs isn't None would be nice).
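
Both pieces of metadata can also be read through public accessors, which is probably what any validation would rely on. The sketch below uses a made-up two-row frame and an assumed EPSG:4326 CRS.

```python
import geopandas as gpd
from shapely.geometry import Point

gdf = gpd.GeoDataFrame(
    {"foo": [1, 2]},
    geometry=[Point(1, 1), Point(2, 1)],
    crs="EPSG:4326",
)

# Name of the active geometry column.
geometry_name = gdf.geometry.name

# CRS attached to the frame; it is None when no CRS was ever set.
has_crs = gdf.crs is not None

print(geometry_name, has_crs)
```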

cosmicBboy commented 2 years ago

Thanks for the detailed analysis @m-richards, and I'm glad you're finding pandera useful!

It seems like GeoSeries and GeoDataFrame already do a lot of heavy lifting in terms of checking types of GeometryDtype arrays, and adding support for GeometryDtype, which is @roshcagra's use case, would cover a majority of the type-checking use cases.

Once there's a pandera.Geometry data type, adding a coerce method that does astype("geometry") would be pretty straightforward.

Since pandera allows for parameterized dtypes some nice future work would be to support this kind of syntax:

Geometry  # array can contain any type of geometry
Geometry("Point", crs="epsg:4326")  # only points of a specific crs
Geometry("Polygon", crs=...)  # only polygons
Geometry("Multipoint", crs=...)  # only multipoints

# - or specific dtype classes -

Geometry
Point(crs=...)
Polygon(crs=...)
Multipoint(crs=...)

And then use custom pa.Checks / registering custom checks to implement more specific validation rules.

Let me know if you have any more thoughts @roshcagra @jeffzi @m-richards !

roshcagra commented 2 years ago

@cosmicBboy @jeffzi @m-richards

This is my first pass:

from typing import Union

import pandas as pd
import geopandas as gpd
import pandera as pa
from pandera.engines import pandas_engine
from pandera.typing import DataFrame
from pandera.typing.common import SeriesBase
from pandera.typing.pandas import T

GeoPandasObject = Union[gpd.GeoSeries, pd.Index, gpd.GeoDataFrame]

@pandas_engine.Engine.register_dtype(
    equivalents=[  # Let pandera know how to translate this data type from other objects
        "geometry",
        gpd.array.GeometryDtype,
        gpd.array.GeometryDtype(),
    ]
)
@pa.dtypes.immutable
class Geometry(pandas_engine.DataType):
    type = gpd.array.GeometryDtype()

    def coerce(self, data_container: pd.Series) -> gpd.GeoSeries:
        return gpd.GeoSeries.from_wkt(data_container)

class GeoSeries(SeriesBase[gpd.array.GeometryDtype], gpd.GeoSeries):
    """Representation of geopandas.GeoSeries, only used for type annotation."""
    pass

class GeoDataFrame(DataFrame[T], gpd.GeoDataFrame):
    """Representation of geopandas.GeoDataFrame, only used for type annotation."""
    pass

Let me know what you think!

jeffzi commented 2 years ago

@m-richards Thanks, I appreciate you taking the time to explain in detail!

And then use custom pa.Checks / registering custom checks to implement more specific validation rules.

Yes, my question was about whether we'd need to add built-in checks to better support geopandas but it does not seem to be necessary.

Let me know if you have any more thoughts @roshcagra @jeffzi @m-richards !

It will be important to test that validate does return a GeoDataFrame if presented one, and preserves the geopandas metadata attributes kindly listed by @m-richards.

@roshcagra Looking good so far! Tbh, tests will reveal potential issues. You can add a mapping dtype: string alias in test_dtypes: https://github.com/pandera-dev/pandera/blob/7664092b020288b245071c18b07e1df356ae1515/tests/core/test_dtypes.py#L86-L92

and examples here https://github.com/pandera-dev/pandera/blob/7664092b020288b245071c18b07e1df356ae1515/tests/core/test_dtypes.py#L152-L153

That will add your new Geometry data type to the test suite. test_dtypes.py is rather complicated, don't hesitate to let me know if you need help at any point.

I'm curious about SeriesBase, is that a rename of pandera.typing.Series?

@cosmicBboy We should probably factor out the basic data type tests to facilitate testing of new data types and even use for koalas/modin testing.

roshcagra commented 2 years ago

@jeffzi @cosmicBboy PR for this up here: https://github.com/pandera-dev/pandera/pull/698