unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Config option `strict = "filter"` does not work on spark dataframes #996

Closed nwoodbury closed 1 year ago

nwoodbury commented 1 year ago

Describe the bug

When using a SchemaModel on a pyspark dataframe with the config option strict = "filter" set, a TypeError: drop() got an unexpected keyword argument 'inplace' is raised.

Code Sample, a copy-pastable example

from pandera import Field, SchemaModel
from pandera.typing.pyspark import DataFrame, Series
from pyspark.sql import SparkSession

class DemoSchema(SchemaModel):
    col: Series[str] = Field()

    class Config:
        strict = "filter"

spark = SparkSession.builder.appName("demo").getOrCreate()
columns = ["col", "toDrop"]
data = [("a", "b"), ("c", "d")]
sdf = spark.createDataFrame(data).toDF(*columns)

# Validation runs in the typed-DataFrame constructor; with strict = "filter"
# this raises TypeError: drop() got an unexpected keyword argument 'inplace'
validated = DataFrame[DemoSchema](sdf)

Expected behavior

The code should execute without error, with validated being a pandas-on-spark dataframe with "col" as its only column.

Additional context

The offending line is line 636 of pandera/schemas.py: check_obj.drop(labels=filter_out_columns, inplace=True, axis=1). The problem is that the pandas-on-spark drop() method does not accept an inplace argument, which raises the TypeError. A possible general-purpose fix would be to replace the line with check_obj = check_obj.drop(labels=filter_out_columns, axis=1); however, this may have unintended consequences, such as reduced efficiency on pandas dataframes, since the reassignment creates a new object instead of mutating in place.
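To make the incompatibility concrete, here is a minimal standalone sketch, independent of pandera (it assumes pyspark >= 3.2, whose pyspark.pandas module provides the pandas-on-spark API):

import pandas as pd
import pyspark.pandas as ps

# pandas supports dropping columns in place:
pdf = pd.DataFrame({"col": ["a", "c"], "toDrop": ["b", "d"]})
pdf.drop(labels=["toDrop"], axis=1, inplace=True)  # OK

# pandas-on-spark does not accept the inplace keyword:
psdf = ps.DataFrame({"col": ["a", "c"], "toDrop": ["b", "d"]})
# psdf.drop(labels=["toDrop"], axis=1, inplace=True)  # TypeError
psdf = psdf.drop(labels=["toDrop"], axis=1)  # reassignment works on both backends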

cosmicBboy commented 1 year ago

hi @nwoodbury, I'd be open to supporting the inplace=False invocation of the drop method.

fix might be to replace the line with check_obj = check_obj.drop(labels=filter_out_columns, axis=1),

would you be open to making that change? It would also be useful to see some benchmarking on pandas dataframes to understand what the impact would be.
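For what it's worth, a rough way to measure that impact might look like the following timeit sketch (illustrative only, not an established pandera benchmark; the dataframe shape and column names are made up):

import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1_000_000, 10),
                  columns=[f"c{i}" for i in range(10)])

# Each call copies the frame first so both variants start from identical state.
inplace = timeit.timeit(lambda: df.copy().drop(labels=["c9"], axis=1, inplace=True), number=20)
reassign = timeit.timeit(lambda: df.copy().drop(labels=["c9"], axis=1), number=20)
print(f"inplace: {inplace:.3f}s, reassignment: {reassign:.3f}s")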

In the worst case, we can special-case pyspark, similar to how it's done here.
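A hypothetical sketch of that special-casing (the helper name _drop_filtered_columns is invented here for illustration and is not pandera's actual code):

def _drop_filtered_columns(check_obj, filter_out_columns):
    # Hypothetical helper: drop columns in place where the backend allows it.
    try:
        import pyspark.pandas as ps
        is_spark_obj = isinstance(check_obj, ps.DataFrame)
    except ImportError:
        is_spark_obj = False
    if is_spark_obj:
        # pandas-on-spark drop() has no inplace argument, so reassign instead.
        return check_obj.drop(labels=filter_out_columns, axis=1)
    # Plain pandas: keep the in-place drop and avoid an extra copy.
    check_obj.drop(labels=filter_out_columns, axis=1, inplace=True)
    return check_obj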

nwoodbury commented 1 year ago

@cosmicBboy sure, I'll take a crack at it. The special-case treatment might be ideal, allowing the library to maintain efficiency where possible.

nwoodbury commented 1 year ago

@cosmicBboy I've created Pull Request #1001 to resolve this issue.

Note, I am unable to run the full nox test suite locally due to versioning problems in conda. Running a conda install in a fresh build of the continuumio/miniconda3:latest Docker image, I get the following errors (only the last few lines are included here, since my terminal scrollback could not hold the full output):

...

Package scipy conflicts for:
geopandas -> mapclassify[version='>=2.4.0'] -> scipy[version='>=0.11|>=1.0']
scipy

Package glog conflicts for:
pyarrow -> glog
pyarrow -> arrow-cpp==9.0.0=py311h5fd143a_10_cpu -> glog[version='>=0.3.5,<0.3.6.0a0|>=0.4.0,<0.5.0a0|>=0.5.0,<0.6.0a0|>=0.6.0,<0.7.0a0|>=0.4.0,<1.0a0']

Package importlib_resources conflicts for:
pre_commit -> importlib_resources
hypothesis[version='>=5.41.1'] -> backports.zoneinfo[version='>=0.2.1'] -> importlib_resources
pre_commit -> virtualenv[version='>=15.2'] -> importlib_resources[version='>=1.0|>=1.0,<2']
frictionless -> jsonschema[version='>=2.5'] -> importlib_resources[version='>=1.4.0']
nox -> virtualenv[version='>=14.0.0'] -> importlib_resources[version='>=1.0|>=1.0,<2']

Package snappy conflicts for:
pyarrow -> snappy
pyarrow -> arrow-cpp==9.0.0=py311h5fd143a_10_cpu -> snappy[version='>=1.1.7,<2.0.0.0a0|>=1.1.8,<2.0a0|>=1.1.9,<2.0a0|>=1.1.7,<2.0a0']

Package pypy3.7 conflicts for:
twine -> pypy3.7[version='>=7.3.3']
twine -> keyring[version='>=15.1'] -> pypy3.7[version='7.3.*|7.3.3.*|7.3.4.*|7.3.5.*|7.3.7.*|>=7.3.5|>=7.3.7']

Package nomkl conflicts for:
scipy -> openblas[version='>=0.3.3,<0.3.4.0a0'] -> nomkl==3.0=0
pandas[version='>=1.2.0'] -> numexpr[version='>=2.7.1'] -> nomkl

The following specifications were found to be incompatible with your system:

  - feature:/linux-64::__glibc==2.31=0
  - feature:/linux-64::__unix==0=0
  - feature:|@/linux-64::__glibc==2.31=0
  - feature:|@/linux-64::__unix==0=0
  - asv -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']
  - black[version='>=22.1.0'] -> click[version='>=8.0.0'] -> __unix
  - black[version='>=22.1.0'] -> click[version='>=8.0.0'] -> __win
  - distributed -> click[version='>=6.6'] -> __unix
  - distributed -> click[version='>=6.6'] -> __win
  - frictionless -> click[version='>=6.6'] -> __unix
  - frictionless -> click[version='>=6.6'] -> __win
  - hypothesis[version='>=5.41.1'] -> click[version='>=7.0'] -> __unix
  - hypothesis[version='>=5.41.1'] -> click[version='>=7.0'] -> __win
  - modin -> ray-core[version='>=1.0'] -> __glibc[version='>=2.17,<3.0.a0']
  - mypy[version='<=0.971'] -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']
  - numpy[version='>=1.19.0'] -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']
  - pandas[version='>=1.2.0'] -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']
  - protobuf[version='<=3.20.3'] -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']
  - pyarrow -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17']
  - pydantic -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']
  - python=3.8 -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']
  - pyyaml[version='>=5.1'] -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']
  - scipy -> libgfortran-ng -> __glibc[version='>=2.17']
  - shapely -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']
  - uvicorn -> click[version='>=7.*'] -> __unix
  - uvicorn -> click[version='>=7.*'] -> __win
  - wrapt -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']

Your installed version is: not available

cosmicBboy commented 1 year ago

Note, I am unable to run the full nox test suite locally due to versioning problems in conda.

Thanks @nwoodbury, would you mind opening a bug report issue for this?

nwoodbury commented 1 year ago

@cosmicBboy done. See Issue #1002.

cosmicBboy commented 1 year ago

fixed by #1001