unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License
3.27k stars 305 forks source link

SchemaModel.validate sample, head, tail only work with hashable column types #682

Closed gordonhart closed 2 years ago

gordonhart commented 2 years ago

The SchemaModel.validate arguments sample, head, and tail only work with hashable column types. Sometimes it's desirable to put unhashable types (like numpy.ndarray) into the columns of a DataFrame. The presence of such a column shouldn't break schema validation.

Code Sample, a copy-pastable example

import numpy as np
import pandas as pd
import pandera as pa

class ExampleSchema(pa.SchemaModel):
    unhashable: pa.typing.Series[pa.Object] = pa.Field()

df = pd.DataFrame(dict(unhashable=[np.random.rand(4) for _ in range(10)]))

print("validating without subsampling...")
ExampleSchema.validate(df)

print("validating with 'sample'...")
try:
    ExampleSchema.validate(df, sample=1)
    raise RuntimeError("unreachable")
except TypeError as e:
    print(f"    expected exception thrown: {e}")

Output:

validating without subsampling...
validating with 'sample'...
    expected exception thrown: unhashable type: 'numpy.ndarray'

Expected behavior

Subsampling should work when unhashable types are present as columns.

Additional Comments

pandera is a great module. Willing to look into this when I can find the time.

cosmicBboy commented 2 years ago

thanks for finding this bug @gordonhart ! I added a comment to your PR

gordonhart commented 2 years ago

683 merged, closing.