shawnbrown / datatest

Tools for test driven data-wrangling and data validation.

validation errors Extra(nan) or Invalid(nan) #49

Closed upretip closed 5 years ago

upretip commented 5 years ago

Shawn, I am trying your package to see if I can validate a CSV file by reading it with pandas. I am getting Extra(nan) from dt.validate.superset() or Invalid(nan) from dt.validate(). Is there a way I can include those NaN values in my validation sets?

The error looks like this:

E     ValidationError: may contain only elements of given superset (10000 differences): [
            Extra(nan),
            Extra(nan),
            Extra(nan),

Note: I am reading this particular column as str

E       ValidationError: does not satisfy 'str' (10000 differences): [
            Invalid(nan),
            Invalid(nan),
            Invalid(nan),
            Invalid(nan),

Let me know if you find a solution or can help me debug.

shawnbrown commented 5 years ago

I've been looking at this closely and discovered a handful of unhandled corner cases related to NaN values. Until I get this sorted, NaN values will have to be handled using a workaround, e.g., using the fillna() method to replace them with a proxy value.
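For context on why NaN is awkward in set-based checks, here is a minimal sketch in plain Python (no datatest or pandas involved). It only illustrates general floating-point behavior and is not a description of datatest's internals: NaN never compares equal to anything, including itself, so membership tests succeed only when the very same object is present.

```python
import math

nan_a = float("nan")
nan_b = float("nan")  # a second, distinct NaN object

# NaN never compares equal, not even to itself.
self_equal = nan_a == nan_a

# Containers check identity before equality, so the same object is found...
same_object_found = nan_a in {nan_a}

# ...but a different NaN object is not.
other_object_found = nan_b in {nan_a}

# math.isnan() is the reliable way to detect a NaN value.
detected = math.isnan(nan_b)

print(self_equal, same_object_found, other_object_found, detected)
```

This is likely why differences such as Extra(nan) and Invalid(nan) can show up even when NaN appears to be part of the requirement.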

As a stopgap, you could do the following:

from datatest import validate, accepted, Invalid

NAN = object()  # unique sentinel used as a proxy for NaN

# Include NAN in the validation set.
data = df['A'].fillna(NAN)
validate.superset(data, {'x', 'y', 'z', NAN})

# Accept NAN as a difference.
data = df['A'].fillna(NAN)
with accepted(Invalid(NAN)):
    validate(data, str)
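The idea behind the proxy value can be sketched without datatest or pandas. The helpers `replace_nan` and `extras` below are hypothetical names made up for this illustration, not part of any library; they stand in for `fillna()` and a superset check, respectively.

```python
import math

NAN = object()  # unique sentinel, same idea as in the stopgap above

def replace_nan(values, proxy):
    """Return a list with every float NaN replaced by `proxy` (hypothetical helper)."""
    return [proxy if isinstance(v, float) and math.isnan(v) else v for v in values]

def extras(values, superset):
    """Return the elements of `values` that are not members of `superset` (hypothetical helper)."""
    return [v for v in values if v not in superset]

column = ["x", float("nan"), "y", float("nan"), "w"]
allowed = {"x", "y", "z", NAN}

# Putting a NaN in the superset does not help: each NaN in the column
# is a different object and never compares equal to it.
raw_extras = extras(column, allowed | {float("nan")})

# After swapping NaN for the sentinel, only the genuinely unexpected value remains.
clean_extras = extras(replace_nan(column, NAN), allowed)

print(len(raw_extras), clean_extras)
```

Because the sentinel is an ordinary object that compares equal to itself, it behaves predictably in sets, which is what makes it usable as a stand-in until NaN is supported directly.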

Going forward, I will file a related issue/bug for this with the goal of allowing the use of NaN values directly:

# Include NaN in the validation set.
validate.superset(data, {'x', 'y', 'z', np.nan})

# Accept NaN as a difference.
with accepted(Invalid(np.nan)):
    validate(data, str)

I'll post a follow-up to this issue once I have patched this behavior.

upretip commented 5 years ago

Thanks. I will follow this.

shawnbrown commented 5 years ago

This is done:

ce71b345: Update predicate handling to better support NaN values.
bee6aa84: Add NaN handling idioms to test_usecases.py.
32d3bb93: Add test_numbers_equal() to verify numeric comparison.
e8435b15: Update difference behavior to support tuples containing NaNs.
c962e04c: Change RequiredInterval to fail if arguments are NaN.
c78f390c: Fix RequiredInterval to properly handle NaN differences.
4995510d: Update NaN use cases to highlight recommended pattern.
fa2646ef: Add how-to documentation for working with NaN values.

shawnbrown commented 5 years ago

@upretip, I've just pushed some new "how to" docs that give detail regarding NaN validation and behavior. You can view it in the latest docs here:

How to Deal With NaN Values https://datatest.readthedocs.io/en/latest/how-to/nan-values.html

upretip commented 5 years ago

Thanks for the help. Closing this issue now!