shawnbrown / datatest

Tools for test driven data-wrangling and data validation.

NaT issue #55

Open Belightar opened 3 years ago

Belightar commented 3 years ago

Greetings, @shawnbrown

To be brief,

My pd.Series looks like this:

Date
0          NaT
1          NaT
2          NaT
3   2010-12-31
4   2010-12-31
Name: Date, dtype: datetime64[ns]

The type of NaT is <class 'pandas._libs.tslibs.nattype.NaTType'>. When I use the following code:

with accepted(Extra(pd.NaT)):
    validate(data, requirement)

I found that the NaTs are not recognized. I tried many kinds of Extra and also tried using a function, but all of them failed.

I need your help here. Thanks for your work.

shawnbrown commented 3 years ago

Hello--thanks for filing this issue. I'd like to replicate your problem as accurately as I can before I start addressing the issue.

I have some sample code below but I'm not sure what you're using as the requirement:

from datetime import datetime
import pandas as pd
from datatest import validate

data = pd.Series([
    None,
    None,
    None,
    datetime(2010, 12, 31),
    datetime(2010, 12, 31),
])

requirement = ???  # <- What is this?
validate(data, requirement)

Can you tell me what your requirement value is?

Belightar commented 3 years ago

Thanks for your reply.

from datetime import datetime, timedelta
import pandas as pd
from datatest import validate, accepted, Extra

data = pd.Series([
    None,
    None,
    None,
    datetime(2010, 12, 31),
    datetime(2010, 12, 31),
])

Today = datetime.today()
Tomorrow = Today + timedelta(days=1)

def date_requirement(var_datetime):
    return pd.Timestamp(year=2000, month=1, day=1) < var_datetime < \
            pd.Timestamp(year=Tomorrow.year, month=Tomorrow.month, day=Tomorrow.day)

with accepted(Extra(pd.NaT)):
    validate(data, date_requirement)

Here I want to accept the NaT values. I tried pd.NaT, np.datetime64('NaT'), and the NanToken approach mentioned in the documentation, and the results are the same:

datatest.ValidationError: does not satisfy date_requirement() (3 differences): [
    Invalid(numpy.datetime64('NaT')),
    Invalid(numpy.datetime64('NaT')),
    Invalid(numpy.datetime64('NaT')),
]

shawnbrown commented 3 years ago

Ah, OK. As a stopgap, you can use the accepted.args() method together with the pd.isna() function:

...

with accepted.args(pd.isna):
    validate(data, date_requirement)

The accepted.args() method accepts differences whose args satisfy a given predicate. And by using pd.isna() as the predicate, you can accept differences that contain NaT, NaN, or other "missing value" objects.
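To illustrate why pd.isna() is a convenient predicate here, the sketch below checks a few missing-value markers of the kinds mentioned in this thread. It uses only pandas and NumPy, not datatest, and the variable names are made up for illustration. It also shows that NaT and NaN do not compare equal to themselves, which may be part of why matching on pd.NaT directly is awkward (an assumption; the thread does not state the cause).

```python
import numpy as np
import pandas as pd

# Missing-value markers of the kinds mentioned in this thread.
missing_values = [pd.NaT, np.datetime64('NaT'), float('nan'), None]

# pd.isna() reports every one of them as missing, so a single
# predicate covers all of these cases.
isna_results = [bool(pd.isna(value)) for value in missing_values]

# NaT and NaN do not compare equal to themselves, so a plain
# equality check cannot be used to recognize them.
nat_equals_itself = bool(pd.NaT == pd.NaT)
nan_equals_itself = float('nan') == float('nan')

# A real date is not reported as missing.
real_date_is_missing = bool(
    pd.isna(pd.Timestamp(year=2010, month=12, day=31))
)

print(isna_results, nat_equals_itself, nan_equals_itself, real_date_is_missing)
```

Because the predicate only looks at whether a value is missing, it does not depend on which of these NaT or NaN representations ends up inside the difference object.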

For a longer-term solution, I want to bring the handling of these NaT values in line with how datatest handles other NaN values (as documented here). I will follow up when I have addressed this issue more thoroughly.

Belightar commented 3 years ago

Thank you so much. Your code works well in my project. And yes, I also used pd.isna to check whether a value is pd.NaT or not. (Is this the only way?) I simply dropped those rows and then ran datatest. I've used Python and programmed for 3 years and hadn't realized there are differences among bool, np.bool_ or pd.NaT, pd.Nan, np.nan, nan before. I've learnt a lot from your work, and thanks again for your patience.
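For reference, the drop-the-rows approach described above might look roughly like the following pandas-only sketch. It is not the code from the project in question: the names cleaned and all_in_range are illustrative, the explicit pd.to_datetime() call is an assumption added to make the dtype unambiguous, and the range check is done with plain pandas rather than with datatest.

```python
from datetime import datetime, timedelta

import pandas as pd

# Same sample data as in the thread; pd.to_datetime() makes the
# datetime64 dtype explicit so the None entries become NaT.
data = pd.to_datetime(pd.Series([
    None,
    None,
    None,
    datetime(2010, 12, 31),
    datetime(2010, 12, 31),
]))

begin_date = pd.Timestamp(year=2000, month=1, day=1)
tomorrow = pd.Timestamp(datetime.today() + timedelta(days=1))


def date_requirement(value):
    """Return True when value falls strictly inside the allowed range."""
    return begin_date < value < tomorrow


# Drop the NaT rows first, then check only the remaining dates.
cleaned = data.dropna()
all_in_range = bool(cleaned.map(date_requirement).all())
```

With this approach the missing values never reach the validation step at all, so they will not show up in any report of differences.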

shawnbrown commented 3 years ago

I'm glad you found it helpful. I noticed that your date_requirement() function is checking for an interval. If it suits your needs, you could also use the validate.interval() method:

...

begin_date = pd.Timestamp(year=2000, month=1, day=1)
tomorrow = pd.Timestamp(datetime.today() + timedelta(days=1))

with accepted.args(pd.isna):
    validate.interval(data, begin_date, tomorrow)

One difference with this approach is that time differences trigger Deviation objects that contain a timedelta. There are some how-to documents for date handling that you might find helpful as well: