pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.61k stars 17.9k forks source link

Strange implicit string to datetime coercion during pd.Series comparisons #23234

Open AbdouSeck opened 6 years ago

AbdouSeck commented 6 years ago

I can see this being a feature for people that trust that their string formatted dates are easily parsable. But when this implicit coercion of string formatted dates to datetime objects is neither documented nor fully guaranteed to yield the right thing (because 2/1/2018 can be either %d/%m/%Y or %m/%d/%Y formatted), it's hard for many to realize what's going on here. This is evidenced by this stackoverflow question.

import pandas as pd

df = pd.DataFrame(
    {
        'A': pd.to_datetime(['2012-03-12', '2012-03-13', '2012-03-14', '2012-03-15']),
        'B': [1, 2, 3, 4],
        'C': [1.1, 2.2, 3.3, 4.4]
    }
)

# Stuff that works fine
df.A == '2013-11-23'
df.A == ['2013-10-10', '2013-11-30', '2013-10-13', '2013-11-23']

# Stuff that does not work (exceptions raised)
df.A == '2013-13-23'
df.A == ['2013-10-10', '2013-11-30', '2013-10-13', '2013-13-23']

Problem description

There are at least 3 issues here:

  1. The following piece code raises a combination of and TypeError and ValueError exceptions due to failure in converting one of the string into a datetime object:

    df.A == ['2013-13-23'] * len(df)

    Nowhere in that stream of tracebacks is it mentioned that '2013-13-23' is the bad data. If this feature is here to stay, it would be nice if the first data value to fail was reported with the raised exceptions.

  2. The second issue is one of documentation. For such an opinionated coercion of data, I was hoping to see some documentation about it either under pd.Series.eq or pd.DatetimeIndex.__eq__ (since pd.DatetimeIndex is what the right side gets converted to before the comparison is made). In fact, it inside the source code for pd.DatetimeIndex.__eq__ that I was able to find the line that carries out the conversion. And, it does look like the code is from the decorator of the comparison methods of pd.DatetimeIndex. The following is how I was able to get to it:

import inspect

import pandas as pd

print(inspect.getsource(pd.DatetimeIndex.__eq__))

Neither pd.Series.eq nor pd.Series.__eq__ nor pd.DatetimeIndex.__eq__ seems to mention anything about this implicit type coercion.

  1. The third issue is: if these coercions are being done between datetime and string objects, what's to stop someone from submitting a pull request that would make the following 2 lines pass:
df.B == '2'
df.C == '2.0'

?

Expected Output

TypeError: invalid type comparison

The same way df.B == '2' and df.C == '2.0', I was expecting df.A == '2013-11-23' to also raise a TypeError.

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.7.0.final.0 python-bits: 64 OS: Darwin OS-release: 17.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8 pandas: 0.23.3 pytest: 3.6.3 pip: 18.0 setuptools: 40.0.0 Cython: 0.28.4 numpy: 1.15.1 scipy: 1.1.0 pyarrow: None xarray: None IPython: 6.4.0 sphinx: 1.7.6 patsy: 0.5.0 dateutil: 2.7.3 pytz: 2018.5 blosc: None bottleneck: None tables: None numexpr: 2.6.5 feather: None matplotlib: 2.2.2 openpyxl: 2.5.4 xlrd: 1.1.0 xlwt: 1.3.0 xlsxwriter: 1.0.5 lxml: 4.2.3 bs4: 4.6.0 html5lib: 0.9999999 sqlalchemy: 1.2.8 pymysql: 0.9.2 psycopg2: 2.7.5 (dt dec pq3 ext lo64) jinja2: 2.10 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None

I am not familiar with the structure of the repo, but I am happy to drop some lines in the docstring of pd.Series.eq to indicate that type coercions can happen.

Thank you

mroeschke commented 6 years ago

Your examples do not raise any errors on master; was this fixed recently @jbrockmendel?

In [8]: pd.__version__
Out[8]: '0.24.0.dev0+780.g145c2275e'

In [9]: df.A == '2013-13-23'
   ...:
Out[9]:
0    False
1    False
2    False
3    False
Name: A, dtype: bool

In [10]: df.A == ['2013-10-10', '2013-11-30', '2013-10-13', '2013-13-23']
    ...:
Out[10]:
0    False
1    False
2    False
3    False
Name: A, dtype: bool
jbrockmendel commented 6 years ago

I'm not aware of any recent changes that would be relevant.

AbdouSeck commented 6 years ago

@mroeschke is that the expected and intended behavior in the next production iteration?

mroeschke commented 6 years ago

I don't think so. I am not the biggest can of the implicit coercion of these strings to datetimes, but given that it works for matching date strings I think you're right and this should raise a ValueError since 2013-13-23 cannot be parsed into a datetime.

And if the docs don't mention it already, it would be great to mention equality when comparing datetimes to date-like strings.

jbrockmendel commented 1 year ago

I think I'd be OK with deprecating the implicit coercion for both dt64 and Timestamp comparisons.