pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: read_csv returns object dtype for dates in empty frame #15524

Open adbull opened 7 years ago

adbull commented 7 years ago

Code Sample, a copy-pastable example if possible

>>> import io
>>> import pandas as pd

>>> csv = lambda: io.StringIO('date')
>>> df = pd.read_csv(csv(), dtype=int)
>>> df.dtypes

date    int64
dtype: object

>>> df = pd.read_csv(csv(), parse_dates=['date'])
>>> df.dtypes

date      object
dtype: object

>>> df.date.dt.tz_localize('UTC')

Traceback (most recent call last):
  File "<ipython-input-49-067c7ce27232>", line 1, in <module>
    df.date = df.date.dt.tz_localize('UTC')
  File "~/anaconda/lib/python3.5/site-packages/pandas/core/generic.py", line 2740, in __getattr__
    return object.__getattribute__(self, name)
  File "~/anaconda/lib/python3.5/site-packages/pandas/core/base.py", line 241, in __get__
    return self.construct_accessor(instance)
  File "~/anaconda/lib/python3.5/site-packages/pandas/core/series.py", line 2743, in _make_dt_accessor
    raise AttributeError("Can only use .dt accessor with datetimelike "
AttributeError: Can only use .dt accessor with datetimelike values

>>> parser = lambda x: pd.to_datetime(x, utc=True)
>>> df = pd.read_csv(csv(), parse_dates=['date'], date_parser=parser)
>>> df.dtypes

date      object
dtype: object

>>> df.date.dt.tz_convert('EST')

Traceback (most recent call last):
  File "<ipython-input-129-3d458af8513c>", line 1, in <module>
    df.date.dt.tz_convert('EST')
  File "~/anaconda/lib/python3.5/site-packages/pandas/core/generic.py", line 2740, in __getattr__
    return object.__getattribute__(self, name)
  File "~/anaconda/lib/python3.5/site-packages/pandas/core/base.py", line 241, in __get__
    return self.construct_accessor(instance)
  File "~/anaconda/lib/python3.5/site-packages/pandas/core/series.py", line 2743, in _make_dt_accessor
    raise AttributeError("Can only use .dt accessor with datetimelike "
AttributeError: Can only use .dt accessor with datetimelike values

Problem description

When reading CSVs with no data rows, read_csv() returns the dtype object for dates, which can raise errors on later manipulation. This is contrary to the general behaviour of read_csv(), which otherwise correctly sets dtypes for empty frames when those dtypes are explicitly passed.

I don't think it would be hard to return the correct dtype here. If date_parser is not set, we know the dtype is datetime64[ns]; otherwise, we can call the parser with empty data and use the dtype of the result.

Note that e.g. read_csv(..., dtype='datetime64[ns]') is not a solution, as this throws an error when the CSV is non-empty.
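The inference proposed above could be sketched roughly like this (a hypothetical helper for illustration, not pandas internals; `infer_parsed_date_dtype` is my own name):

```python
import pandas as pd

def infer_parsed_date_dtype(date_parser=None):
    # Hypothetical helper: infer the dtype a parsed date column would
    # have, even when the file contains no data rows.
    if date_parser is None:
        # Default parsing always produces nanosecond datetimes.
        return pd.Series([], dtype="datetime64[ns]").dtype
    # Call the user-supplied parser on empty input and use its result dtype.
    result = date_parser(pd.Series([], dtype=object))
    return result.dtype

# A UTC-aware parser, as in the example above:
parser = lambda x: pd.to_datetime(x, utc=True)
```

Calling the parser on an empty Series is cheap and recovers even tz-aware dtypes such as datetime64[ns, UTC].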

Expected Output

date    int64
dtype: object

date      datetime64[ns]
dtype: object

date      datetime64[ns, UTC]
dtype: object

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.9.1
IPython: 4.2.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: None
jinja2: 2.8.1
boto: 2.45.0
pandas_datareader: None
jreback commented 7 years ago

datetime parsing is pretty soft, meaning that unless it can unequivocally convert something, it won't do it. You can certainly force this after the fact via .to_datetime(...). When something cannot be converted to any other dtype, the default is simply object. This is the safest way to do things.
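Forcing the conversion after the fact, as suggested, could look like this (a minimal sketch, assuming the empty-frame behavior reported above):

```python
import io
import pandas as pd

# Reading a header-only CSV; the parsed date column may come back as object
df = pd.read_csv(io.StringIO("date"), parse_dates=["date"])

# Force the conversion after the fact
df["date"] = pd.to_datetime(df["date"])
```

After this, .dt accessor operations on the empty column no longer raise.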

I am not sure that I would coerce these empty columns like this (even though we certainly can, at least for the non-tz-aware case; the tz-aware case your example needs won't work at all, since the dtype is not defined). This can easily lead to mistakes, not to mention that parse_dates=True will try to soft-coerce lots of things (though I suppose that could ignore a forced conversion).

Thoughts?

@jorisvandenbossche cc @gfyoung

jreback commented 7 years ago

@adbull can you show a situation where this actually matters / makes a difference? Why are you trying to do this? For example, concatenating an empty frame with one that is correctly dtyped will work, so this doesn't make a practical difference.

adbull commented 7 years ago

Well, if I call read_csv with a specific column in parse_dates, I definitely want the column to be a datetime, and since the column is empty, it will unequivocally succeed in that conversion -- it's not like it can contain a parse error.

I'm not sure I follow your second paragraph? What won't work in my example?

Concatenating an empty frame of dtype object to a non-empty one of dtype datetime64[ns, UTC] throws an error; is that not intended?

Anyway, potential use-cases are (a) concatenating multiple CSVs, some of which may be empty; (b) regularly loading a CSV which is frequently updated and may or may not be empty. In (a), I would have to assign timezones after concat to stop it throwing an error; in (b), I would have to specifically check whether the frame is empty before attempting any timezone manipulation.

The workarounds aren't too hard, it just seems a bit awkward when the simple version would work fine for any other dtype.
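For use-case (a), the workaround amounts to coercing each empty frame's column before concatenating (a sketch using in-memory frames to stand in for the CSV-derived ones):

```python
import pandas as pd

# Stand-ins for frames read from CSVs: one empty (object dtype, as
# read_csv returns it) and one with tz-aware data
empty = pd.DataFrame({"date": pd.Series([], dtype=object)})
full = pd.DataFrame({"date": pd.to_datetime(["2017-01-01"], utc=True)})

# Workaround: coerce the empty frame's column before concatenating,
# so both columns share the datetime64[ns, UTC] dtype
empty["date"] = pd.to_datetime(empty["date"], utc=True)
combined = pd.concat([empty, full], ignore_index=True)
```

Without the coercion step, the concat mixes object and tz-aware columns, which is exactly the situation described above.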

gfyoung commented 7 years ago

@jreback : I'm torn on this. On the one hand, I agree with you that there is not a lot of value (from a functionality perspective) in supporting this behavior BUT that being said, @adbull does have a point about consistency. I turn your attention to some other inconsistencies related to this:

>>> Series([]).astype(np.datetime64)
Series([], dtype: datetime64[ns])
>>>
>>> Series([], dtype=np.datetime64)
...
TypeError: cannot convert datetimelike to dtype [datetime64]
gfyoung commented 7 years ago

@jreback : Thoughts on what I said above?

jreback commented 7 years ago

Series([]).astype(np.datetime64)
Series([], dtype: datetime64[ns])

Series([], dtype=np.datetime64)
...
TypeError: cannot convert datetimelike to dtype [datetime64]

is like what we are discussing in #15859

I agree that should work. As an aside, I think we need to systematically test all .astype calls vs. direct construction on empties (I can create another issue if you want). I mainly want the tests, so if you can add them that would be great, just xfailing the things that don't work.

gfyoung commented 7 years ago

Okay, sounds good. I suspect that what I presented here is related to #15859, and fixing that will trickle down to patching this issue.

gfyoung commented 7 years ago

@jreback : So the exception raised traces to maybe_cast_to_datetime, which specifically checks whether the specified dtype exactly matches datetime64[ns]; this is NOT the same as np.datetime64 (the latter dtype has no frequency).

Thus, we would need to soften the restrictions on casting to datetime64 (similarly for timedelta64). This is an API change. Thoughts?
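The mismatch is easy to see at the numpy level (the exact internal check in maybe_cast_to_datetime aside, the dtypes themselves compare unequal):

```python
import numpy as np

# The generic numpy type has no unit/frequency, so its dtype is not equal
# to the nanosecond-resolution dtype that pandas checks for exactly.
generic = np.dtype(np.datetime64)    # dtype('<M8'), no unit
nanos = np.dtype("datetime64[ns]")   # dtype('<M8[ns]')
```

So a strict equality check against datetime64[ns] rejects the generic np.datetime64, which is why the restriction would have to be softened.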

gfyoung commented 7 years ago

Also tracked down the cause of the original bug: _get_empty_meta, which is what gets called when there is no data to parse, has no handling for parse_dates, though it does for dtype.

Unfortunately, patching this is not so straightforward, because there is a lot of logic surrounding which columns get parsed depending on how you specify parse_dates. Replicating it feels very duplicative, since we already have it for the C and Python engines, but in two different implementations (one in Cython and the other in pure Python).

So yes, we can easily patch this by replicating the parse_dates logic, but that does not seem optimal...

@jreback , @jorisvandenbossche : Thoughts?

jreback commented 7 years ago

@gfyoung I think the date parsing needs to be factored out so it is generically applicable to both engines :> It's a project, but it will make this much more generic (without sacrificing any perf). Then this fix is easy.

gfyoung commented 7 years ago

@jreback : Any thoughts about my questions above regarding Series ?

jorisvandenbossche commented 7 years ago

@gfyoung so the question is whether we generally interpret np.datetime64 (if passed by the user) as 'datetime64[ns]', since 'ns' is the default resolution? And currently how this is handled differs between different functions?

gfyoung commented 7 years ago

@jorisvandenbossche : I think the inconsistency is pretty clearly illustrated in my above comment here. It should be one way or the other (raise or interpret as datetime64[ns]).

jreback commented 7 years ago

so we transform np.datetime64 -> datetime64[ns] (well, we actually interpret it according to whatever freq it actually has). In this case, passing the generic type just says: make it the default datetime dtype, so this would be totally fine to do.

IOW, I would be fine accepting Series([], dtype=np.datetime64). Note that the logic is in pandas.types.cast.maybe_cast_to_datetime.

Though you have to see what this breaks. If it's just specific validation tests, then we can change those.

jorisvandenbossche commented 7 years ago

I would actually also be fine with following numpy here, which means raising on dtype=np.datetime64. I don't see a strong reason to allow this.

gfyoung commented 7 years ago

@jorisvandenbossche : What do you mean?

>>> np.array([], dtype=np.datetime64)
array([], dtype=datetime64)
jreback commented 7 years ago

@gfyoung the bug here is this, right:

In [2]: pd.read_csv(StringIO('date'), parse_dates=['date']).dtypes
Out[2]: 
date    object
dtype: object

IOW we have an empty date column, which should be coerced to datetime64[ns]

This is similar to the following behavior. Where does np.datetime64 enter into this?

In [4]: pd.to_datetime([])
Out[4]: DatetimeIndex([], dtype='datetime64[ns]', freq=None)
jorisvandenbossche commented 7 years ago

Sorry, I assumed it was the same as on non-empty array. And in pandas non-empty both already raise:

In [11]: pd.Series([1], dtype=np.datetime64)
...
TypeError: cannot convert datetimelike to dtype [datetime64]

In [12]: pd.Series([1]).astype(np.datetime64)
...
SystemError: <built-in function cast_to_nanoseconds> returned a result with an error set

(although the second error message is not a correct one)

So I was mixing the empty and non-empty cases.

gfyoung commented 7 years ago

@jreback : The np.datetime64 issue came about because it seemed pertinent to the original issue.

jreback commented 7 years ago

@gfyoung yeah not sure where that exactly came up....

gfyoung commented 7 years ago

@jorisvandenbossche : Regarding those examples you bring up above, should we follow in numpy's footsteps? We only use one frequency, so I don't see why those two examples wouldn't work.

jreback commented 7 years ago

@gfyoung so I don't see where this is passing an explicit np.datetime64 for construction (it could be buried somewhere, of course). If that IS the case, then I think this is wrong. Can you point to it?

gfyoung commented 7 years ago

@jreback : So the original issue does not involve passing an explicit np.datetime64 (the reason for it occurring is something completely separate from that as later analysis shows).

RileyLazarou commented 3 years ago

Not sure if this is the right place to put this, but I'm having the same issue with read_sql. On occasion, my query will return no results and a dataframe with no records. In processing that data, I'll do something like

now = pd.Timestamp.now()  # assuming "now" is a current timestamp defined earlier
df = pd.read_sql(query, con=con, parse_dates=["datetime_column"])  # returns empty dataframe
df["seconds_passed"] = (now - df["datetime_column"]).dt.total_seconds()

which results in

AttributeError: Can only use .dt accessor with datetimelike values
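A defensive workaround for this pattern is to coerce the column after reading, so the .dt accessor works even when the query returns no rows (a sketch; the empty DataFrame below stands in for an empty read_sql result):

```python
import pandas as pd

# Stand-in for an empty read_sql result: zero rows, object dtype
df = pd.DataFrame({"datetime_column": pd.Series([], dtype=object)})

# Coerce unconditionally; this is a no-op if the column is already datetime
df["datetime_column"] = pd.to_datetime(df["datetime_column"])

now = pd.Timestamp.now()
df["seconds_passed"] = (now - df["datetime_column"]).dt.total_seconds()
```

The subtraction then yields an empty timedelta column instead of raising AttributeError.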