adbull opened 7 years ago
datetime parsing is pretty soft, meaning unless it can unequivocally convert something, it won't do it. You can certainly force this after the fact via `.to_datetime(...)`. The default when not converting something (to any dtype) is simply `object`. This is the safest way to do things.
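For example, a sketch of forcing the conversion after the fact:

```python
import pandas as pd

# an empty column that came back with object dtype
df = pd.DataFrame({"date": pd.Series([], dtype=object)})

# force the conversion after reading; trivially succeeds on empty data
df["date"] = pd.to_datetime(df["date"])
df.dtypes  # date    datetime64[ns]
```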
I am not sure that I would coerce these empty columns like this (even though we certainly can, at least for the non-tz-aware case; it won't work in your example at all, since the tz-aware dtype is not defined). This can easily lead to mistakes, not to mention that `parse_dates=True` will try to soft-coerce lots of things (though I suppose that could ignore a forced conversion).

Thoughts?
@jorisvandenbossche cc @gfyoung
@adbull can you show a situation where this actually matters / makes a difference? Why are you trying to do this? For example, concatting an empty frame with one that is correctly dtyped will work, so this doesn't practically make a difference.
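A sketch of the working case, assuming the empty frame already carries the datetime dtype:

```python
import pandas as pd

empty = pd.DataFrame({"ts": pd.Series([], dtype="datetime64[ns]")})
full = pd.DataFrame({"ts": pd.to_datetime(["2017-01-01"])})

# concat preserves the dtype when both frames agree on it
pd.concat([empty, full]).dtypes  # ts    datetime64[ns]
```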
Well, if I call `read_csv` with a specific column in `parse_dates`, I definitely want the column to be a datetime, and since the column is empty, it will unequivocally succeed in that conversion -- it's not like it can contain a parse error.
I'm not sure I follow you on the second paragraph? What won't work in my example?
Concatting an empty frame of dtype `object` to a non-empty one of dtype `datetime64[ns, UTC]` throws an error; is that not intended?
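A sketch of the failing case being described (behavior as of the pandas version under discussion; it may differ in later versions):

```python
import pandas as pd

empty = pd.DataFrame({"ts": pd.Series([], dtype=object)})  # e.g. read from an empty csv
full = pd.DataFrame({"ts": pd.to_datetime(["2017-01-01"], utc=True)})  # datetime64[ns, UTC]

pd.concat([empty, full])  # raised on the version discussed here
```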
Anyway, potential use-cases are (a) concatting together multiple csvs, some of which may be empty; (b) regularly loading a csv which is frequently updated, and may or may not be empty. In (a), I would have to assign timezones after concat to stop it throwing an error; in (b), I would have to specifically check whether the frame is empty before attempting any timezone manipulation.
The workarounds aren't too hard; it just seems a bit awkward when the simple version would work fine for any other dtype.
@jreback : I'm torn on this. On the one hand, I agree with you that there is not a lot of value (from a functionality perspective) in supporting this behavior BUT that being said, @adbull does have a point about consistency. I turn your attention to some other inconsistencies related to this:
```python
>>> Series([]).astype(np.datetime64)
Series([], dtype: datetime64[ns])
>>>
>>> Series([], dtype=np.datetime64)
...
TypeError: cannot convert datetimelike to dtype [datetime64]
```
The second case is like what we are discussing in #15859.
I agree that should work. As an aside, I think we need to systematically test all `.astype` calls vs direct construction on empties (can create another issue if you want). Mainly I want the test, so if you can add that, it would be great, just xfailing the things that don't work.
Okay, sounds good. I suspect that what I presented here is related to #15859; fixing that first will probably trickle down to patching this issue.
@jreback: So the exception raised traces to `maybe_cast_to_datetime`, which specifically checks whether the specified `dtype` exactly matches `datetime64[ns]`. That is NOT the same as `np.datetime64` (no frequency in the latter `dtype`).
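The distinction, for reference, in plain numpy (output shown in comments):

```python
import numpy as np

np.dtype(np.datetime64)      # dtype('<M8')     -> generic, no frequency attached
np.dtype("datetime64[ns]")   # dtype('<M8[ns]') -> fixed nanosecond frequency
```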
Thus, we would need to soften the restrictions on casting to `datetime64` (and similarly for `timedelta64`). This is an API change. Thoughts?
Also tracked down the cause of the original bug: `_get_empty_meta`, which is what is called when there is no data to parse, has no handling for `parse_dates`, though it does for `dtype`.
Unfortunately, patching this is not so straightforward, because there is a lot of logic surrounding which columns get parsed depending on how you specify `parse_dates`. That logic also feels very duplicative: we already have it for both the C and Python engines, but in two different implementations (one in Cython and the other in pure Python). So yes, we can easily patch this by replicating the `parse_dates` logic, but that does not seem optimal...
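A minimal sketch of the kind of handling being proposed, using a simplified stand-in for `_get_empty_meta` (the function below is illustrative, not pandas internals):

```python
import pandas as pd

def get_empty_meta(columns, dtype=None, parse_dates=None):
    # Illustrative stand-in: build an empty frame that honors both an
    # explicit dtype mapping and parse_dates.
    dtype = dtype or {}
    parse_dates = parse_dates or []
    data = {}
    for col in columns:
        if col in parse_dates:
            # An empty column named in parse_dates can always be coerced:
            # there is nothing that could fail to parse.
            data[col] = pd.Series([], dtype="datetime64[ns]")
        else:
            data[col] = pd.Series([], dtype=dtype.get(col, object))
    return pd.DataFrame(data, columns=columns)
```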
@jreback, @jorisvandenbossche: Thoughts?
@gfyoung I think the date parsing needs to be factored out so it is generically applicable to both engines :> It's a project, but it will make this much more generic (and not sacrifice any perf). Then this is easy.
@gfyoung so the question is whether we generally interpret `np.datetime64` (if passed by the user) as `'datetime64[ns]'`, since 'ns' is the default resolution? And whether how this is handled currently differs between functions?
@jorisvandenbossche: I think the inconsistency is pretty clearly illustrated in my comment above. It should be one way or the other (raise, or interpret as `datetime64[ns]`).
So we transform `np.datetime64` -> `'datetime64[ns]'` (well, we actually interpret it according to whatever freq it actually is). In this case, this just says hey, make it the default datetime dtype, so this would be totally fine to do.
IOW, I would be fine accepting `Series([], dtype=np.datetime64)`. Note that the logic is in `pandas.types.cast.maybe_cast_to_datetime`.
Though you have to see what this does break. If it's just specific validation tests, then we can change those.
I would actually also be fine with following numpy here, which means raising on `dtype=np.datetime64`. I don't see a strong reason to allow this.
@jorisvandenbossche: What do you mean?
```python
>>> np.array([], dtype=np.datetime64)
array([], dtype=datetime64)
```
@gfyoung the bug here is this, right:
```python
In [2]: pd.read_csv(StringIO('date'), parse_dates=['date']).dtypes
Out[2]:
date    object
dtype: object
```
IOW we have an empty date column, which should be coerced to `datetime64[ns]`. This is similar to the behavior below. Where does `np.datetime64` enter this?
```python
In [4]: pd.to_datetime([])
Out[4]: DatetimeIndex([], dtype='datetime64[ns]', freq=None)
```
Sorry, I assumed it was the same as for a non-empty array. And in pandas, both already raise for non-empty input:
```python
In [11]: pd.Series([1], dtype=np.datetime64)
...
TypeError: cannot convert datetimelike to dtype [datetime64]

In [12]: pd.Series([1]).astype(np.datetime64)
...
SystemError: <built-in function cast_to_nanoseconds> returned a result with an error set
```
(although the second error message is not a correct one). So I was mixing up the empty and non-empty cases.
@jreback: The `np.datetime64` issue came about because it seemed pertinent to the original issue.
@gfyoung yeah not sure where that exactly came up....
@jorisvandenbossche: As for those examples you brought up above, should we follow in numpy's footsteps? We only use one frequency, so I don't see why those two examples wouldn't work.
@gfyoung so I don't see where this is passing an explicit `np.datetime64` for construction (could be buried somewhere, of course). If that IS the case, then I think this is wrong. Can you point to this?
@jreback: So the original issue does not involve passing an explicit `np.datetime64` (the reason it occurs is something completely separate, as the later analysis shows).
Not sure if this is the right place to put this, but I'm having the same issue with `read_sql`. On occasion, my query will return no results and a dataframe with no records. In processing that data, I'll do something like
```python
df = pd.read_sql(query, con=con, parse_dates=["datetime_column"])  # returns empty dataframe
df["seconds_passed"] = (now - df["datetime_column"]).dt.total_seconds()
```
which results in

```
AttributeError: Can only use .dt accessor with datetimelike values
```
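One possible workaround, sketched with the same `query`, `con`, and `now` names as the snippet above: force the dtype after reading, which succeeds trivially on an empty column.

```python
df = pd.read_sql(query, con=con, parse_dates=["datetime_column"])
# coerce explicitly; a no-op for non-empty results, and it fixes the
# object dtype when the query returns no rows
df["datetime_column"] = pd.to_datetime(df["datetime_column"])
df["seconds_passed"] = (now - df["datetime_column"]).dt.total_seconds()
```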
Code Sample, a copy-pastable example if possible
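A minimal reproduction, adapted from the example shown earlier in the thread:

```python
import io
import pandas as pd

# a CSV consisting of only a header, with no data rows
pd.read_csv(io.StringIO('date'), parse_dates=['date']).dtypes
# date    object
# dtype: object
```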
Problem description
When reading CSVs with no data rows, `read_csv()` returns the dtype `object` for dates, which can raise errors on later manipulation. This is contrary to the general behaviour of `read_csv()`, which otherwise correctly sets dtypes for empty frames when those dtypes are explicitly passed.

I don't think it would be hard to return the correct dtype here? If `date_parser` is not set, we know the dtype is `datetime64[ns]`; otherwise, we can call the parser with empty data and use the returned dtype.

Note that e.g. `read_csv(..., dtype='datetime64[ns]')` is not a solution, as this throws an error when the csv is non-empty.

Expected Output
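Presumably, for the reproduction above, something like:

```
date    datetime64[ns]
dtype: object
```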
Output of `pd.show_versions()`