Closed jesrael closed 2 years ago
It's not entirely clear what you're trying to do. For
"I want to convert multiple columns to datetimes:"
you can use apply and to_datetime:
In [15]: df2 = df.apply(pd.to_datetime, errors='coerce')
In [16]: df2.dtypes
Out[16]:
date0 datetime64[ns]
date1 datetime64[ns]
date2 datetime64[ns]
date3 datetime64[ns]
date4 datetime64[ns]
date5 datetime64[ns]
dtype: object
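For completeness, a minimal setup that produces a dtypes listing like the one above might look like this (the date0..date5 column names come from the output; the actual date values are assumptions):

```python
import numpy as np
import pandas as pd

# Assumed setup: six columns of date strings named date0..date5,
# matching the dtypes listing above.
df = pd.DataFrame(
    np.random.choice(['2015-01-01', '2016-01-01'], size=(5, 6))
).add_prefix('date')

# Convert every column; unparseable strings become NaT
df2 = df.apply(pd.to_datetime, errors='coerce')
print(df2.dtypes)
```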
@TomAugspurger - sorry, maybe you missed it:
But assigning back converts the datetimes to unix dates. I also tested loc and got the same problem.
It converts nicely if assigned to a new DataFrame, but fails if assigned to a subset - then the datetime columns are converted to unix datetimes. Or is it something else?
I'm not sure I understand. You might want to try searching stackoverflow under the pandas tag: https://stackoverflow.com/questions/tagged/pandas
@TomAugspurger - Hmmm, I am the user https://stackoverflow.com/users/2901002/jezrael
I'll try to explain more:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.choice(['2015-01-01','2016-01-01'], size=(5,6))).add_prefix('date')
print (df)
date0 date1 date2 date3 date4 date5
0 2015-01-01 2016-01-01 2015-01-01 2015-01-01 2016-01-01 2016-01-01
1 2016-01-01 2016-01-01 2016-01-01 2015-01-01 2016-01-01 2015-01-01
2 2015-01-01 2015-01-01 2016-01-01 2016-01-01 2016-01-01 2015-01-01
3 2016-01-01 2015-01-01 2015-01-01 2015-01-01 2015-01-01 2015-01-01
4 2015-01-01 2016-01-01 2015-01-01 2016-01-01 2016-01-01 2015-01-01
i = range(0, 6, 2)
df.iloc[:,i] = df.iloc[:,i].apply(lambda x: pd.to_datetime(x, errors='coerce'))
I get:
print (df.dtypes)
date0 object
date1 object
date2 object
date3 object
date4 object
date5 object
dtype: object
and I expected:
print (df.dtypes)
date0 datetime64[ns]
date1 object
date2 datetime64[ns]
date3 object
date4 datetime64[ns]
date5 object
dtype: object
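A workaround that sidesteps the buggy partial-slice assignment is to convert the selected columns one at a time, so each assignment replaces a whole column (a sketch of the same setup):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.choice(['2015-01-01', '2016-01-01'], size=(5, 6))
).add_prefix('date')

# Converting one column at a time avoids assigning back through a
# multi-column iloc slice, so each column's dtype updates correctly.
for col in df.columns[::2]:  # date0, date2, date4
    df[col] = pd.to_datetime(df[col], errors='coerce')

print(df.dtypes)
```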
@TomAugspurger This really looks like a bug. Let's see if I can explain it more clearly.
The bug presents in two ways:
1. .apply(pd.to_datetime) called on a multi-column slice converts the columns to datetime64 after the call, but not during assignment back to the same multi-column slice.
2. During that assignment, .apply(pd.to_datetime) transforms the datetime strings into nanosecond timestamps that keep dtype object.
To demonstrate, let's set up a six-column DataFrame. The leftmost column will stay as-is, for side-by-side comparison; the rest will have pd.to_datetime called upon them in various forms. All five rightmost columns should be datetime64 and equal to one another by the end of the exercise. Only the three rightmost will be.
We'll recreate the bug in apply_multislice1 and apply_multislice2.
import numpy as np
import pandas as pd

test_df = (pd.DataFrame(np.random.choice(['2016-01-01'], size=(2, 6)))
.add_prefix('date')
.rename(columns={
'date0': 'original',
'date1': 'apply_multislice1',
'date2': 'apply_multislice2',
'date3': 'apply_unislice',
'date4': 'apply_noslice',
'date5': 'assigned'}))
original apply_multislice1 apply_multislice2 apply_unislice apply_noslice assigned
0 2016-01-01 2016-01-01 2016-01-01 2016-01-01 2016-01-01 2016-01-01
1 2016-01-01 2016-01-01 2016-01-01 2016-01-01 2016-01-01 2016-01-01
I'll quickly demonstrate the case that succeeds without assignment:
assert (test_df.loc[:, ['apply_multislice1', 'apply_multislice2']].apply(pd.to_datetime) == pd.to_datetime('2016-01-01')).all(axis=None)
If it's assigned back to itself, however, the value changes to a nanosecond timestamp that still has dtype object:
# Assignment
test_df.loc[:, ['apply_multislice1', 'apply_multislice2']] = test_df.loc[:, ['apply_multislice1', 'apply_multislice2']].apply(pd.to_datetime)
# Succeeds, but shouldn't:
assert (test_df.loc[:, ['apply_multislice1', 'apply_multislice2']] == 1451606400000000000).all(axis=None)
assert test_df.loc[:, ['apply_multislice1', 'apply_multislice2']].dtypes.isin([object]).all()
# Fails, but shouldn't:
assert (test_df.loc[:, ['apply_multislice1', 'apply_multislice2']] == pd.to_datetime('2016-01-01')).all(axis=None)
assert test_df.loc[:, ['apply_multislice1', 'apply_multislice2']].dtypes.isin(['datetime64[ns]']).all()
I could think of three other ways to do the datetime conversion - directly, and through .apply with different slicing:
test_df.loc[:, 'apply_unislice'] = test_df.loc[:, ['apply_unislice']].apply(pd.to_datetime)
test_df.loc[:, 'apply_noslice'] = test_df.loc[:, 'apply_noslice'].apply(pd.to_datetime)
test_df.loc[:, 'assigned'] = pd.to_datetime(test_df.loc[:, 'assigned'])
# Succeeds, as expected:
assert (test_df.loc[:, ['apply_unislice', 'apply_noslice', 'assigned']] == pd.to_datetime('2016-01-01')).all(axis=None)
assert not test_df.loc[:, ['apply_unislice', 'apply_noslice', 'assigned']].dtypes.isin([object]).any(axis=None)
All five rightmost columns should have the same content and dtype, but by now we know they don't.
original apply_multislice1 apply_multislice2 apply_unislice apply_noslice assigned
0 2016-01-01 1451606400000000000 1451606400000000000 2016-01-01 2016-01-01 2016-01-01
1 2016-01-01 1451606400000000000 1451606400000000000 2016-01-01 2016-01-01 2016-01-01
original object
apply_multislice1 object
apply_multislice2 object
apply_unislice datetime64[ns]
apply_noslice datetime64[ns]
assigned datetime64[ns]
If the bugged outcome in apply_multislice* is converted as nanoseconds and assigned, pd.to_datetime will change the dtype correctly, so at least the values weren't lost.
# Continuing where we left off: Succeeds, but shouldn't:
assert (test_df.loc[:, ['apply_multislice1', 'apply_multislice2']] == 1451606400000000000).all(axis=None)
# Call pd.to_datetime(unit='ns')
test_df.loc[:, ['apply_multislice1', 'apply_multislice2']] = test_df.loc[:, ['apply_multislice1', 'apply_multislice2']].apply(pd.to_datetime, unit='ns')
# Operates as expected
assert (test_df.loc[:, ['apply_multislice1', 'apply_multislice2']] == pd.to_datetime('2016-01-01')).all(axis=None)
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-137-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.23.4
pytest: 4.0.0
pip: 18.1
setuptools: 40.6.2
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 5.8.0
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
@jesrael sorry I missed the fact that assignment was the buggy part.
In [48]: df = pd.DataFrame({"A": ['2015-01-01', '2015-01-02'], 'B': ['2015', '2016']})
In [49]: df2 = df.copy()
In [50]: df2.iloc[:, [0]] = pd.DataFrame({"A": pd.to_datetime(['2015', '2016'])})
In [51]: df2
Out[51]:
A B
0 1420070400000000000 2015
1 1451606400000000000 2016
A few observations:
This doesn't occur when all of the columns are being updated:
In [64]: df2.iloc[:, [0, 1]] = pd.DataFrame({"A": pd.to_datetime(['2015', '2016']), 'B': pd.to_datetime(['2015', '2016'])})
In [65]: df2
Out[65]:
A B
0 2015-01-01 2015-01-01
1 2016-01-01 2016-01-01
The internal blocks are, unsurprisingly, incorrect:
In [69]: df2.iloc[:, [0]] = pd.DataFrame({"A": pd.to_datetime(['2015', '2016'])})
In [70]: df2._data
Out[70]:
BlockManager
Items: Index(['A', 'B'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
ObjectBlock: slice(0, 2, 1), 2 x 2, dtype: object
we'd like to split that object block so that the newly assigned column becomes a DatetimeBlock.
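Until that block split happens, a workaround for this minimal case (a sketch) is to assign the single column by label rather than through iloc:

```python
import pandas as pd

df2 = pd.DataFrame({"A": ['2015-01-01', '2015-01-02'],
                    "B": ['2015', '2016']})

# Replacing the column by label swaps in a fresh datetime64 column,
# so it gets its own block instead of being written into the
# existing 2 x 2 object block.
df2['A'] = pd.to_datetime(['2015', '2016'])
print(df2.dtypes)
```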
This looks okay on master. Could use a test.
In [13]: df = pd.DataFrame({"A": ['2015-01-01', '2015-01-02'], 'B': ['2015', '2016']})
    ...: df2 = df.copy()
    ...: df2.iloc[:, [0]] = pd.DataFrame({"A": pd.to_datetime(['2015', '2016'])})
In [14]: df2
Out[14]:
A B
0 2015-01-01 00:00:00 2015
1 2016-01-01 00:00:00 2016
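A minimal regression check along the lines requested might look like this (a sketch only; in pandas' own suite it would presumably use the pandas._testing helpers and live with the indexing tests, which is an assumption here):

```python
import pandas as pd

# Sketch of a regression check for partial-column datetime assignment.
df = pd.DataFrame({"A": ['2015-01-01', '2015-01-02'],
                   "B": ['2015', '2016']})
df.iloc[:, [0]] = pd.DataFrame({"A": pd.to_datetime(['2015', '2016'])})

# The assigned column should hold real datetimes,
# not integer nanoseconds; the other column is untouched.
assert (df['A'] == pd.to_datetime(['2015-01-01', '2016-01-01'])).all()
assert df['B'].tolist() == ['2015', '2016']
```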
take