pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Assigning multiple columns converted to datetime back to a DataFrame subset fails #20511

Closed jesrael closed 2 years ago

jesrael commented 6 years ago

I want to convert multiple columns to datetimes:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.choice(['2015-01-01','2016-01-01'], size=(5,6))).add_prefix('date')
print (df)
        date0       date1       date2       date3       date4       date5
0  2015-01-01  2016-01-01  2015-01-01  2015-01-01  2016-01-01  2016-01-01
1  2016-01-01  2016-01-01  2016-01-01  2015-01-01  2016-01-01  2015-01-01
2  2015-01-01  2015-01-01  2016-01-01  2016-01-01  2016-01-01  2015-01-01
3  2016-01-01  2015-01-01  2015-01-01  2015-01-01  2015-01-01  2015-01-01
4  2015-01-01  2016-01-01  2015-01-01  2016-01-01  2016-01-01  2015-01-01

i = range(0, 6, 2)
df.iloc[:,i] = df.iloc[:,i].apply(lambda x: pd.to_datetime(x, errors='coerce'))
print (df.dtypes)
date0    object
date1    object
date2    object
date3    object
date4    object
date5    object
dtype: object

print (df)
                 date0       date1                date2       date3  \
0  1420070400000000000  2016-01-01  1420070400000000000  2015-01-01   
1  1451606400000000000  2016-01-01  1451606400000000000  2015-01-01   
2  1420070400000000000  2015-01-01  1451606400000000000  2016-01-01   
3  1451606400000000000  2015-01-01  1420070400000000000  2015-01-01   
4  1420070400000000000  2016-01-01  1420070400000000000  2016-01-01   

                 date4       date5  
0  1451606400000000000  2016-01-01  
1  1451606400000000000  2015-01-01  
2  1451606400000000000  2015-01-01  
3  1420070400000000000  2015-01-01  
4  1451606400000000000  2015-01-01  

I tested the conversion on its own:

print (df.iloc[:,i].apply(lambda x: pd.to_datetime(x, errors='coerce')).dtypes)
date0    datetime64[ns]
date2    datetime64[ns]
date4    datetime64[ns]
dtype: object

But assigning back converts the datetimes to unix timestamps. I also tested loc and hit the same problem.
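The loc variant is roughly this (a sketch with the same df, not the exact code I ran):

cols = df.columns[::2]   # date0, date2, date4
df.loc[:, cols] = df.loc[:, cols].apply(lambda x: pd.to_datetime(x, errors='coerce'))
print (df.dtypes)   # the sliced columns still come back as object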

print (pd.show_versions())

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: sk_SK
LOCALE: None.None

pandas: 0.22.0
pytest: 2.8.5
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.23.4
numpy: 1.12.1
scipy: 0.19.0
pyarrow: 0.7.0
xarray: None
IPython: 4.1.2
sphinx: 1.3.1
patsy: 0.4.1
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: 0.4.0
matplotlib: 2.0.0
openpyxl: 2.4.8
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.4.0
None
TomAugspurger commented 6 years ago

It's not entirely clear what you're trying to do. For

I want to convert multiple columns to datetimes:

you can use apply and to_datetime:

In [15]: df2 = df.apply(pd.to_datetime, errors='coerce')

In [16]: df2.dtypes
Out[16]:
date0    datetime64[ns]
date1    datetime64[ns]
date2    datetime64[ns]
date3    datetime64[ns]
date4    datetime64[ns]
date5    datetime64[ns]
dtype: object
jesrael commented 6 years ago

@TomAugspurger - sorry, maybe you missed this part:

But assigning back converts the datetimes to unix timestamps. I also tested loc and hit the same problem.

The conversion works nicely if I assign the result to a new DataFrame, but it fails if I assign it back to a subset of the original: the datetime columns are then converted to unix timestamps. Or is something else going on?
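For reference, assigning to a new DataFrame (the case that works) is roughly:

df2 = df.iloc[:, i].apply(lambda x: pd.to_datetime(x, errors='coerce'))
print (df2.dtypes)
date0    datetime64[ns]
date2    datetime64[ns]
date4    datetime64[ns]
dtype: object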

TomAugspurger commented 6 years ago

I'm not sure I understand. You might want to try searching stackoverflow under the pandas tag: https://stackoverflow.com/questions/tagged/pandas


jesrael commented 6 years ago

@TomAugspurger - Hmmm, I am user https://stackoverflow.com/users/2901002/jezrael

Let me try to explain in more detail:

df = pd.DataFrame(np.random.choice(['2015-01-01','2016-01-01'], size=(5,6))).add_prefix('date')
print (df)
        date0       date1       date2       date3       date4       date5
0  2015-01-01  2016-01-01  2015-01-01  2015-01-01  2016-01-01  2016-01-01
1  2016-01-01  2016-01-01  2016-01-01  2015-01-01  2016-01-01  2015-01-01
2  2015-01-01  2015-01-01  2016-01-01  2016-01-01  2016-01-01  2015-01-01
3  2016-01-01  2015-01-01  2015-01-01  2015-01-01  2015-01-01  2015-01-01
4  2015-01-01  2016-01-01  2015-01-01  2016-01-01  2016-01-01  2015-01-01

i = range(0, 6, 2)
df.iloc[:,i] = df.iloc[:,i].apply(lambda x: pd.to_datetime(x, errors='coerce'))

I get:

print (df.dtypes)
date0    object
date1    object
date2    object
date3    object
date4    object
date5    object
dtype: object

and I expected:

print (df.dtypes)
date0    datetime64[ns]
date1    object
date2    datetime64[ns]
date3    object
date4    datetime64[ns]
date5    object
dtype: object
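A per-column loop gives exactly these expected dtypes, so it works as a workaround (a sketch, not my original code):

for col in df.columns[::2]:   # date0, date2, date4
    df[col] = pd.to_datetime(df[col], errors='coerce')
print (df.dtypes)
date0    datetime64[ns]
date1            object
date2    datetime64[ns]
date3            object
date4    datetime64[ns]
date5            object
dtype: object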
shippy commented 5 years ago

@TomAugspurger This really looks like a bug. Let's see if I can explain it more clearly.

The bug presents in two ways:

  1. .apply(pd.to_datetime) called on a multi-column slice converts the columns to datetime64 after the call, but not during the assignment to the same multi-column slice.
  2. Further, assignment of the result of multi-column .apply(pd.to_datetime) transforms the datetime string to a nanosecond timestamp.

This violates expectations in two ways:

  1. The column doesn't look as expected, and there's no hint as to how to get the expected format back.
  2. The column doesn't behave as expected, e.g. it cannot be subtracted from other datetime columns (see the sketch right after this list).
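To illustrate point 2, here's a minimal sketch of my own (the bugged values re-created by hand):

import pandas as pd

s_bugged = pd.Series([1451606400000000000, 1451606400000000000], dtype=object)
s_dt = pd.to_datetime(pd.Series(['2016-01-02', '2016-01-03']))

# The bugged column has no .dt accessor:
# s_bugged.dt.year  ->  AttributeError: Can only use .dt accessor with datetimelike values

# Datetime arithmetic only works after an explicit re-conversion:
print (s_dt - pd.to_datetime(s_bugged, unit='ns'))   # 1 days / 2 days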

To demonstrate, let's set up a six-column DataFrame. The leftmost column will stay as-is, for side-by-side comparison; the rest will have pd.to_datetime called on them in various ways. By the end of the exercise, all five rightmost columns should be datetime64 and equal to one another, but only the three rightmost will be.

We'll recreate the bug in apply_multislice1 and apply_multislice2.

import numpy as np
import pandas as pd

test_df = (pd.DataFrame(np.random.choice(['2016-01-01'], size=(2, 6)))
        .add_prefix('date')
        .rename(columns={
            'date0': 'original',
            'date1': 'apply_multislice1',
            'date2': 'apply_multislice2',
            'date3': 'apply_unislice',
            'date4': 'apply_noslice',
            'date5': 'assigned'}))
     original apply_multislice1 apply_multislice2 apply_unislice apply_noslice    assigned
0  2016-01-01        2016-01-01        2016-01-01     2016-01-01    2016-01-01  2016-01-01
1  2016-01-01        2016-01-01        2016-01-01     2016-01-01    2016-01-01  2016-01-01

Bug demonstration

I'll quickly demonstrate the case that succeeds without assignment:

assert (test_df.loc[:, ['apply_multislice1', 'apply_multislice2']].apply(pd.to_datetime) == pd.to_datetime('2016-01-01')).all(axis=None)

If it's assigned back to itself, however, it changes value to a nanosecond timestamp that still has dtype object:

# Assignment
test_df.loc[:, ['apply_multislice1', 'apply_multislice2']] = test_df.loc[:, ['apply_multislice1', 'apply_multislice2']].apply(pd.to_datetime)

# Succeeds, but shouldn't:
assert (test_df.loc[:, ['apply_multislice1', 'apply_multislice2']] == 1451606400000000000).all(axis=None)
assert test_df.loc[:, ['apply_multislice1', 'apply_multislice2']].dtypes.isin([object]).all()

# Fails, but shouldn't:
assert (test_df.loc[:, ['apply_multislice1', 'apply_multislice2']] == pd.to_datetime('2016-01-01')).all(axis=None)
assert test_df.loc[:, ['apply_multislice1', 'apply_multislice2']].dtypes.isin(['datetime64']).all()

Variations without bug

I could think of three other ways to do the datetime conversion, either directly or through .apply with different slicing:

test_df.loc[:, 'apply_unislice'] = test_df.loc[:, ['apply_unislice']].apply(pd.to_datetime)
test_df.loc[:, 'apply_noslice'] = test_df.loc[:, 'apply_noslice'].apply(pd.to_datetime)
test_df.loc[:, 'assigned'] = pd.to_datetime(test_df.loc[:, 'assigned'])

# Succeeds, as expected:
assert (test_df.loc[:, ['apply_unislice', 'apply_noslice', 'assigned']] == pd.to_datetime('2016-01-01')).all(axis=None)
assert not test_df.loc[:, ['apply_unislice', 'apply_noslice', 'assigned']].dtypes.isin([object]).any(axis=None)

Final outcome

All five rightmost columns should have the same content and dtype, but by now we know they don't.

     original    apply_multislice1    apply_multislice2 apply_unislice apply_noslice   assigned
0  2016-01-01  1451606400000000000  1451606400000000000     2016-01-01    2016-01-01 2016-01-01
1  2016-01-01  1451606400000000000  1451606400000000000     2016-01-01    2016-01-01 2016-01-01
original                     object
apply_multislice1            object
apply_multislice2            object
apply_unislice       datetime64[ns]
apply_noslice        datetime64[ns]
assigned             datetime64[ns]

Recovery from the bug

If the bugged outcome in apply_multislice* is re-converted with unit='ns' and assigned back, pd.to_datetime recovers the dates correctly, so at least the values weren't lost.

# Continuing where we left off: Succeeds, but shouldn't:
assert (test_df.loc[:, ['apply_multislice1', 'apply_multislice2']] == 1451606400000000000).all(axis=None)
# Call pd.to_datetime(unit='ns')
test_df.loc[:, ['apply_multislice1', 'apply_multislice2']] = test_df.loc[:, ['apply_multislice1', 'apply_multislice2']].apply(pd.to_datetime, unit='ns')
# Operates as expected
assert (test_df.loc[:, ['apply_multislice1', 'apply_multislice2']] == pd.to_datetime('2016-01-01')).all(axis=None)

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-137-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.23.4
pytest: 4.0.0
pip: 18.1
setuptools: 40.6.2
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 5.8.0
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
TomAugspurger commented 5 years ago

@jesrael sorry I missed the fact that assignment was the buggy part.

In [48]: df = pd.DataFrame({"A": ['2015-01-01', '2015-01-02'], 'B': ['2015', '2016']})

In [49]: df2 = df.copy()

In [50]: df2.iloc[:, [0]] = pd.DataFrame({"A": pd.to_datetime(['2015', '2016'])})

In [51]: df2
Out[51]:
                     A     B
0  1420070400000000000  2015
1  1451606400000000000  2016

A few observations:

This doesn't occur when all of the columns are being updated:

In [64]: df2.iloc[:, [0, 1]] = pd.DataFrame({"A": pd.to_datetime(['2015', '2016']), 'B': pd.to_datetime(['2015', '2016'])})

In [65]: df2
Out[65]:
           A          B
0 2015-01-01 2015-01-01
1 2016-01-01 2016-01-01

The internal blocks are, unsurprisingly, incorrect:

In [69]: df2.iloc[:, [0]] = pd.DataFrame({"A": pd.to_datetime(['2015', '2016'])})

In [70]: df2._data
Out[70]:
BlockManager
Items: Index(['A', 'B'], dtype='object')
Axis 1: RangeIndex(start=0, stop=2, step=1)
ObjectBlock: slice(0, 2, 1), 2 x 2, dtype: object

We'd like to split that object block so that the newly assigned column becomes a DatetimeBlock.
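For comparison, a plain label assignment already produces that kind of split; a minimal sketch (exact _data repr may differ):

df3 = df.copy()
df3['A'] = pd.to_datetime(['2015', '2016'])
df3._data
# BlockManager now holds two blocks:
# a DatetimeBlock (1 x 2, datetime64[ns]) for 'A'
# and an ObjectBlock (1 x 2, object) for 'B'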

mroeschke commented 3 years ago

This looks okay on master. Could use a test.

In [13]: df = pd.DataFrame({"A": ['2015-01-01', '2015-01-02'], 'B': ['2015', '2016']})
    ...: df2 = df.copy()
    ...: df2.iloc[:, [0]] = pd.DataFrame({"A": pd.to_datetime(['2015', '2016'])})

In [14]: df2
Out[14]:
                     A     B
0  2015-01-01 00:00:00  2015
1  2016-01-01 00:00:00  2016
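A minimal sketch of such a test (the name and placement are guesses, not the test that was actually added):

import pandas as pd
from pandas import Timestamp


def test_iloc_setitem_datetime_into_object_column_subset():
    # GH 20511: assigning to_datetime output into a subset of object columns
    # via iloc should keep the datetime values, not raw nanosecond integers
    df = pd.DataFrame({"A": ["2015-01-01", "2015-01-02"], "B": ["2015", "2016"]})
    df.iloc[:, [0]] = pd.DataFrame({"A": pd.to_datetime(["2015", "2016"])})

    assert df["A"].tolist() == [Timestamp("2015-01-01"), Timestamp("2016-01-01")]
    assert df["B"].tolist() == ["2015", "2016"]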
parthi-siva commented 2 years ago

take