pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

pd.read_sql timestamptz converted to object dtype #30207

Open ThibTrip opened 4 years ago

ThibTrip commented 4 years ago

Code Sample, a copy-pastable example if possible

from pandas import Timestamp, NaT
import pandas as pd
from sqlalchemy import create_engine

# create test data
data = {'create_date': [Timestamp('2019-11-22 10:59:44+0000', tz='UTC'),
                        Timestamp('2019-11-21 15:27:41+0000', tz='UTC'),
                        Timestamp('2019-11-21 15:25:42+0000', tz='UTC'),
                        Timestamp('2019-11-19 14:35:52+0000', tz='UTC'),
                        Timestamp('2019-11-19 13:54:44+0000', tz='UTC'),
                        Timestamp('2019-11-14 15:12:00+0000', tz='UTC'),
                        Timestamp('2019-08-07 13:37:04+0000', tz='UTC'),
                        Timestamp('2019-12-04 14:47:42+0000', tz='UTC')],
       'change_date': [Timestamp('2019-11-22 10:59:44+0000', tz='UTC'),
                       Timestamp('2019-11-21 15:27:42+0000', tz='UTC'),
                       Timestamp('2019-11-21 15:25:43+0000', tz='UTC'),
                       Timestamp('2019-11-19 14:35:53+0000', tz='UTC'),
                       Timestamp('2019-11-19 13:54:45+0000', tz='UTC'),
                       Timestamp('2019-11-14 15:13:33+0000', tz='UTC'),
                       Timestamp('2019-08-09 13:01:13+0000', tz='UTC'),
                       Timestamp('2019-12-04 14:54:57+0000', tz='UTC')],
        'unsubscribe_date': [NaT,
                             Timestamp('2019-12-09 12:58:01+0000', tz='UTC'),
                             Timestamp('2019-12-09 12:58:28+0000', tz='UTC'),
                             NaT,
                             Timestamp('2019-12-09 12:58:24+0000', tz='UTC'),
                             Timestamp('2019-12-09 12:58:19+0000', tz='UTC'),
                             NaT,
                             NaT]}

# create DataFrame from test data 
df_test = pd.DataFrame(data)
print('test DataFrame\n', df_test, '\n')
print(df_test.dtypes, '\n')

# connect to a postgres database (set CONNECTION_STRING)
engine = create_engine('CONNECTION_STRING')

# save DataFrame to postgres
df_test.to_sql(name = 'timezone_test',
               con = engine, 
               index = False,
               if_exists = 'fail')

# make sure the data type in postgres is "timestamp with time zone"
df_db_dtypes = pd.read_sql("""SELECT column_name, data_type FROM information_schema.columns 
                              WHERE table_name = 'timezone_test'
                              AND table_schema = 'public';""", 
                           con = engine, 
                           index_col = 'column_name')
print('datatypes of test DataFrame in postgres\n', df_db_dtypes, '\n')

# read DataFrame from postgres
df_db = pd.read_sql('SELECT * FROM timezone_test', con = engine)
print('test DataFrame as read from postgres\n', df_db, '\n')
# data types should all be datetime64[ns, UTC]
print(df_db.dtypes, '\n')

# attempt to append the same data in the postgres table
try:
    df_db.to_sql(name = 'timezone_test', 
                 con = engine,
                 index = False,
                 if_exists = 'append')
except Exception as e:
    # we should get the error "Tz-aware datetime.datetime cannot be converted
    # to datetime64 unless utc=True"
    print('could not append the same data:',e,'\n')

# I think the problem is that pandas recognizes that the postgres data types are
# "timestamptz" and tries to convert from "object" to the datetime64 UTC dtype
# before saving, but fails
try:
    pd.to_datetime(df_db['create_date']) # it would work with argument utc = True
except Exception as e:
    print('could not convert column "create_date" to datetime64[ns, UTC]:', e)

Output

test DataFrame
                 create_date               change_date          unsubscribe_date
0 2019-11-22 10:59:44+00:00 2019-11-22 10:59:44+00:00                       NaT
1 2019-11-21 15:27:41+00:00 2019-11-21 15:27:42+00:00 2019-12-09 12:58:01+00:00
2 2019-11-21 15:25:42+00:00 2019-11-21 15:25:43+00:00 2019-12-09 12:58:28+00:00
3 2019-11-19 14:35:52+00:00 2019-11-19 14:35:53+00:00                       NaT
4 2019-11-19 13:54:44+00:00 2019-11-19 13:54:45+00:00 2019-12-09 12:58:24+00:00
5 2019-11-14 15:12:00+00:00 2019-11-14 15:13:33+00:00 2019-12-09 12:58:19+00:00
6 2019-08-07 13:37:04+00:00 2019-08-09 13:01:13+00:00                       NaT
7 2019-12-04 14:47:42+00:00 2019-12-04 14:54:57+00:00                       NaT 

create_date         datetime64[ns, UTC]
change_date         datetime64[ns, UTC]
unsubscribe_date    datetime64[ns, UTC]
dtype: object 

datatypes of test DataFrame in postgres
                                  data_type
column_name                               
create_date       timestamp with time zone
change_date       timestamp with time zone
unsubscribe_date  timestamp with time zone 

test DataFrame as read from postgres
                  create_date                change_date          unsubscribe_date
0  2019-11-22 11:59:44+01:00  2019-11-22 11:59:44+01:00                       NaT
1  2019-11-21 16:27:41+01:00  2019-11-21 16:27:42+01:00 2019-12-09 12:58:01+00:00
2  2019-11-21 16:25:42+01:00  2019-11-21 16:25:43+01:00 2019-12-09 12:58:28+00:00
3  2019-11-19 15:35:52+01:00  2019-11-19 15:35:53+01:00                       NaT
4  2019-11-19 14:54:44+01:00  2019-11-19 14:54:45+01:00 2019-12-09 12:58:24+00:00
5  2019-11-14 16:12:00+01:00  2019-11-14 16:13:33+01:00 2019-12-09 12:58:19+00:00
6  2019-08-07 15:37:04+02:00  2019-08-09 15:01:13+02:00                       NaT
7  2019-12-04 15:47:42+01:00  2019-12-04 15:54:57+01:00                       NaT 

create_date                      object
change_date                      object
unsubscribe_date    datetime64[ns, UTC]
dtype: object 

could not append the same data: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True 

could not convert column "create_date" to datetime64[ns, UTC]: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True

Problem description

There are 2 problems here:

  1. pandas reads 2 columns of the test DataFrame I saved in postgres as "object" instead of "datetime64[ns, UTC]", although in postgres all three columns have the data type "timestamp with time zone"

  2. when I attempt to append the postgres table to itself via pandas (pd.read_sql then pd.DataFrame.to_sql), it fails while trying to convert the 2 "object" columns to datetime. So perhaps there is an issue with pd.to_datetime, or I am missing something here.

I would open a separate issue for the second problem, but I cannot reproduce it other than with this workflow.
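In the meantime, a possible workaround (just a sketch, reusing the engine and the timezone_test table from the example above) is to let read_sql run pd.to_datetime with utc=True on the affected columns through its parse_dates argument, or to convert the columns explicitly afterwards:

# workaround sketch: have read_sql apply pd.to_datetime(..., utc = True)
# to the affected columns via the parse_dates argument
df_db = pd.read_sql('SELECT * FROM timezone_test',
                    con = engine,
                    parse_dates = {'create_date': {'utc': True},
                                   'change_date': {'utc': True},
                                   'unsubscribe_date': {'utc': True}})

# alternatively, convert the columns explicitly before appending
for col in ['create_date', 'change_date', 'unsubscribe_date']:
    df_db[col] = pd.to_datetime(df_db[col], utc = True)

# all three columns should now be datetime64[ns, UTC] and the append succeeds
print(df_db.dtypes)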

Expected Output

df_db.to_sql(name = 'timezone_test', 
             con = engine,
             index = False,
             if_exists = 'append')

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 85 Stepping 4, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.3
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 41.0.1
Cython : 0.29.12
pytest : 5.0.1
hypothesis : None
sphinx : 2.1.2
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : 4.3.4
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.8.3 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.6.1
pandas_datareader: None
bs4 : 4.7.1
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.3.4
matplotlib : 3.1.0
numexpr : 2.6.9
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : 0.14.0
pytables : None
s3fs : None
scipy : 1.2.1
sqlalchemy : 1.3.5
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.1.8
ThibTrip commented 4 years ago

I have found the culprit; a similar problem came up in PR #11216.

So in the end the issue is not directly related to SQL, and I think the example below pinpoints the problem. I don't see a solution other than converting to UTC and thereby discarding the original offsets.

import pandas as pd
import datetime
import psycopg2
from pandas.api.types import is_datetime64_any_dtype

# create datetimes with different offsets (60 and 120 minutes respectively)
data = [[datetime.datetime(2019, 11, 14, 16, 12, 
         tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60))],
        [datetime.datetime(2019, 8, 7, 15, 37, 4, 
         tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=120))]]

# different offsets cause the data to be read with object dtype
# instead of any datetime dtype
# pd.DataFrame.from_records is what the functions behind read_sql use
df = pd.DataFrame.from_records(data, columns = ['ts'])
df.dtypes
ts    object
dtype: object

# also this outputs False instead of True
is_datetime64_any_dtype(df['ts'])
False

# when saving with to_sql, pd.to_datetime gets called on the column,
# but it won't work since there are different offsets
pd.to_datetime(df['ts'])
Traceback

```python-traceback
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/GitHub/pandas_master/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object)
   1968     try:
-> 1969         values, tz_parsed = conversion.datetime_to_datetime64(data)
   1970         # If tzaware, these values represent unix timestamps, so we

~/GitHub/pandas_master/pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.datetime_to_datetime64()

ValueError: Array must be all same time zone

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
 in
----> 1 pd.to_datetime(df['ts'])

~/GitHub/pandas_master/pandas/core/tools/datetimes.py in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
    717             result = arg.map(cache_array)
    718         else:
--> 719             values = convert_listlike(arg._values, format)
    720             result = arg._constructor(values, index=arg.index, name=arg.name)
    721     elif isinstance(arg, (ABCDataFrame, abc.MutableMapping)):

~/GitHub/pandas_master/pandas/core/tools/datetimes.py in _convert_listlike_datetimes(arg, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
    431             errors=errors,
    432             require_iso8601=require_iso8601,
--> 433             allow_object=True,
    434         )
    435

~/GitHub/pandas_master/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object)
   1972             return values.view("i8"), tz_parsed
   1973         except (ValueError, TypeError):
-> 1974             raise e
   1975
   1976     if tz_parsed is not None:

~/GitHub/pandas_master/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object)
   1963             dayfirst=dayfirst,
   1964             yearfirst=yearfirst,
-> 1965             require_iso8601=require_iso8601,
   1966         )
   1967     except ValueError as e:

~/GitHub/pandas_master/pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime()

~/GitHub/pandas_master/pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime()

ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True
```

Solution with caveat (offsets are dropped)

pd.to_datetime(df['ts'], utc = True)
0   2019-11-14 15:12:00+00:00
1   2019-08-07 13:37:04+00:00
Name: ts, dtype: datetime64[ns, UTC]
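Note that converting with utc=True only drops the original fixed offsets, not the underlying instants; if local wall times are needed afterwards, the UTC values can be converted to a concrete timezone (small sketch, 'Europe/Berlin' is just an arbitrary example):

# same instants viewed in a named timezone after the UTC conversion
pd.to_datetime(df['ts'], utc = True).dt.tz_convert('Europe/Berlin')
0   2019-11-14 16:12:00+01:00
1   2019-08-07 15:37:04+02:00
Name: ts, dtype: datetime64[ns, Europe/Berlin]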

I tested this directly on master, so you'll find the output of pd.show_versions() below again.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-20-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.26.0.dev0+1382.g3577b5a34.dirty
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 42.0.2.post20191201
Cython : 0.29.14
pytest : 5.3.2
hypothesis : 4.55.4
sphinx : 2.3.0
blosc : None
feather : None
xlsxwriter : 1.2.6
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.10.1
pandas_datareader: None
bs4 : 4.8.1
bottleneck : 1.3.1
fastparquet : 0.3.2
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.1
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.3.2
s3fs : 0.4.0
scipy : 1.4.0
sqlalchemy : 1.3.11
tables : 3.6.1
xarray : 0.14.1
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.6
mroeschke commented 4 years ago

It's pandas' policy to convert TIMESTAMP WITH TIME ZONE database types to UTC, so converting the incoming data to UTC is an acceptable solution. Happy to have a PR with the change!

https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#datetime-data-types
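For reference, the kind of coercion this policy implies could look roughly like the sketch below. This is not pandas' actual internal code and _coerce_mixed_offset_columns is just an illustrative name; it simply applies pd.to_datetime(..., utc = True) to object columns that contain only tz-aware datetimes:

import pandas as pd

def _coerce_mixed_offset_columns(frame):
    # illustrative helper, not pandas internals: coerce object columns that
    # hold only tz-aware datetimes (possibly with mixed fixed offsets) to
    # datetime64[ns, UTC], matching the documented handling of
    # TIMESTAMP WITH TIME ZONE columns
    for col in frame.columns:
        if frame[col].dtype != object:
            continue
        values = frame[col].dropna()
        if len(values) and all(getattr(v, 'tzinfo', None) is not None for v in values):
            frame[col] = pd.to_datetime(frame[col], utc = True)
    return frame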

ThibTrip commented 4 years ago

Hi @mroeschke, thanks for your answer :). I was able to patch the to_sql function. I was also able to patch the read_sql function, but perhaps that is not a desired behavior. I will explain everything in a PR soon (after the Christmas holidays).

ThibTrip commented 4 years ago

take