mymarilyn / clickhouse-driver

ClickHouse Python Driver with native interface support
https://clickhouse-driver.readthedocs.io
Other
1.21k stars 213 forks source link

Pandas DateTime column timezone incorrectly converted at insertion #380

Open 0liu opened 1 year ago

0liu commented 1 year ago

Describe the bug When inserting a pandas dataframe to a Clickhouse table, and a DateTime column in the table is defined with a timezone other than UTC, timezone-aware datatime columns in the pandas dataframe will be incorrectly localized and inserted as UTC time. The actual timestamps stored in the database are shifted from the original timestamps.

This could be due to the incorrect timezone localization and conversion in apply_timezones_before_write

            ts = pd.to_datetime(items).tz_localize(timezone)

        ts = ts.tz_convert('UTC')

The items returned from block.get_column_by_index should be UNIX timestamps and are always in UTC, but pd.to_datetime(items).tz_localize(timezone) localizes the timestamp to the timezone defined in the table column ('America/Chicago' in the example below) incorrectly, and then converts it back to UTC. If commenting out these two parts as below, the problem is gone:

            ts = pd.to_datetime(items)  #.tz_localize(timezone)

#        ts = ts.tz_convert('UTC')

The workaround works with datetime table column defined in UTC or other timezones.

To Reproduce

from clickhouse_driver import Client
client = Client('localhost', database='default')
client.execute(
"""
    CREATE TABLE IF NOT EXISTS debug_tbl 
    (ts DateTime64(9, 'America/Chicago')) 
    ENGINE=MergeTree() ORDER BY ts
""")

data = pd.DataFrame(
    [
        pd.Timestamp(t, tz='America/Chicago') 
        for t in ['2023-06-01T11:28:05.661537256', '2023-06-01T11:28:06.334573921', '2023-06-01T11:28:07.821988266']
    ], 
columns=['ts']
)

# Insert pandas dataframe with numpy support
client.insert_dataframe('INSERT INTO debug_tbl VALUES', data, settings={'use_numpy': True})

# query the inserted data
client.query_dataframe("select * from debug_tbl", settings={'use_numpy': False})

The returned query results are:

                                ts
0 2023-06-01 16:28:05.661537-05:00
1 2023-06-01 16:28:06.334574-05:00
2 2023-06-01 16:28:07.821988-05:00

But it should return

                                ts
0 2023-06-01 11:28:05.661537-05:00
1 2023-06-01 11:28:06.334574-05:00
2 2023-06-01 11:28:07.821988-05:00

In the command line, clickhouse-client shows the inserted data are not correct:

69aa60b1a9d2 :) select * from debug_tbl

SELECT *
FROM debug_tbl

Query id: 4229d695-9b96-40b3-b4c7-742dd6276cd4

┌────────────────────────────ts─┐
│ 2023-06-01 16:28:05.661537256 │
│ 2023-06-01 16:28:06.334573921 │
│ 2023-06-01 16:28:07.821988266 │
└───────────────────────────────┘

3 rows in set. Elapsed: 0.003 sec. 

Expected behavior The timestamps in the query result above are expected to be the same as those in the source dataframe.

Versions

xzkostyan commented 1 year ago

Is this similar with #257?

0liu commented 1 year ago

Is this similar with #257?

This issue leads to the incorrect data stored in database, while #257 is only about the read back format. With the #288 fix, this issue still exists.

Btw, I don't see #288 changes query results of query_dataframe. With settings={'use_numpy': False}, it always returns timezone aware column (truncated to microseconds), and with settings={'use_numpy': True} it returns naive datatime column of nanoseconds.

xzkostyan commented 1 year ago

Okay.

Does pure insert and select without pandas leads to incorrect data?

0liu commented 1 year ago

Does pure insert and select without pandas leads to incorrect data?

Inserting plain Python datetime seems good.

tz = pytz.timezone('America/Chicago')
plain_data = [
    [tz.localize(datetime.datetime(2023, 6, 1, 11, 28, 5, 661537))],
    [tz.localize(datetime.datetime(2023, 6, 1, 11, 28, 6, 334573))],
    [tz.localize(datetime.datetime(2023, 6, 1, 11, 28, 7, 821988))],
]
client.execute('INSERT INTO debug_tbl (ts) VALUES', plain_data)
69aa60b1a9d2 :) describe table debug_tbl

DESCRIBE TABLE debug_tbl

Query id: 6d8bb04d-fe9c-4931-80e8-3925c26b412a

┌─name─┬─type─────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ ts   │ DateTime64(9, 'America/Chicago') │              │                    │         │                  │                │
└──────┴──────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

1 row in set. Elapsed: 0.002 sec. 

69aa60b1a9d2 :) select * from debug_tbl

SELECT *
FROM debug_tbl

Query id: 619d500e-9a45-469f-8153-5360c3b8d2fe

┌────────────────────────────ts─┐
│ 2023-06-01 11:28:05.661537000 │
│ 2023-06-01 11:28:06.334573000 │
│ 2023-06-01 11:28:07.821988000 │
└───────────────────────────────┘

3 rows in set. Elapsed: 0.003 sec.