Open 0liu opened 1 year ago
Is this similar with #257?
Is this similar with #257?
This issue leads to the incorrect data stored in database, while #257 is only about the read back format. With the #288 fix, this issue still exists.
Btw, I don't see #288 changes query results of query_dataframe
. With settings={'use_numpy': False}
, it always returns timezone aware column (truncated to microseconds), and with settings={'use_numpy': True}
it returns naive datatime column of nanoseconds.
Okay.
Does pure insert and select without pandas leads to incorrect data?
Does pure insert and select without pandas leads to incorrect data?
Inserting plain Python datetime seems good.
tz = pytz.timezone('America/Chicago')
plain_data = [
[tz.localize(datetime.datetime(2023, 6, 1, 11, 28, 5, 661537))],
[tz.localize(datetime.datetime(2023, 6, 1, 11, 28, 6, 334573))],
[tz.localize(datetime.datetime(2023, 6, 1, 11, 28, 7, 821988))],
]
client.execute('INSERT INTO debug_tbl (ts) VALUES', plain_data)
69aa60b1a9d2 :) describe table debug_tbl
DESCRIBE TABLE debug_tbl
Query id: 6d8bb04d-fe9c-4931-80e8-3925c26b412a
┌─name─┬─type─────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ ts │ DateTime64(9, 'America/Chicago') │ │ │ │ │ │
└──────┴──────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘
1 row in set. Elapsed: 0.002 sec.
69aa60b1a9d2 :) select * from debug_tbl
SELECT *
FROM debug_tbl
Query id: 619d500e-9a45-469f-8153-5360c3b8d2fe
┌────────────────────────────ts─┐
│ 2023-06-01 11:28:05.661537000 │
│ 2023-06-01 11:28:06.334573000 │
│ 2023-06-01 11:28:07.821988000 │
└───────────────────────────────┘
3 rows in set. Elapsed: 0.003 sec.
Describe the bug When inserting a pandas dataframe to a Clickhouse table, and a DateTime column in the table is defined with a timezone other than UTC, timezone-aware datatime columns in the pandas dataframe will be incorrectly localized and inserted as UTC time. The actual timestamps stored in the database are shifted from the original timestamps.
This could be due to the incorrect timezone localization and conversion in apply_timezones_before_write
The
items
returned fromblock.get_column_by_index
should be UNIX timestamps and are always in UTC, butpd.to_datetime(items).tz_localize(timezone)
localizes the timestamp to the timezone defined in the table column ('America/Chicago' in the example below) incorrectly, and then converts it back to UTC. If commenting out these two parts as below, the problem is gone:The workaround works with datetime table column defined in UTC or other timezones.
To Reproduce
The returned query results are:
But it should return
In the command line, clickhouse-client shows the inserted data are not correct:
Expected behavior The timestamps in the query result above are expected to be the same as those in the source dataframe.
Versions