pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`read_csv`/`read_database` incorrectly converts naive datetime to UTC before applying timezone from a given schema #18995

Open edwinvehmaanpera opened 1 month ago

edwinvehmaanpera commented 1 month ago

Reproducible example

import io
import polars as pl
csv = r'''ts,ts2
2024-09-25 11:54:17.403535,2024-09-25 11:54:17.403535'''
pl.read_csv(io.StringIO(csv), schema={"ts": pl.Datetime,"ts2": pl.Datetime(time_zone='Europe/Budapest')})
shape: (1, 2)
┌────────────────────────────┬─────────────────────────────────┐
│ ts                         ┆ ts2                             │
│ ---                        ┆ ---                             │
│ datetime[μs]               ┆ datetime[μs, Europe/Budapest]   │
╞════════════════════════════╪═════════════════════════════════╡
│ 2024-09-25 11:54:17.403535 ┆ 2024-09-25 13:54:17.403535 CES… │
└────────────────────────────┴─────────────────────────────────┘

I observed the same behaviour with `read_database`, but I do not have a minimal example for it (I was using psycopg2 with Postgres).

Log output

No response

Issue description

When `read_csv` or `read_database` is called with the `schema` parameter defining a column as a datetime with a time zone, the function treats naive datetime values as UTC. It then converts those UTC values to the specified time zone, which shifts the wall-clock time: in the example above, `11:54:17.403535` becomes `13:54:17.403535` CEST.
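
To make the two interpretations concrete, here is a minimal sketch using only the Python standard library rather than Polars. It assumes the system has time zone data available for `zoneinfo`; the variable names are illustrative only and are not taken from the Polars code base.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# The naive timestamp from the CSV in the reproducible example.
naive = datetime.fromisoformat("2024-09-25 11:54:17.403535")
budapest = ZoneInfo("Europe/Budapest")

# Interpretation 1 (what the report observes): treat the naive value as UTC,
# then convert that instant to the target zone. The wall-clock time shifts.
converted = naive.replace(tzinfo=timezone.utc).astimezone(budapest)

# Interpretation 2 (what the report expects): attach the zone to the
# wall-clock time as written. The wall-clock time is unchanged.
localized = naive.replace(tzinfo=budapest)

print(converted.isoformat())  # 2024-09-25T13:54:17.403535+02:00
print(localized.isoformat())  # 2024-09-25T11:54:17.403535+02:00
```

Both results carry the same UTC offset, but they denote instants two hours apart, which is the discrepancy described here.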

Expected behavior

Naive datetime values should retain their original wall-clock time, and the time zone specified in the schema should be attached without any conversion. This is how `str.to_datetime` already behaves:

pl.Series("ts", ["2024-09-25T11:54:17.403535"]).str.to_datetime(time_zone='Europe/Budapest')
#shape: (1,)
#Series: 'ts' [datetime[μs, Europe/Budapest]]
#[
#   2024-09-25 11:54:17.403535 CEST
#]

Installed versions

```
--------Version info---------
Polars:              1.8.2
Index type:          UInt32
Platform:            Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:              3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_manager
altair
cloudpickle
connectorx
deltalake
fastexcel
fsspec
gevent
great_tables
matplotlib
nest_asyncio         1.6.0
numpy
openpyxl
pandas
pyarrow
pydantic
pyiceberg
sqlalchemy
torch
xlsx2csv
xlsxwriter
```

LAMagicx commented 1 month ago

I'd like to start contributing to open source and was wondering whether this is a doable first issue. If so, could I give it a go?