pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

"Unable to parse time zone" when reading a Parquet File with Time Zone "utc" (lower case) #17547

Closed Nicolas-SB closed 2 months ago

Nicolas-SB commented 3 months ago


Reproducible example

```python
from datetime import datetime

import pyarrow as pa
import pyarrow.parquet as pq

# Write two otherwise identical files: one with the time zone spelled "UTC",
# one with the lowercase "utc".
t_CAPS = pa.Table.from_pylist(
    [{"time": datetime(2024, 6, 6)}],
    schema=pa.schema([pa.field("time", pa.timestamp("us", "UTC"))]),
)
pq.write_table(t_CAPS, "temp_CAPS.parquet")

t = pa.Table.from_pylist(
    [{"time": datetime(2024, 6, 6)}],
    schema=pa.schema([pa.field("time", pa.timestamp("us", "utc"))]),
)
pq.write_table(t, "temp.parquet")

# This all works:

pq.read_table("temp_CAPS.parquet")
pq.read_table("temp.parquet")

import pandas as pd
pd.read_parquet("temp_CAPS.parquet")
pd.read_parquet("temp.parquet")

import polars as pl
pl.read_parquet("temp_CAPS.parquet")

# This does not work:

pl.read_parquet("temp.parquet")
```

Log output

ComputeError                              Traceback (most recent call last)
Cell In[1], line 14
     12 import polars as pl
     13 pl.read_parquet("temp_CAPS.parquet")
---> 14 pl.read_parquet("temp.parquet")

File ~/miniconda3/envs/eda/lib/python3.10/site-packages/polars/_utils/deprecation.py:135, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    130 @wraps(function)
    131 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
    132     _rename_keyword_argument(
    133         old_name, new_name, kwargs, function.__name__, version
    134     )
--> 135     return function(*args, **kwargs)

File ~/miniconda3/envs/eda/lib/python3.10/site-packages/polars/_utils/deprecation.py:135, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    130 @wraps(function)
    131 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
    132     _rename_keyword_argument(
    133         old_name, new_name, kwargs, function.__name__, version
    134     )
--> 135     return function(*args, **kwargs)

File ~/miniconda3/envs/eda/lib/python3.10/site-packages/polars/io/parquet/functions.py:195, in read_parquet(source, columns, n_rows, row_index_name, row_index_offset, parallel, use_statistics, hive_partitioning, glob, hive_schema, rechunk, low_memory, storage_options, retries, use_pyarrow, pyarrow_options, memory_map)
    192         columns = [lf.columns[i] for i in columns]
    193     lf = lf.select(columns)
--> 195 return lf.collect()

File ~/miniconda3/envs/eda/lib/python3.10/site-packages/polars/lazyframe/frame.py:1967, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
   1964 # Only for testing purposes atm.
   1965 callback = _kwargs.get("post_opt_callback")
-> 1967 return wrap_df(ldf.collect(callback))

ComputeError: unable to parse time zone: 'utc'. Please check the Time Zone Database for a list of available time zones

Issue description

We have a Parquet file that was written by pyarrow with a timestamp column that carries a time zone. On the pyarrow side this column has the type `pa.timestamp("us", "utc")`. The file can be read by pyarrow and pandas without problems, but polars throws the "unable to parse time zone" exception.

Our assumption is that `validate_time_zone`, called from `_try_from_arrow_unchecked`, does not recognize the lowercase "utc" and only accepts the uppercase "UTC".
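Until that is fixed, a minimal workaround sketch (assuming the only problem is the lowercase spelling stored in the Arrow schema) is to read the file with pyarrow, cast the column to the canonical "UTC" spelling, and hand the table to polars:

```python
import pyarrow as pa
import pyarrow.parquet as pq
import polars as pl

# Read the offending file with pyarrow, which accepts the lowercase "utc".
table = pq.read_table("temp.parquet")

# Cast to the same timestamp type with the canonical "UTC" spelling; this only
# rewrites the schema metadata, the stored values are unchanged.
fixed_schema = pa.schema([pa.field("time", pa.timestamp("us", "UTC"))])
df = pl.from_arrow(table.cast(fixed_schema))
```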

Expected behavior

Since pyarrow accepts lowercase "utc", we expect polars to be able to read it as well.
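In Python terms, the expected handling might look like the sketch below. This is purely illustrative: the actual validation lives in polars' Rust code, and `normalize_time_zone` is a hypothetical helper, not a polars API.

```python
# Hypothetical, illustrative normalization: treat any casing of "utc" as the
# canonical "UTC" before the regular tz-database lookup.
def normalize_time_zone(tz: str) -> str:
    return "UTC" if tz.lower() == "utc" else tz

assert normalize_time_zone("utc") == "UTC"
assert normalize_time_zone("UTC") == "UTC"
assert normalize_time_zone("Europe/Berlin") == "Europe/Berlin"
```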

Installed versions

```
--------Version info---------
Polars:               0.20.31
Index type:           UInt32
Platform:             Linux-4.4.0-22621-Microsoft-x86_64-with-glibc2.35
Python:               3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0]
----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:               2024.3.1
gevent:               24.2.1
hvplot:
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:
pandas:               2.2.1
pyarrow:              15.0.2
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:
```
deanm0000 commented 3 months ago

@MarcoGorelli this one seems like it's right up your alley. @Nicolas-SB just a small nit: pandas doesn't have a native Parquet reader; it uses pyarrow under the hood, so your pandas tests just repeat the pyarrow ones.
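For example (illustrative only), pinning the engine explicitly shows that pandas goes through the same pyarrow reader:

```python
import pandas as pd

# pandas dispatches Parquet reads to an engine; with pyarrow installed this is
# effectively pa.parquet.read_table("temp.parquet").to_pandas().
pd.read_parquet("temp.parquet", engine="pyarrow")
```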

nikita-balyschew-db commented 3 months ago

+1

jstet commented 3 months ago

Hey everyone, may I try to fix this? :)

MarcoGorelli commented 3 months ago

sure @jstet go ahead