pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Reading delta table fails due to failing to cast time variable according to PyArrow parquet #16969

Open Dekermanjian opened 5 months ago

Dekermanjian commented 5 months ago

Reproducible example

I believe this occurs when a timestamp column is saved with nanosecond (`ns`) precision. On read, PyArrow tries to cast it to microseconds (`us`) and raises an exception because the cast would lose data. I also believe PyArrow accepts arguments that coerce the timestamp; see https://github.com/apache/arrow/issues/1920.

Log output

ArrowInvalid: Casting from timestamp[ns] to timestamp[us, tz=UTC] would lose data:

Issue description

Is there currently a way to get around this issue?

Expected behavior

I expect to be able to pass an argument through to PyArrow that coerces the timestamp field.

Installed versions

```
--------Version info---------
Polars: 0.20.31
Index type: UInt32
Platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.31
Python: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0]
----Optional dependencies----
adbc_driver_manager:
cloudpickle: 2.2.1
connectorx:
deltalake: 0.18.0
fastexcel:
fsspec: 2023.5.0
gevent:
hvplot:
matplotlib: 3.8.4
nest_asyncio: 1.6.0
numpy: 1.26.4
openpyxl: 3.1.3
pandas: 2.2.2
pyarrow: 15.0.2
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
torch:
xlsx2csv: 0.8.2
xlsxwriter:
```

Dekermanjian commented 3 months ago

I am still having a hard time with this issue, and it seems others are experiencing it too. There is an open issue on delta-rs: https://github.com/delta-io/delta-rs/issues/2593. Would resolving the delta-rs issue also resolve this one?

hwanii0329 commented 3 weeks ago

When I scan a Delta Lake table, I use an option like this:

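```python
import polars as pl

# file_path is the path (or URI) of the Delta table.
df = pl.scan_delta(
    file_path,
    # Forwarded to delta-rs; coerces INT96 timestamps to milliseconds.
    pyarrow_options={"parquet_read_options": {"coerce_int96_timestamp_unit": "ms"}},
)
```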

The `pyarrow_options` parameter is passed on to delta-rs's `to_pyarrow_dataset`, so that function's documentation shows which options are available for coercing the timestamp. I hope this helps.