pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Cannot merge DeltaTable when predicate is Decimal #20009

Open ponychicken opened 3 days ago

ponychicken commented 3 days ago

Reproducible example

import polars as pl
from datetime import date, datetime

# Define schema
schema = {
    "timestamp": pl.Datetime(time_unit="us", time_zone="UTC"),
    "date": pl.Date,
    "lon": pl.Decimal(precision=9, scale=6),
    "lat": pl.Decimal(precision=9, scale=6),
    "altitude": pl.Decimal(precision=6, scale=1),
    "course": pl.Decimal(precision=4, scale=1),
    "heading": pl.Decimal(precision=4, scale=1),
    "speed": pl.Decimal(precision=4, scale=1),
    "name": pl.String,
    "s3_key": pl.String,
}

# Create sample data
data = {
    "timestamp": [datetime(2024, 3, 20, 12, 30, 0)],
    "date": [date(2024, 3, 20)],
    "lon": [122.123456],
    "lat": [41.987654],
    "altitude": [150.5],
    "course": [45.5],
    "heading": [90.0],
    "speed": [12.5],
    "name": ["NAME"],
    "s3_key": ["s3"],
}

# Create DataFrame with sample data and schema
df = pl.DataFrame(data, schema=schema)

# Write DataFrame as Delta table
df.write_delta("B")

# Write again as a merge: update rows that match the predicate, insert the rest
df.write_delta(
    "B",
    mode="merge",
    delta_merge_options={
        "predicate": """
        t.timestamp = s.timestamp 
        AND t.lat = s.lat 
        AND t.lon = s.lon
    """,
        "source_alias": "s",
        "target_alias": "t",
    }
).when_matched_update_all().when_not_matched_insert_all().execute()

Log output

Traceback (most recent call last):
  File "<stdin>", line 13, in <module>
  File ".venv/lib/python3.12/site-packages/deltalake/table.py", line 1800, in execute
    metrics = self._table.merge_execute(self._builder)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_internal.DeltaError: Generic DeltaTable error: Unable to convert expression to string

Issue description

If I change the predicate to compare only the timestamp, or only a string column, the merge succeeds. It fails only when the predicate compares Decimal columns (here `lat` and `lon`).
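As a stopgap, the predicate can be limited to non-Decimal key columns. The sketch below builds the same `delta_merge_options` dict used in the reproducible example from a list of key columns. The helper name `merge_options` is hypothetical and not part of Polars or deltalake; it only assembles strings.

```python
def merge_options(key_columns, source_alias="s", target_alias="t"):
    """Build a delta_merge_options dict that matches rows on key_columns.

    Hypothetical helper for illustration; it is not a Polars or deltalake API.
    """
    predicate = " AND ".join(
        f"{target_alias}.{col} = {source_alias}.{col}" for col in key_columns
    )
    return {
        "predicate": predicate,
        "source_alias": source_alias,
        "target_alias": target_alias,
    }


# Use only the non-Decimal key; per the report above, this predicate succeeds.
options = merge_options(["timestamp"])
print(options["predicate"])  # t.timestamp = s.timestamp
```

The resulting dict would be passed as `delta_merge_options=options` in the `df.write_delta("B", mode="merge", ...)` call from the example. Note that dropping `lat` and `lon` changes the matching semantics: rows that share a timestamp but differ in position would be treated as the same row.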

Expected behavior

The merge write should succeed.

Installed versions

--------Version info---------
Polars:              1.15.0
Index type:          UInt32
Platform:            Linux-6.11
Python:              3.12.7
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager
altair
boto3                1.35.69
cloudpickle
connectorx
deltalake            0.21.0
fastexcel
fsspec               2024.10.0
gevent
google.auth
great_tables
matplotlib
nest_asyncio
numpy                2.1.3
openpyxl             3.1.5
pandas               2.2.3
pyarrow              18.1.0
pydantic             2.9.2
pyiceberg
sqlalchemy           2.0.36
torch
xlsx2csv
xlsxwriter

ponychicken commented 3 days ago

This is probably an upstream bug: https://github.com/delta-io/delta-rs/issues/3033

ion-elgreco commented 16 hours ago

It seems we are missing a match arm for decimal that would allow round-tripping through the log.