lancedb / lancedb

Developer-friendly, serverless vector database for AI applications. Easily add long-term memory to your LLM apps!
https://lancedb.github.io/lancedb/
Apache License 2.0

bug(python): null values do not preserve after write and read #1325

Open mutecamel opened 5 months ago

mutecamel commented 5 months ago

LanceDB version

0.8.0

What happened?

My dataset has null values. I expect to retrieve the same dataset from LanceDB storage, but the nulls come back filled with default values (0 for Int64, 1970-01-01 for Date). This also occurs in Lance, so I will report it upstream as well.

Are there known steps to reproduce?

from datetime import date
import shutil
import polars as pl
import lancedb

path = "./tmp_db"
shutil.rmtree(path, ignore_errors=True)  # start clean; no error if the directory is missing

df = pl.DataFrame(
    [
        pl.Series("a", ["foo", None], pl.String),
        pl.Series("b", [None, 42], pl.Int64),
        pl.Series("c", [date(2024, 5, 26), None], pl.Date),
    ]
)
print(f"{df=}")

db = lancedb.connect(path)
tb = db.create_table("tmp_tb", df)
d1 = tb.to_polars().collect()
print(f"{d1=}")
assert df.frame_equal(d1)

df=shape: (2, 3)
┌──────┬──────┬────────────┐
│ a    ┆ b    ┆ c          │
│ ---  ┆ ---  ┆ ---        │
│ str  ┆ i64  ┆ date       │
╞══════╪══════╪════════════╡
│ foo  ┆ null ┆ 2024-05-26 │
│ null ┆ 42   ┆ null       │
└──────┴──────┴────────────┘
d1=shape: (2, 3)
┌──────┬─────┬────────────┐
│ a    ┆ b   ┆ c          │
│ ---  ┆ --- ┆ ---        │
│ str  ┆ i64 ┆ date       │
╞══════╪═════╪════════════╡
│ foo  ┆ 0   ┆ 2024-05-26 │    <- 0 should be null
│ null ┆ 42  ┆ 1970-01-01 │    <- 1970-01-01 should be null
└──────┴─────┴────────────┘
Traceback (most recent call last):
  File "/Users/qiang/repo/edb/bug_report.py", line 22, in <module>
    assert df.frame_equal(d1)
AssertionError

changhiskhan commented 5 months ago

Hey, null support for numeric columns has not shipped yet. It's enabled in the new experimental writer, but we just tested it and there still seems to be a bug. We'll make a bug-fix release soon, and soon after that we aim to make the new writer the default.

changhiskhan commented 5 months ago

If you need a workaround in the meantime, would using a sentinel value be possible for now?

mutecamel commented 5 months ago

Yes. For now, as a workaround, I generate boolean mask columns that indicate which values are null. Thank you for your hard work!