[bug] Writing a column of type list with nulls results in the nulls being replaced with []

lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, and PyTorch with more integrations coming.

https://lancedb.github.io/lance/

Apache License 2.0

[bug] Writing a column of type list with nulls results in the nulls being replaced with [] #1946

Closed: mkleinbort-ic closed this issue 4 months ago

mkleinbort-ic commented 9 months ago

Writing a table with a column of type list[int] containing nulls results in the nulls being filled in with []


import polars as pl
import lance

df_test_before = pl.DataFrame({
    'x': [None, [1, 2, 3], []]
})

shape: (3, 1)
┌───────────┐
│ x         │
│ ---       │
│ list[i64] │
╞═══════════╡
│ null      │
│ [1, 2, 3] │
│ []        │
└───────────┘

ds = lance.write_dataset(df_test_before, 'df_test.lance', mode='overwrite')
df_test_after = pl.from_arrow(ds.to_table())

shape: (3, 1)
┌───────────┐
│ x         │
│ ---       │
│ list[i64] │
╞═══════════╡
│ []        │
│ [1, 2, 3] │
│ []        │
└───────────┘
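
For anyone checking whether their own data is affected, the before/after comparison above can be automated. The following is a minimal sketch in plain Python; the helper name `collapsed_null_rows` is hypothetical (it is not part of lance or Polars), and it assumes the column values have already been pulled into Python lists, for example with Polars' `Series.to_list()`.

```python
from typing import List, Optional, Sequence

Row = Optional[List[int]]


def collapsed_null_rows(before: Sequence[Row], after: Sequence[Row]) -> List[int]:
    """Return the row indices where a null list in `before` came back as [] in `after`.

    Hypothetical helper for illustration only; not part of any library.
    """
    if len(before) != len(after):
        raise ValueError("row counts differ; cannot compare row by row")
    return [
        i
        for i, (b, a) in enumerate(zip(before, after))
        if b is None and a == []
    ]


# Values copied from the two tables in this report.
before = [None, [1, 2, 3], []]
after = [[], [1, 2, 3], []]

print(collapsed_null_rows(before, after))  # [0]
```

With the data frames from this report, the inputs would be `df_test_before['x'].to_list()` and `df_test_after['x'].to_list()`. A genuinely empty list (row 2 here) is not flagged, since it was already `[]` before the write.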

changhiskhan commented 9 months ago

@westonpace is currently working on null support for the plain encoder. I would expect this to land in a week or so. @westonpace is there extra work required to support nulls in list types?

westonpace commented 9 months ago

:cold_sweat: I don't know about a week or so. I hope the encoders and an MVP version of the v2 file writer will land in a week or so. However, I think there is still some work to do before everything percolates up to the top-level APIs (the new format still needs to be integrated with the scanner, etc.). Maybe the end of the month is more realistic for when users can start using these features.

> @westonpace is there extra work required to support nulls in list types?

From the user perspective or from a development perspective?

Users shouldn't have to do anything. Once they upgrade Lance to the appropriate version, it should just support writing nulls. Any old files written with the old format will still read nulls back as empty lists; there is no way to recover them.

westonpace commented 9 months ago

https://github.com/lancedb/lance/issues/1929 is the tracking issue for the new format version

mkleinbort commented 9 months ago

Thank you both, I'll keep a close eye on this. Keen to migrate to lance, pending this fix.

mkleinbort commented 8 months ago

How is this coming along? I see there is a lot to do in the writer V2 issue.

mkleinbort commented 5 months ago

Do you have an estimate for this feature? I'm about to kick off some refactoring next month and would love to move to lance as part of it, but I'm waiting on this at the moment.

wjones127 commented 5 months ago

The V2 format is in beta right now. I think if you want nullability it's a good time to try it out and migrate. More compressive encodings are coming soon.

mkleinbort-ic commented 5 months ago

I don't think this is working at the moment (0.12.1):

import polars as pl 
import lance

df_test_before = pl.DataFrame({
    'x': [None, [1,2,3], []]
})

lance.write_dataset(df_test_before, 'df_test.lance', mode='overwrite', use_legacy_format=False)

>>> PanicException: not yet implemented: Implement encoding for field Field(id=0, name=x, type=large_list, children=[Field(id=1, name=item, type=int64), ])

wjones127 commented 5 months ago

Hmm it might just be that we have it for list (what PyArrow defaults to) and not large list (what Polars defaults to). We should probably implement large list as well.

mkleinbort-ic commented 4 months ago

This seems to be fixed - closing the issue.