benmayersohn opened 5 days ago
I think what you are describing is maybe `Enum` and not `Categorical`?
- `Enum` considers the categories "fixed" and "part of the data" (i.e. the list of categories is important and constant).
- `Categorical` considers the categories "flexible" and "just a tool to compress data". As a result, if a batch doesn't need some categories (because they are never referenced), we don't store them. It's also possible for two batches to order their categories differently.
We don't have support for `Enum` in Lance, but it wouldn't be too difficult to add, I think.
Unfortunately, Arrow does not have any distinction between `Categorical` and `Enum`, and Polars is not wrapping `Enum` as "an extension type on top of categorical". As a result, the two arrays look identical when they are converted to Arrow and passed to us:
```python
import polars as pl
import numpy as np

num_categories = 1000
num_rows = 10_000
categories = [str(x) for x in range(num_categories)]

df = pl.DataFrame({'a': np.random.randint(0, num_categories, num_rows)})

df = df.with_columns(pl.col('a').cast(pl.String).cast(pl.Enum(categories)))
print(df.to_arrow().schema.field(0))
# pyarrow.Field<a: dictionary<values=large_string, indices=uint32, ordered=0>>

df = df.with_columns(pl.col('a').cast(pl.String).cast(pl.Categorical()))
print(df.to_arrow().schema.field(0))
# pyarrow.Field<a: dictionary<values=large_string, indices=uint32, ordered=0>>
```
For now, I think we could offer a top-level flag in `write_dataset` which controls whether we use "enum style" or "categorical style" for storing arrays. Would that work for you? Or are you using both `Categorical` and `Enum` in your application and need them preserved?

Also, I don't actually know what Polars will do on the conversion back to Polars. I.e., when converting from Arrow to Polars, is there some way to flag that dictionary data should be considered "enum" vs. "categorical"? Maybe you can provide a schema when converting from Arrow to Polars?
Are you using Lance's `to_polars`? Or are you converting to Polars yourself?
Thanks for the response! I'm only using `Enum`, not `Categorical` (I meant categorical in a general sense - sorry for the ambiguity).

A flag sounds good! I didn't know Lance had a `to_polars` method - I use `pl.from_arrow` after calling `ds.take` on the Lance dataset. For now I'll try providing an explicit schema when converting to Polars and see if that works.
Great. Just to be clear, there's still work we'll need to do on our side (adding the flag and making sure we don't throw away levels) in addition to providing a schema. I'll try and find some time to get to it this week.
Great - thanks so much for your hard work on this excellent project!
I noticed a strange issue when trying to use `take` to load a subset of rows from a Lance v2 dataset. The dataset has categorical columns. In `polars` these are represented as an `int32 -> large_string` mapping, but since large-string dictionaries aren't currently supported in v2 (https://github.com/lancedb/lance/issues/2828), I convert the columns to `int32 -> string` after converting the `polars` dataframe to an `arrow` table. Then I save it as a Lance v2 dataset.

When I load one of these categorical columns in its entirety via `ds.to_table` and convert to `polars` via `pl.from_arrow`, the categorical column looks fine. But for certain subsets of rows, I end up with an incorrect number of levels. Here is a reproducible example:

I don't encounter this issue when I save the dataset in v1 format:
Any help would be appreciated. Thanks!