Dataframe: access to dotted field names like SQL

lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, and PyTorch with more integrations coming.

https://lancedb.github.io/lance/

Apache License 2.0


Dataframe: access to dotted field names like SQL #1738

Closed: pchalasani closed this issue 10 months ago

pchalasani commented 11 months ago

In the where clause, a SQL query can access arbitrarily nested fields, but this is not possible with the dataframe. For example, I want to be able to do:

df = tbl.search().to_pandas()

and see all nested fields of the schema as top-level columns in df. Among other things, this would enable pandas queries like df.query(...).
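To make the request concrete, here is a minimal sketch of the desired end state done by hand in pandas. It assumes a struct column arrives in pandas as one dict per row; the `id` and `meta` columns and their values are hypothetical stand-ins for the result of `tbl.search().to_pandas()`, not anything from the Lance or LanceDB API.

```python
import pandas as pd

# Hypothetical stand-in for tbl.search().to_pandas():
# the "meta" column holds one dict per row (assumed layout).
df = pd.DataFrame({
    "id": [1, 2, 3],
    "meta": [
        {"lang": "en", "score": 0.9},
        {"lang": "fr", "score": 0.4},
        {"lang": "en", "score": 0.7},
    ],
})

# Expand the dicts into top-level columns named "meta.lang" and "meta.score".
expanded = pd.json_normalize(df["meta"].tolist()).add_prefix("meta.")
expanded.index = df.index  # keep row alignment with the original frame
flat = pd.concat([df.drop(columns="meta"), expanded], axis=1)

# Column names containing a dot must be backtick-quoted inside query().
hits = flat.query("`meta.lang` == 'en' and `meta.score` > 0.8")
print(hits)
```

The backticks matter: without them, pandas would parse `meta.lang` as attribute access rather than as a column name.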

wjones127 commented 11 months ago

Hmm, if I could, I would add a flatten parameter to to_pandas(), but we don't control that method (it's in PyArrow).

Other DataFrames do have decent support for nested columns, such as Polars. So I don't think flattening in general is what we want.

Perhaps we can provide a helpful snippet to teach users how to flatten a nested column? IIRC it's just something like:

df.assign(nested = lambda df: [x['key'] for x in df['struct']])

changhiskhan commented 11 months ago

@pchalasani I think we can do this in the LanceDB repo instead of at the format level (please see the referencing PR)

pchalasani commented 11 months ago

Nice, thanks, I seem to be conflating the two repos in my mind 😀

wjones127 commented 11 months ago

For future reference for Lance users, you can write:

dataset.to_table(...).flatten().to_pandas()

If you have multiple levels of nested fields, you may need to call flatten() multiple times.

Maybe I can make this a tip in the user guide?

pchalasani commented 11 months ago

Yes, an example in the user guide would help, thanks.