lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector indexing, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

Create an `add_columns` task #3138

Open westonpace opened 3 days ago


We've now aligned the interfaces for `Fragment.merge_columns` and `LanceDataset.add_columns`. However, the process of using these APIs is fairly complex and, with features like balanced storage, may be getting even more complex.

I would like to prototype a "task" API similar to what we have for compaction. The basic usage would work like this:

```python
# On head node
add_column_task = dataset.start_add_column(new_col_name, new_col_type)
results = []

# On workers
# (add_column_task can be pickled and sent across the network)
results.append(add_column_task.add_ordinal_data(some_new_data, row_start))
# If you aren't adding a new value for every row you can also do
results.append(add_column_task.add_id_data(some_new_data, row_ids))

# On head node
commit_plan = add_column_task.plan_commit(results)
commit_results = []

# On workers
# (commit tasks can also be pickled)
for commit_task in commit_plan["tasks"]:
    commit_results.append(commit_task.execute())

# On head node
commit_plan.finish_commit(commit_results)
```

The workflow is as follows:

1. The head node starts the task, capturing the new column's name and type.
2. Workers receive the pickled task and write the new column data, either for a contiguous range of rows (`add_ordinal_data`) or for specific row ids (`add_id_data`), returning serializable results.
3. The head node collects the results and plans the commit.
4. Workers execute the pickled commit tasks.
5. The head node finishes the commit with the collected commit results.
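To make the distribute/collect shape of this concrete, here is a minimal sketch of the pattern the proposal relies on: a task object created on the head node must survive a round trip through pickle, and workers must return plain, serializable results. `AddColumnTask` and its methods here are hypothetical stand-ins for illustration, not actual Lance APIs.

```python
import pickle
from dataclasses import dataclass


@dataclass
class AddColumnTask:
    """Hypothetical stand-in for the proposed task object."""

    new_col_name: str
    new_col_type: str

    def add_ordinal_data(self, values, row_start):
        # In the real proposal this would write a partial data file;
        # here we just return a serializable description of the write.
        return {
            "column": self.new_col_name,
            "row_start": row_start,
            "num_rows": len(values),
        }


# Head node: create the task
task = AddColumnTask("score", "float32")

# The task is pickled to be shipped to workers
worker_task = pickle.loads(pickle.dumps(task))

# Worker: write a chunk of the new column and return a result
result = worker_task.add_ordinal_data([0.1, 0.2, 0.3], row_start=0)

# Head node: collect worker results before planning the commit
results = [result]
```

The key design point is that both the task and the worker results are plain picklable values, so any scheduler (Ray, Spark, multiprocessing) can move them between nodes without Lance needing to know about the scheduler.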

Compared to `add_columns` / `merge_columns` this has a few advantages: