lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, with more integrations coming..
https://lancedb.github.io/lance/
Apache License 2.0
3.77k stars 206 forks source link

Support data deletion in a Fragment #895

Open eddyxu opened 1 year ago

eddyxu commented 1 year ago

Problem Statement

Support deletion of rows on a Fragment level. This capability will be used as global Dataset-level deletion or distributed execution of delete operations.

We could expect a method with a signature:


impl Fragment {

   /// Returns a new Fragment representing the data being deleted. 
   ///
   /// If returns None, the full fragment is deleted.
   async fn delete(&self,  filter_expr: &str) -> Optional<Fragment> {
       ...
   }
}

Action items

wjones127 commented 1 year ago

On-disk mechanism

The Fragment.delete method will have four possible outcomes:

Outcome Storage representation
No rows deleted Fragment kept as is
Small number of rows deleted Integer index set
Large number of rows deleted Compressed bitmap
All rows deleted Fragment is dropped

The integer index set is an Arrow array of u32 values representing the indices that have been deleted.

The compressed bitmap is a bitmap where true means the value at that position is deleted. This will be implemented as a serialized Roaring Bitmap. We don't use roaring bitmap for small sets, because they recommend using integer arrays instead for sparse cases.

For now, the threshold at which we switch from integer index to bitmap should be configurable. Then we will be able to tune it in benchmarks to a good default value.

The integer index set or bitmap will be stored as a file:

_deletions/{fragment id}-{deletion id}.{arrow|bin}

where fragment id is the integer id of the fragment, and deletion id is a UUID. The .arrow extension can be used for the integer index, and the .bin extension for the bitmap.

Subsequent deletions need to read the existing bitmap file and bitwise & with the new deletion bitmap.

To track the current bitmap for a fragment, we can store the UUID as an additional field in as well as the type of deletion (bitmap or array):

https://github.com/lancedb/lance/blob/a141d66ddeb26543b06531b088a00a8644873ee7/protos/format.proto#L108-L113

changhiskhan commented 1 year ago

@wjones127 i think this is all completed right? If so, can you please close this ?

wjones127 commented 1 year ago

There are two lingering follow-ups, so I'll close that when those are done.