lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, a vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

Ability to duplicate data files? #2038

Closed · billnye2 closed 8 months ago

billnye2 commented 8 months ago

Hi, great project!

For testing, I'm creating 1 billion dummy rows, and I'd like to duplicate that data several times to test performance at larger scales (e.g. 10 billion rows). So far I'm just using a producer that yields PyArrow record batches. I can see in the filesystem that Lance progressively writes to a data file, and 1 billion rows comes out to 3.8 GB. Is there a way to simply duplicate these data files and have the dataset pick them up, or some API to do so? Duplicating at the file level is fast, just a few seconds, but writing a new file for each 1 billion rows is slow.

Thanks!
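
(For context, a minimal sketch of the producer pattern described above, assuming a toy two-column schema; the generator and column names are made up for illustration.)

import pyarrow as pa
import lance

schema = pa.schema([("id", pa.int64()), ("value", pa.float32())])

def make_batches(n_batches, rows_per_batch):
    # Yield dummy record batches one at a time so the full dataset
    # never has to fit in memory.
    for i in range(n_batches):
        ids = pa.array(range(i * rows_per_batch, (i + 1) * rows_per_batch), pa.int64())
        values = pa.array([0.0] * rows_per_batch, pa.float32())
        yield pa.record_batch([ids, values], schema=schema)

# Wrap the generator in a RecordBatchReader so Lance can stream it to disk.
reader = pa.RecordBatchReader.from_batches(schema, make_batches(1000, 1_000_000))
lance.write_dataset(reader, "dummy.lance", schema=schema)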

westonpace commented 8 months ago

If you don't mind dipping into advanced APIs, you can probably achieve this with the fragment API:

1. Create a fragment: https://lancedb.github.io/lance/api/python/lance.html#lance.fragment.LanceFragment.create_from_file

2. Add those fragments to the dataset: https://lancedb.github.io/lance/api/python/lance.html#lance.dataset.LanceDataset.commit
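
(Roughly, a sketch of how those two steps might fit together; the paths, the copied file name, the fragment id, and the exact commit call are assumptions, not tested.)

import shutil
import lance
from lance.fragment import LanceFragment

uri = "my_dataset.lance"  # hypothetical dataset path
ds = lance.dataset(uri)

# Step 1: duplicate an existing data file at the filesystem level (fast),
# then register the copy as a new fragment. File names are illustrative.
shutil.copy(f"{uri}/data/original-file.lance", f"{uri}/data/duplicate-file.lance")
fragment = LanceFragment.create_from_file(
    "duplicate-file.lance",  # file name within the dataset's data directory
    ds.schema,
    fragment_id=1,  # assumed: must not collide with existing fragment ids
)

# Step 2: commit the full fragment list. Overwrite replaces the dataset's
# fragments, so include the existing ones plus the new copy.
fragments = [f.metadata for f in ds.get_fragments()] + [fragment]
op = lance.LanceOperation.Overwrite(ds.schema, fragments)
lance.LanceDataset.commit(uri, op)

(As the next comment shows, the return type of create_from_file can trip up that last step.)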

billnye2 commented 8 months ago

Hmm, I thought that would work, but with this code:

import lance

# schema here is the PyArrow schema of the data file being registered
fragment = lance.fragment.LanceFragment.create_from_file(
    "b69cc771-20e9-4039-9cc6-c99c240dcbf4.lance",
    schema,
    fragment_id=0,
)
operation = lance.LanceOperation.Overwrite(schema, [fragment])

I get this error:

TypeError: fragments must be list[FragmentMetadata], got <class 'lance._FragmentMetadata'>

The error occurs on the line creating the LanceOperation.

The documentation for create_from_file says it should return a LanceFragment, but it looks like it actually returns a _FragmentMetadata.

billnye2 commented 8 months ago

I was able to figure it out. Thanks for your help again!
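
(For anyone landing here later: the thread doesn't say what the fix was, but one plausible workaround is to wrap the internal object in the public FragmentMetadata wrapper before building the operation. The JSON round trip below is an assumption about this version's internals, not confirmed by the thread, and the dataset path is hypothetical.)

import lance
from lance.fragment import FragmentMetadata

# Assumed: _FragmentMetadata serializes to JSON and the public wrapper
# can be rebuilt from it. Method names may differ by lance version.
metadata = FragmentMetadata.from_json(fragment.json())
operation = lance.LanceOperation.Overwrite(schema, [metadata])
lance.LanceDataset.commit("my_dataset.lance", operation)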