If you don't mind dipping into advanced APIs, you can probably achieve this with the fragment API:

Create a fragment:
https://lancedb.github.io/lance/api/python/lance.html#lance.fragment.LanceFragment.create_from_file

Add those fragments to the dataset:
https://lancedb.github.io/lance/api/python/lance.html#lance.dataset.LanceDataset.commit
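A minimal sketch of that two-step flow, assuming the copied data file already sits in the dataset's data/ directory; the paths, schema, and fragment_id below are placeholders, and exact signatures may differ across Lance versions:

import shutil

import lance
import pyarrow as pa
from lance.fragment import LanceFragment

schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])  # placeholder

# Duplicate an existing data file at the filesystem level (fast).
shutil.copy(
    "dataset.lance/data/b69cc771-20e9-4039-9cc6-c99c240dcbf4.lance",
    "dataset.lance/data/copy-of-b69cc771.lance",
)

# Register the copied file as a new fragment ...
fragment = LanceFragment.create_from_file(
    "copy-of-b69cc771.lance",
    schema,
    fragment_id=1,  # must not collide with an existing fragment id
)

# ... then commit the fragment list as a new dataset version.
operation = lance.LanceOperation.Overwrite(schema, [fragment])
dataset = lance.LanceDataset.commit("dataset.lance", operation)

Note that Overwrite replaces the dataset's fragment list wholesale; lance.LanceOperation.Append also exists for adding fragments on top of an existing version, which may fit the duplication use case better.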
Hmm, I thought that would work, but with this code:
import lance

# schema is the pyarrow schema of the data file, defined earlier
fragment = lance.fragment.LanceFragment.create_from_file(
    "b69cc771-20e9-4039-9cc6-c99c240dcbf4.lance",
    schema,
    fragment_id=0,
)
operation = lance.LanceOperation.Overwrite(schema, [fragment])
I get this error:
TypeError: fragments must be list[FragmentMetadata], got <class 'lance._FragmentMetadata'>
The error occurs on the line creating the LanceOperation. The create_from_file function's documentation says it should return a LanceFragment, but it looks like it's actually returning a _FragmentMetadata.
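If it helps anyone landing here later: one plausible workaround, until create_from_file returns the public wrapper type, is to round-trip the raw metadata through JSON. This is a guess rather than the confirmed fix, and FragmentMetadata.from_json and the raw object's json() method are assumptions about the pylance version in use:

from lance.fragment import FragmentMetadata

# Hypothetical workaround: wrap the internal _FragmentMetadata in the
# public FragmentMetadata type via its JSON representation.
raw = lance.fragment.LanceFragment.create_from_file(
    "b69cc771-20e9-4039-9cc6-c99c240dcbf4.lance",
    schema,
    fragment_id=0,
)
fragment = FragmentMetadata.from_json(raw.json())  # assumed conversion path
operation = lance.LanceOperation.Overwrite(schema, [fragment])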
I was able to figure it out. Thanks for your help again!
Hi, great project!
For testing, I am creating 1 billion dummy rows, and I want to duplicate that data several times to test performance (e.g. on 10 billion rows). So far I am just using a producer that yields pyarrow record batches. I see in the filesystem that Lance progressively writes to a data file, so 1 billion rows comes out to 3.8 GB. Is there a way to simply duplicate these data files and have the dataset pick them up, or some API to do so? Duplicating at the file level is fast, taking just a few seconds, but writing a new file for each 1 billion rows is slow.
Thanks!
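For context, the producer approach described above, yielding pyarrow record batches and streaming them into lance.write_dataset, looks roughly like this; the schema, column contents, and batch size are placeholders:

import lance
import pyarrow as pa

schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])  # placeholder

def produce_batches(total_rows, batch_size=1_000_000):
    # Yield dummy RecordBatches until total_rows have been produced.
    for start in range(0, total_rows, batch_size):
        n = min(batch_size, total_rows - start)
        yield pa.record_batch(
            [
                pa.array(range(start, start + n), type=pa.int64()),
                pa.array([0.5] * n, type=pa.float64()),
            ],
            schema=schema,
        )

# Stream batches into a Lance dataset without materializing them in memory.
reader = pa.RecordBatchReader.from_batches(schema, produce_batches(1_000_000_000))
lance.write_dataset(reader, "dummy_dataset.lance")

This write path is what gets slow at the 10-billion-row scale, which is what motivates the file-duplication approach in the reply above.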