Open mjq2020 opened 2 months ago
I found an error in my image loading. Image.open is lazy: it only identifies the file, and the image data is not actually read and decoded until load() is called. After the fix, the test results on SSD are as follows. Lance is still slower. This time, the Lance path also loads the image. The following is the loading code for Lance:
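import io
import lance
from PIL import Image
from tqdm import tqdm

ds = lance.dataset(uri)  # uri: path to the Lance dataset
for i in tqdm(range(118287)):  # one iteration per COCO 2017 training image
    b = ds.take([i]).to_pydict()["image"][0]  # raw image bytes for row i
    im = Image.open(io.BytesIO(b))  # lazy: identifies the file, pixel data not read yet
    im.load()  # force the actual read and decode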
1. Test code
Test data
COCO 2017 training set images, 118,287 in total
Software and hardware information
pyarrow==15.0.0 pydantic==2.7.1 lancedb==0.10.1 pylance==0.14.1 numpy==1.26.3
Test results
When we compared the performance of lancedb with reading files directly by file name, we found that Lance is slower than direct file reading on SSD, while on HDD Lance is faster and the difference is larger. batch_size is set to 16. The following are the relevant screenshots of the test on the two devices (SSD and HDD):
I would like to ask whether this test result is reasonable, because it is slightly different from this test result. I would also like to know what causes it, so that we know in which cases to use lancedb in the future.
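One way to narrow down the cause might be to time the fetch step and the decode step separately, so that storage access is not mixed with image decoding. Below is a minimal sketch of such a harness. The names `time_stages` and `StageTimes` are made up for illustration and are not part of lance or PIL, and the stand-in data at the bottom only exists to make the snippet runnable on its own.

```python
import time
from typing import Callable, Iterable, NamedTuple


class StageTimes(NamedTuple):
    """Accumulated wall-clock time per stage (hypothetical helper type)."""
    count: int
    fetch_s: float
    decode_s: float


def time_stages(indices: Iterable[int],
                fetch: Callable[[int], bytes],
                decode: Callable[[bytes], object]) -> StageTimes:
    """Run fetch(i) then decode(raw) for each index, timing each stage separately."""
    fetch_s = 0.0
    decode_s = 0.0
    count = 0
    for i in indices:
        t0 = time.perf_counter()
        raw = fetch(i)          # storage access only (Lance take or file read)
        t1 = time.perf_counter()
        decode(raw)             # image decode only
        t2 = time.perf_counter()
        fetch_s += t1 - t0
        decode_s += t2 - t1
        count += 1
    return StageTimes(count, fetch_s, decode_s)


if __name__ == "__main__":
    # Stand-in data so the sketch runs without a dataset: 100 fake 1 KiB blobs.
    fake_store = {i: bytes([i % 256]) * 1024 for i in range(100)}
    print(time_stages(range(100), fake_store.__getitem__, len))
```

For the Lance loop in this issue, `fetch` could be `lambda i: ds.take([i]).to_pydict()["image"][0]` and `decode` could be `lambda b: Image.open(io.BytesIO(b)).load()`. For the file-based path, `fetch` would read the file bytes for index `i` instead. If `decode_s` is about the same for both paths, the difference comes from the fetch step.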