Open Rxy-J opened 6 days ago
Looking back at the change logs, I'm not sure which change it would have been. I'm not aware of any memory leaks in Lance.
One thing you might be seeing is the file metadata cache being filled as you read more data.
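To illustrate the difference between a filling cache and a true leak (this is a generic, stdlib-only sketch, not Lance's actual cache implementation): a bounded cache grows while it fills and then plateaus at its capacity, which can look like a leak early on.

```python
from functools import lru_cache

# Hypothetical stand-in for a file metadata cache: memory grows while
# the cache fills, then plateaus once `maxsize` entries are resident.
@lru_cache(maxsize=1024)
def load_file_metadata(file_id: int) -> bytes:
    # Pretend each file's metadata is ~1 KB.
    return bytes(1024)

# Touch far more entries than the cache can hold.
for i in range(10_000):
    load_file_metadata(i)

# The cache holds at most 1024 entries, not 10_000.
print(load_file_metadata.cache_info().currsize)
```

If memory keeps growing without bound after the cache should have filled, that points at a genuine leak rather than cache warm-up.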
If you are curious to try to show a memory leak, I have instructions for a memory debugging tool called bytehound here: https://github.com/lancedb/lance/issues/2768#issuecomment-2303090222 (you have to open the collapsed "Self-contained reproduction" tab).
I tried to reproduce with the following and did not see any leak:
```python
import os
import shutil

import lance
import psutil
import pyarrow as pa

# Start from a fresh dataset with ~1M rows.
shutil.rmtree("/tmp/my_dataset", ignore_errors=True)
tab = pa.table({"x": range(1024 * 1024)})
lance.write_dataset(tab, "/tmp/my_dataset")
del tab

if __name__ == "__main__":
    ds = lance.dataset("/tmp/my_dataset")
    p = psutil.Process(os.getpid())
    while True:
        for i in range(ds.count_rows()):
            ds.take([i])
            if i % 1000 == 0:
                print(p.memory_full_info())
```
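For Python-level allocations, the standard library's `tracemalloc` is a lighter-weight complement to bytehound (bytehound also sees native allocations, which `tracemalloc` cannot). A minimal sketch, using a simulated leak in place of the `ds.take` loop above:

```python
import tracemalloc

tracemalloc.start()
snapshot1 = tracemalloc.take_snapshot()

# Simulated Python-level leak; in the repro you would instead run
# the ds.take(...) loop between the two snapshots.
leak = []
for _ in range(1000):
    leak.append(bytearray(1024))

snapshot2 = tracemalloc.take_snapshot()

# Show which source lines grew the most between snapshots.
top = snapshot2.compare_to(snapshot1, "lineno")
for stat in top[:3]:
    print(stat)
```

If the growth shows up here, it is a Python-side leak; if RSS grows but `tracemalloc` sees nothing, the allocations are happening in native code and a tool like bytehound is needed.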
I did a simple test with bytehound. The results so far are consistent with the behavior described above. I tested versions 0.18.0 and 0.19.2. Different Python versions did not seem to affect the results, so the Python version was not kept strictly consistent between tests. For each version, I repeatedly read the same data 10,000 times while traversing the dataset (reading only the first 10,000 records). The conclusions are as follows:
The dataset contains one million records.
Here is the bytehound log file:
While training with Lance, I noticed that memory usage kept going up. After testing, I found that memory usage gradually increased as data was read; even when reading the same data over and over, memory usage still grew. On 0.18.0, reading 10k records increased memory by about 50 MB. After updating to 0.19.2, this was reduced to about 100 KB. I didn't see any related entries in the recent changelogs, so I'm curious what the specific cause of this memory leak is and why it wasn't completely fixed.
I tested it with the following code.