lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, and PyTorch, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

[Just a question] Memory leak in reading data #3128

Open Rxy-J opened 6 days ago

Rxy-J commented 6 days ago

While training with lance, I noticed that memory usage kept going up. After some testing, I found that memory usage gradually increases as data is read, even when I read the same data over and over. On 0.18.0, reading 10k records increased memory by about 50MB; after updating to 0.19.2, the increase dropped to about 100KB. I didn't see anything related in the recent changelogs, so I'm curious what is behind this memory growth and why it wasn't completely fixed.

I tested it with the following code.

import os
import psutil
import lance

if __name__ == "__main__":
    ds = lance.dataset("path_to_lance_set")
    p = psutil.Process(os.getpid())
    for i in range(ds.count_rows()):
        ds.take([i])
        if i % 100 == 0:
            print(p.memory_full_info())
wjones127 commented 3 days ago

Looking back at the change logs, I'm not sure which change it would have been. I'm not aware of any memory leaks in Lance.

One thing you might be seeing is the file metadata cache being filled as you read more data.
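
One rough way to tell cache growth apart from a leak is sketched below. It is only an illustration, not code from this thread: it assumes the metadata cache is held by the LanceDataset object (so dropping and re-opening the dataset lets it be collected), the path is a placeholder, and the pass sizes are arbitrary. If the growth is per-dataset cache rather than a leak, the second pass should grow noticeably less.

import gc
import os

import lance
import psutil

p = psutil.Process(os.getpid())


def rss_mb() -> float:
    return p.memory_full_info().rss / (1024 * 1024)


# Pass 1: reuse a single dataset object for every read.
ds = lance.dataset("path_to_lance_set")  # placeholder path
start = rss_mb()
for i in range(10_000):
    ds.take([i])
print(f"reused dataset: +{rss_mb() - start:.1f} MB")

# Pass 2: re-open the dataset periodically so the previous object (and,
# assuming the metadata cache is held by it, its cache) can be collected.
start = rss_mb()
for i in range(10_000):
    if i % 1_000 == 0:
        ds = lance.dataset("path_to_lance_set")  # placeholder path
        gc.collect()
    ds.take([i])
print(f"re-opened dataset: +{rss_mb() - start:.1f} MB")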

If you are curious to try to show a memory leak, I have instructions for a memory debugging tool called bytehound here: https://github.com/lancedb/lance/issues/2768#issuecomment-2303090222 (Have to open the collapsed "Self-contained reproduction" tab).

westonpace commented 3 days ago

I tried to reproduce with the following and did not see any leak:

import os
import psutil
import lance
import pyarrow as pa
import shutil

shutil.rmtree("/tmp/my_dataset", ignore_errors=True)

tab = pa.table({
    "x": range(1024 * 1024)
})
lance.write_dataset(tab, "/tmp/my_dataset")
del tab

if __name__ == "__main__":
    ds = lance.dataset("/tmp/my_dataset")
    p = psutil.Process(os.getpid())
    while True:
        for i in range(ds.count_rows()):
            ds.take([i])
            if i % 1000 == 0:
                print(p.memory_full_info())
Rxy-J commented 1 day ago

> Looking back at the change logs, I'm not sure which change it would have been. I'm not aware of any memory leaks in Lance.
>
> One thing you might be seeing is the file metadata cache being filled as you read more data.
>
> If you are curious to try to show a memory leak, I have instructions for a memory debugging tool called bytehound here: #2768 (comment) (Have to open the collapsed "Self-contained reproduction" tab).

I did a simple test with bytehound, and the results so far are consistent with the behavior described above. I tested versions 0.18.0 and 0.19.2. The Python version did not seem to affect the results, so I did not keep it strictly consistent between runs. Each version was tested in two ways: repeatedly reading the same data 10,000 times, and traversing the dataset (reading only the first 10,000 records); the two loops are sketched below. The conclusions are as follows:

  1. The same lance version behaves basically the same across the different read tests.
  2. The memory usage growth rate of 0.18.0 is indeed significantly higher than that of 0.19.2, but memory grows in both.

The dataset contains one million records.
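
For reference, the two test loops looked roughly like the sketch below. This is a reconstruction rather than the exact script: the dataset path is a placeholder, reading row 0 stands in for "the same data", and the reporting interval is arbitrary.

import os

import lance
import psutil

ds = lance.dataset("path_to_lance_set")  # placeholder path
p = psutil.Process(os.getpid())

# Test 1: read the same data 10,000 times.
for i in range(10_000):
    ds.take([0])
    if i % 1_000 == 0:
        print("same-data pass:", p.memory_full_info())

# Test 2: traverse the dataset, reading only the first 10,000 records.
for i in range(10_000):
    ds.take([i])
    if i % 1_000 == 0:
        print("traversal pass:", p.memory_full_info())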

Here is the bytehound log file.