laminlabs / lamindb

A data framework for biology.
https://docs.lamin.ai
Apache License 2.0

MappedCollection is leaking memory #1814

Open agemagician opened 4 weeks ago

agemagician commented 4 weeks ago

Hello,

Using MappedCollection, I am trying to load a large h5ad file that doesn't fit into memory. It works as expected at first and doesn't load the whole file into memory. However, once I start reading the data row by row, memory usage keeps growing until it roughly matches the file size.

I expect MappedCollection to read each row and free the associated memory once the row is deleted, but that is not what happens.

How can we solve this issue?

Code:

import lamindb
from tqdm import tqdm

# open the large h5ad file lazily via MappedCollection
lam_db = lamindb.core.MappedCollection(
    ["file.h5ad"],
    parallel=False,
)

# read one row of X at a time and drop the reference immediately
for idx in tqdm(range(lam_db.shape[0])):
    row = lam_db[idx]["X"]
    del row
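
For reference, a minimal sketch of how the memory growth can be observed while iterating, reusing lam_db and tqdm from the snippet above; psutil is my assumption here and not part of lamindb:

import os
import psutil

proc = psutil.Process(os.getpid())

for idx in tqdm(range(lam_db.shape[0])):
    row = lam_db[idx]["X"]
    del row
    if idx % 100_000 == 0:
        # print resident set size in GiB to see whether it keeps growing
        print(f"RSS: {proc.memory_info().rss / 1024**3:.2f} GiB")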

Koncopd commented 4 weeks ago

Hi @agemagician, thanks for reporting this. I haven't generally observed this behavior; I was able to sample from very big collections of files, but I haven't really explored loading from one very large file. I need to investigate. Is adata.X sparse or dense? If it is dense, do you know the chunk sizes on .X?

import h5py
file = h5py.File("file.h5ad", mode="r")
# chunk shape of the dense X dataset (None means contiguous storage)
print(file["X"].chunks)
file.close()
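
To also check whether X is stored sparse or dense on disk, a quick sketch (assuming the standard AnnData on-disk layout, where a sparse X is an HDF5 group with data/indices/indptr and a dense X is a plain dataset):

import h5py

with h5py.File("file.h5ad", mode="r") as file:
    X = file["X"]
    if isinstance(X, h5py.Group):
        # sparse matrices are stored as a group of data/indices/indptr datasets
        print("sparse X, encoding attrs:", dict(X.attrs))
    else:
        # dense matrices are a single dataset with an optional chunk shape
        print("dense X, chunks:", X.chunks, "dtype:", X.dtype)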

I suspect this might happen if the chunk size is big. Since you are loading all chunks consecutively by iterating over the indices one by one, maybe garbage collection doesn't happen for these chunks in time. Does it also happen if you try random access?
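
If you want to try the random-access test, a rough sketch reusing the lam_db object from your snippet (the periodic gc.collect() call is just a suggestion to rule out delayed garbage collection):

import gc
import numpy as np

rng = np.random.default_rng(0)
# visit rows in a random order instead of sequentially
for i, idx in enumerate(rng.permutation(lam_db.shape[0])):
    row = lam_db[int(idx)]["X"]
    del row
    if i % 10_000 == 0:
        # force a collection to check whether delayed garbage collection explains the growth
        gc.collect()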