lmcinnes / pynndescent

A Python nearest neighbor descent for approximate nearest neighbors
BSD 2-Clause "Simplified" License

Very high memory usage #220

Open fmoralesalcaide opened 1 year ago

fmoralesalcaide commented 1 year ago

Hi, I am experiencing excessive memory utilization with the library:

import numpy as np
import pynndescent
X = np.random.rand(20_000_000, 2) # 20MM 2D vectors

index = pynndescent.NNDescent(X)
index.prepare()

The code above exceeds 30GB of memory on my machine. Surprisingly, other libraries like hnswlib are capable of building the same index in approximately 5GB of memory.
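(For reference, a comparable hnswlib build would look roughly like this; the ef_construction and M values below are typical defaults rather than the exact settings behind the ~5GB figure.)

import hnswlib
import numpy as np

X = np.random.rand(20_000_000, 2).astype(np.float32)

# build an HNSW index over the same 20MM points
hnsw_index = hnswlib.Index(space="l2", dim=2)
hnsw_index.init_index(max_elements=X.shape[0], ef_construction=200, M=16)
hnsw_index.add_items(X)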

Is this a possible bug/memory leak in the library, or is it the expected memory usage? I am testing with version 0.5.10.

roman-bushuiev commented 6 months ago

Hello @lmcinnes! Thank you for the amazing package, the speed and accuracy of the index are impressive! Unfortunately, I also have a memory issue. I am trying to build a 5-NN graph for 70,000,000 1024-dimensional vectors, but it does not fit into 1.75 TB of RAM even with the slowest, most memory-conservative settings.

import pynndescent

pynn_knn = pynndescent.PyNNDescentTransformer(
    metric='cosine', n_neighbors=5, search_epsilon=0.25, n_jobs=1, low_memory=True, verbose=True
).fit_transform(my_vectors)

Could you advise us on how to make PyNNDescent more memory-efficient? For example, would it make sense to decrease the values of the pruning_degree_multiplier and max_candidates parameters? In my use case, an accuracy drop is acceptable. Thank you.
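(To illustrate what I mean, something along these lines, where the specific values are only guesses on my part rather than recommended settings:)

import pynndescent

pynn_knn = pynndescent.PyNNDescentTransformer(
    metric='cosine',
    n_neighbors=5,
    search_epsilon=0.25,
    n_trees=32,                      # fewer random projection trees for the initial graph
    max_candidates=20,               # smaller candidate pool during the descent
    pruning_degree_multiplier=1.0,   # keep fewer edges per node in the search graph
    low_memory=True,
    n_jobs=1,
    verbose=True,
).fit_transform(my_vectors)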

lmcinnes commented 6 months ago

For a dataset that large I think you’ll be much better off using an index that uses quantization such as LanceDB or diskANN.

roman-bushuiev commented 6 months ago

> For a dataset that large I think you’ll be much better off using an index that uses quantization such as LanceDB or diskANN.

Thank you for your reply! In the end, I still used PyNNDescent by first creating two k-NN graphs for the two halves of my data, clustering the nodes with a BFS-based similarity cutoff, and then building the final k-NN graph over the cluster representatives. I found other packages less convenient for creating k-NN graphs, as they typically provide only indexing capabilities, so the graph has to be built by querying every point against the whole index. 🙂
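(A rough sketch of the divide-and-merge idea, in case it is useful to others; cluster_by_bfs_cutoff is a hypothetical placeholder for the BFS-based clustering step, not part of pynndescent, and the cutoff value is arbitrary:)

import numpy as np
import pynndescent

half = len(my_vectors) // 2
representatives = []
for part in (my_vectors[:half], my_vectors[half:]):
    # k-NN graph for one half of the data
    index = pynndescent.NNDescent(part, metric="cosine", n_neighbors=5)
    knn_indices, knn_dists = index.neighbor_graph
    # hypothetical helper: collapse each BFS-connected cluster to one representative vector
    representatives.append(cluster_by_bfs_cutoff(part, knn_indices, knn_dists, cutoff=0.25))

# final k-NN graph over the cluster representatives only
final_index = pynndescent.NNDescent(np.vstack(representatives), metric="cosine", n_neighbors=5)
final_knn_indices, final_knn_dists = final_index.neighbor_graph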

lmcinnes commented 6 months ago

That makes sense if you just want a kNN graph. The process you used is similar to the dask-distributed pynndescent that exists in the distributed branch of this repository. If there is significant call for such an option I might try to finish it up and get it merged into main.

lsorber commented 4 months ago

If NNDescent uses over 30GB for an example 20M x 2 dataset (150MB in single precision), that does seem a bit much indeed. That's 200x more memory than the dataset itself.

lmcinnes commented 4 months ago

I agree that sounds excessive, but there's really nothing in the algorithm that would call for that unless something has gone astray, or memory hasn't been collected yet. Possibly there is a memory leak in the numba-compiled code? It's hard to say for sure, but I haven't personally encountered such a situation.

A last note: if you have 2D data you should be using a KD-tree for exact nearest neighbors.
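(A minimal sketch with scikit-learn's KDTree, assuming the 20M x 2 array from the original report:)

import numpy as np
from sklearn.neighbors import KDTree

X = np.random.rand(20_000_000, 2)

tree = KDTree(X)                   # exact index; memory stays close to the size of the data
dists, inds = tree.query(X, k=10)  # exact 10-nearest-neighbor graph

(Querying all 20M points takes a while, but memory should stay roughly at the size of the data plus the output arrays.)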

lsorber commented 4 months ago

I ran a quick benchmark to profile NNDescent's memory usage:

%load_ext memory_profiler

import numpy as np
from pynndescent import NNDescent

# 1M random 768-dimensional float32 vectors (~2.9 GB)
X = np.random.randn(1_000_000, 768).astype(dtype=np.float32)
print(f"Array memory usage: {round(X.nbytes / 1024 ** 3, 3)} GB")

# peak memory while building the index
%memit NNDescent(X, metric="cosine", compressed=True)

Output:

Array memory usage: 2.861 GB
peak memory: 11293.77 MiB, increment: 8151.94 MiB (in 1m59.9s)

Doesn't look unreasonable to me: NNDescent's peak usage is about 4x the size of the dataset. Profiling was done on an M1 Max with pynndescent v0.5.12 and numba v0.59.1.