lmcinnes / pynndescent

A Python nearest neighbor descent for approximate nearest neighbors
BSD 2-Clause "Simplified" License

RAM usage of NNDescent #90

Closed dkobak closed 4 years ago

dkobak commented 4 years ago

I am trying to understand how much RAM is used by NNDescent and what parameters affect the RAM usage.

I've been playing around with a 420,000x50 dataset, trying to construct a kNN graph with k=100 on my laptop with 16 GB of RAM. When I use Annoy (as part of FIt-SNE), it seems to use around ~1 GB of RAM for index construction. Pynndescent seems to use several times more than that, and with some settings it crashes with a memory error.

For example, NNDescent(X, n_neighbors=100, n_jobs=-1) crashes here:

    221     max_heap_update_count = chunk_size * leaf_array.shape[1] * leaf_array.shape[1] * 2
--> 222     heap_updates = np.zeros((n_tasks, max_heap_update_count, 4), dtype=np.float32)

with a

MemoryError: Unable to allocate array with shape (8, 1050000000, 4) and data type float32

I think the 8 in this shape is the number of threads. With n_jobs=1 it runs without an error but feels quite slow (8 minutes). The third dimension appears to be fixed at 4. Is there any way to decrease max_heap_update_count without compromising performance too much?
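For what it's worth, the reported shape can be reproduced with back-of-envelope arithmetic, assuming the leaf size in that buffer equals n_neighbors (an inference from the numbers in the traceback, not from reading the pynndescent source):

```python
# Back-of-envelope check of the failed allocation. The leaf size being
# equal to n_neighbors=100 is an assumption inferred from the reported
# shape (8, 1050000000, 4).
n_points, n_tasks, n_neighbors = 420_000, 8, 100
itemsize = 4  # bytes per float32

chunk_size = -(-n_points // n_tasks)  # ceiling division -> 52500
max_heap_update_count = chunk_size * n_neighbors * n_neighbors * 2
total_bytes = n_tasks * max_heap_update_count * 4 * itemsize

print(max_heap_update_count)  # 1050000000, matching the traceback
print(total_bytes / 1e9)      # ~134 GB, far beyond 16 GB of RAM
```

If this reading is right, the buffer scales quadratically with n_neighbors, which would explain why k=100 blows up while smaller k values fit comfortably.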

Further, NNDescent(X, n_neighbors=100, n_jobs=-1, low_memory=True) crashes with the same error, failing to allocate the same array shape. How exactly does low_memory=True affect RAM usage?

Funnily enough, I did manage to run NNDescent() with n_neighbors=15 and n_jobs=-1, followed by query() with k=100. Together, this took ~2 minutes, so it was 4x faster than NNDescent(X, n_neighbors=100, n_jobs=1). Does it make sense to use this trick to save RAM?

lmcinnes commented 4 years ago

The parallelism obtained through n_jobs is very powerful, but also very memory intensive. The 0.4 release added an alternative multi-threading approach using numba's parallel support. If you set n_jobs=None it should use this instead. This will use less memory (though still potentially quite a lot; I haven't really done much memory-use optimization yet), and the number of threads will be bounded by the value of the environment variable NUMBA_NUM_THREADS, akin to, say, OMP_NUM_THREADS in OpenMP.

dkobak commented 4 years ago

Let me see if I understand it correctly. I get the same performance and thread usage with n_jobs=None as with n_jobs=1 (despite nb.config.NUMBA_NUM_THREADS being set to 8), and the docstring seems to confirm that None defaults to 1:

n_jobs: int or None, optional (default=None)
        The number of parallel jobs to run for neighbors index construction.
        ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
        ``-1`` means using all processors.

Is this correct?

lmcinnes commented 4 years ago

Yes, as far as I know the numba threading currently kicks in (in principle) if you have n_jobs=None or 1. This is perhaps less than ideal, but it is what works for now. The question is whether you actually see any multi-threading. With n_jobs=None I generally see multiple cores operating, although not necessarily saturating, and lower memory usage than with the equivalent n_jobs setting.

dkobak commented 4 years ago

I guess this can be closed for now. Thanks for the explanations, Leland!