KukumavMozolo opened 5 months ago
When trying to serialize the model to disk using joblib, memory consumption increases to 106 GB and then the process crashes, in my case because the hard disk was full:
joblib.dump(umap,"umap.pcl")
126 pickler.file_handle.write(padding)
128 for chunk in pickler.np.nditer(array,
129 flags=['external_loop',
130 'buffered',
131 'zerosize_ok'],
132 buffersize=buffersize,
133 order=self.order):
--> 134 pickler.file_handle.write(chunk.tobytes('C'))
OSError: [Errno 28] No space left on device
On the filesystem umap.pcl was 67 GB. Could it be that the input csr_matrix gets converted to a dense matrix for some reason during serialization?
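One quick sanity check for that suspicion (a sketch, assuming `X` is the csr_matrix passed to fit; the name is a placeholder) would be to compare the sparse payload with what a dense float32 copy would occupy:

```python
import numpy as np

# X: the scipy.sparse.csr_matrix fed into UMAP (placeholder name)
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
dense_bytes = X.shape[0] * X.shape[1] * np.dtype(np.float32).itemsize
print(f"sparse: {sparse_bytes / 1e9:.1f} GB, dense: {dense_bytes / 1e9:.1f} GB")
```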
So apparently, when serializing, joblib ends up calling this function:
rp_trees.py 1549 convert_tree_format
and there the following line raises an error:
hyperplanes = np.zeros((n_nodes, 2, hyperplane_dim), dtype=np.float32)
numpy.core._exceptions._ArrayMemoryError:
Here hyperplane_dim seems to be the same as my dataset dimension, and since that is over a million the _ArrayMemoryError is thrown: a float32 array of shape (n_nodes, 2, hyperplane_dim) needs 8 * n_nodes * hyperplane_dim bytes, so with hyperplane_dim above a million even a few thousand tree nodes already amount to tens of gigabytes.

Is there a way to prevent this, e.g. by writing custom save and load methods? My use case is that I need to call the transform method on unseen data. I suspect umap uses pynndescent for high-dimensional sparse data. Maybe I could just store the inputs and the learned embeddings from umap, load these into pynndescent on the remote machine, and use that instead of umap? Would that work with sparse data, and could you give some pointers on how to do that as faithfully to umap as possible? A rough sketch of what I have in mind follows.
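This is only a sketch of the workaround I mean, not what umap.transform actually does: it places each unseen point at a distance-weighted average of its nearest training points' embeddings instead of running UMAP's optimisation. `X_train` (the sparse training matrix), `embedding` (the saved reducer.embedding_ array) and `X_new` are placeholder names.

```python
import numpy as np
from pynndescent import NNDescent

# X_train: the scipy.sparse.csr_matrix the UMAP model was fitted on (placeholder)
# embedding: the corresponding reducer.embedding_ array, stored alongside it
index = NNDescent(X_train, metric="cosine", n_neighbors=15)
index.prepare()  # build the search structures before querying

def approx_transform(X_new, k=15):
    # nearest training points for each unseen row (pynndescent accepts sparse input)
    neighbor_idx, neighbor_dist = index.query(X_new, k=k)
    # distance-weighted average of the neighbours' embedding coordinates;
    # this only approximates what umap.transform would produce
    weights = 1.0 / (neighbor_dist + 1e-8)
    weights /= weights.sum(axis=1, keepdims=True)
    return np.einsum("nk,nkd->nd", weights, embedding[neighbor_idx])
```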
Hi! I am trying to serialize a trained umap model with pickle.dumps. Unfortunately something is going wrong: memory explodes from 5 GB to over 252 GB, and for some reason output is printed while executing
io_bytes_array_data = dumps(umap)
and the whole thing crashes as it exceeds my memory. Apparently some code is executed while pickle does its thing that probably should not happen.
I managed to create a minimal example that also generates this kind of output when using pickle.dumps. However, it does not blow up the memory, since that probably also depends on the size of the matrix fed into umap. It only happens when the approximation algorithm is run.
In my real use case I am feeding a scipy.sparse.csr_matrix into umap.
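The minimal example looks roughly like this (a sketch with placeholder sizes; my real matrix has over a million columns, and force_approximation_algorithm=True is only there so that even small data takes the pynndescent code path):

```python
import pickle
import numpy as np
import scipy.sparse as sp
import umap

# placeholder data: high-dimensional, very sparse CSR matrix
X = sp.random(2000, 50_000, density=1e-4, format="csr",
              dtype=np.float32, random_state=42)

# force the approximate nearest-neighbor (pynndescent) path even on small data
reducer = umap.UMAP(force_approximation_algorithm=True)
reducer.fit(X)

# this is the call that produces the unexpected output / extra work
blob = pickle.dumps(reducer)
print(f"pickled size: {len(blob) / 1e6:.1f} MB")
```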