KukumavMozolo opened 5 months ago
When trying to serialize the model to disk using joblib, memory consumption increases to 106 GB and then the process crashes, in my case because the hard disk was full:
joblib.dump(umap,"umap.pcl")
126 pickler.file_handle.write(padding)
128 for chunk in pickler.np.nditer(array,
129 flags=['external_loop',
130 'buffered',
131 'zerosize_ok'],
132 buffersize=buffersize,
133 order=self.order):
--> 134 pickler.file_handle.write(chunk.tobytes('C'))
OSError: [Errno 28] No space left on device
On the filesystem umap.pcl was 67 GB. Could it be that the input csr_matrix gets converted to a dense matrix for some reason during serialization?
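One quick sanity check for that suspicion (a sketch, assuming `X` is the csr_matrix passed to fit; the name is a placeholder) would be to compare the sparse payload with what a dense float32 copy would occupy:

```python
import numpy as np

# X: the scipy.sparse.csr_matrix fed into UMAP (placeholder name)
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
dense_bytes = X.shape[0] * X.shape[1] * np.dtype(np.float32).itemsize
print(f"sparse: {sparse_bytes / 1e9:.1f} GB, dense: {dense_bytes / 1e9:.1f} GB")
```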
So apparently, when serializing, joblib ends up calling this function:
rp_trees.py 1549 convert_tree_format
and there the following line raises an error:
hyperplanes = np.zeros((n_nodes, 2, hyperplane_dim), dtype=np.float32)
numpy.core._exceptions._ArrayMemoryError:
Here hyperplane_dim seems to be the same as my dataset dimension, and since that is over a million the _ArrayMemoryError is thrown: a float32 array of shape (n_nodes, 2, hyperplane_dim) needs 8 * n_nodes * hyperplane_dim bytes, so with hyperplane_dim above a million even a few thousand tree nodes already amount to tens of gigabytes.

Is there a way to prevent this, e.g. by writing custom save and load methods? My use case is that I need to call the transform method on unseen data. I suspect umap uses pynndescent for high-dimensional sparse data. Maybe I could just store the inputs and the learned embeddings from umap, load these into pynndescent on the remote machine, and use that instead of umap? Would that work with sparse data, and could you give some pointers on how to do that as faithfully to umap as possible? A rough sketch of what I have in mind follows.
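This is only a sketch of the workaround I mean, not what umap.transform actually does: it places each unseen point at a distance-weighted average of its nearest training points' embeddings instead of running UMAP's optimisation. `X_train` (the sparse training matrix), `embedding` (the saved reducer.embedding_ array) and `X_new` are placeholder names.

```python
import numpy as np
from pynndescent import NNDescent

# X_train: the scipy.sparse.csr_matrix the UMAP model was fitted on (placeholder)
# embedding: the corresponding reducer.embedding_ array, stored alongside it
index = NNDescent(X_train, metric="cosine", n_neighbors=15)
index.prepare()  # build the search structures before querying

def approx_transform(X_new, k=15):
    # nearest training points for each unseen row (pynndescent accepts sparse input)
    neighbor_idx, neighbor_dist = index.query(X_new, k=k)
    # distance-weighted average of the neighbours' embedding coordinates;
    # this only approximates what umap.transform would produce
    weights = 1.0 / (neighbor_dist + 1e-8)
    weights /= weights.sum(axis=1, keepdims=True)
    return np.einsum("nk,nkd->nd", weights, embedding[neighbor_idx])
```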
Hi! I am trying to serialize a trained umap model with pickle.dumps. Unfortunately something is going wrong: memory explodes from 5 GB to over 252 GB, and for some reason output is printed while executing
io_bytes_array_data = dumps(umap)
and the whole thing crashes as it exceeds my memory. Apparently some code is executed while pickle does its thing that probably should not happen.
I managed to create a minimal example that also generates this kind of output when using pickle.dumps. However, it does not blow up the memory, since that probably also depends on the size of the matrix fed into umap. It only happens when the approximation algorithm is run.
In my real use case I am feeding a scipy.sparse.csr_matrix into umap.
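The minimal example looks roughly like this (a sketch with placeholder sizes; my real matrix has over a million columns, and force_approximation_algorithm=True is only there so that even small data takes the pynndescent code path):

```python
import pickle
import numpy as np
import scipy.sparse as sp
import umap

# placeholder data: high-dimensional, very sparse CSR matrix
X = sp.random(2000, 50_000, density=1e-4, format="csr",
              dtype=np.float32, random_state=42)

# force the approximate nearest-neighbor (pynndescent) path even on small data
reducer = umap.UMAP(force_approximation_algorithm=True)
reducer.fit(X)

# this is the call that produces the unexpected output / extra work
blob = pickle.dumps(reducer)
print(f"pickled size: {len(blob) / 1e6:.1f} MB")
```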