monarch-initiative / embiggen

🍇 Embiggen is the Python Graph Representation learning, Prediction and Evaluation submodule of the GRAPE library.
BSD 3-Clause "New" or "Revised" License

MemoryError with GraphVisualizer #267

Closed caufieldjh closed 2 years ago

caufieldjh commented 2 years ago

With grape 0.1.0, loading a graph of 10.80M heterogeneous nodes and 30.45M heterogeneous edges works as expected, but calling visualizer.fit_and_plot_all fails with a MemoryError:

>>> g.remove_disconnected_nodes()
>>> embedding = model.fit_transform(g)
>>> visualizer = GraphVisualizer(g)
>>> visualizer.fit_and_plot_all(embedding.get_node_embedding_from_index(0))
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
Input In [20], in <cell line: 1>()
----> 1 visualizer.fit_and_plot_all(embedding.get_node_embedding_from_index(0))

File ~/kg-env/lib/python3.8/site-packages/embiggen/visualizations/graph_visualizer.py:4150, in GraphVisualizer.fit_and_plot_all(self, node_embedding, number_of_columns, show_letters, include_distribution_plots, skip_constant_metrics, **node_embedding_kwargs)
   4121 def fit_and_plot_all(
   4122     self,
   4123     node_embedding: Union[pd.DataFrame, np.ndarray, str],
   (...)
   4128     **node_embedding_kwargs: Dict
   4129 ) -> Tuple[Figure, Axes]:
   4130     """Fits and plots all available features of the graph.
   4131 
   4132     Parameters
   (...)
   4148         Kwargs to be forwarded to the node embedding algorithm.
   4149     """
-> 4150     node_embedding = self._get_node_embedding(
   4151         node_embedding,
   4152         **node_embedding_kwargs
   4153     )
   4154     self.fit_nodes(node_embedding, **node_embedding_kwargs)
   4155     self.fit_negative_and_positive_edges(
   4156         node_embedding, **node_embedding_kwargs)
...
--> 241 result = np.asarray(values, dtype=dtype)
    243 if issubclass(result.dtype.type, str):
    244     result = np.asarray(values, dtype=object)

MemoryError: Unable to allocate 39.1 GiB for an array with shape (10804190,) and data type <U971

I'm not sure what happens on a machine where this much memory is actually available.
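For reference, the 39.1 GiB in the error message can be reproduced from the reported shape and dtype alone. The sketch below only redoes that arithmetic with NumPy. It does not call embiggen, and the toy strings are made up for illustration.

```python
import numpy as np

# Shape and dtype copied from the MemoryError message above.
n_nodes = 10_804_190
dtype = np.dtype("<U971")

# NumPy stores fixed-width Unicode as 4 bytes per character, so every
# element is padded to the width of the longest string (971 characters).
bytes_per_item = dtype.itemsize
total_gib = n_nodes * bytes_per_item / 2**30
print(f"{bytes_per_item} bytes per item, {total_gib:.1f} GiB in total")

# Toy illustration (made-up strings): one long value sets the width for all.
toy = np.asarray(["a", "b" * 971])
print(toy.dtype, toy.nbytes)
```

In other words, a single 971-character string among the 10.8M values is enough to make every element cost 3884 bytes in a fixed-width string array.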

LucaCappelletti94 commented 2 years ago

Hello @caufieldjh, could you try again with the current develop branch? I should have reduced the peak memory requirements. Also, the GraphVisualizer now accepts an EmbeddingResult as input, so you no longer need to call the get_node_embedding_from_index method.

caufieldjh commented 2 years ago

Hi @LucaCappelletti94 - I tried the current develop version of embiggen with some SPINE embeddings on the same graph, just to get to the visualization stage faster. This time, it ran out of memory in the middle instead of complaining during loading:

>>> visualizer.fit_and_plot_all(embedding.get_node_embedding_from_index(0))
/home/harry/kg-env/lib/python3.8/site-packages/sklearn/manifold/_t_sne.py:790: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.
  warnings.warn(
/home/harry/kg-env/lib/python3.8/site-packages/sklearn/manifold/_t_sne.py:790: FutureWarning: The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.
  warnings.warn(
Killed

I'm still using the get_node_embedding_from_index(0) as this is what happens when I don't:

>>> visualizer.fit_and_plot_all(embedding)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/harry/kg-env/lib/python3.8/site-packages/embiggen/visualizations/graph_visualizer.py", line 4151, in fit_and_plot_all
    node_embedding = self._get_node_embedding(
  File "/home/harry/kg-env/lib/python3.8/site-packages/embiggen/visualizations/graph_visualizer.py", line 676, in _get_node_embedding
    self._node_embedding_method_name = self.automatically_detect_node_embedding_method(
  File "/home/harry/kg-env/lib/python3.8/site-packages/embiggen/visualizations/graph_visualizer.py", line 826, in automatically_detect_node_embedding_method
    if node_embedding.dtype == "uint8" and node_embedding.min() == 0:
AttributeError: 'EmbeddingResult' object has no attribute 'dtype'

LucaCappelletti94 commented 2 years ago

Ok, thanks! I'm fixing the second one. Could you please re-run the first call and hit the stop button when you see the memory start to climb, so that we can easily identify where that happens?

caufieldjh commented 2 years ago

Stopping early from the command line produces no output, and in a notebook the kernel just dies.
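One possible way to still see where the process is when it gets killed is to have Python dump its stack periodically with the standard-library faulthandler module. This is only a sketch and was not used in this thread. The helper name run_with_periodic_tracebacks and the 30-second interval are hypothetical choices.

```python
import faulthandler
import sys
import time


def run_with_periodic_tracebacks(func, interval_seconds=30.0, file=sys.stderr):
    """Run func() while dumping the Python stack every interval_seconds.

    The dumps are written straight to the file descriptor of `file`, so the
    last one survives even if the process is later killed by the OS.
    """
    faulthandler.dump_traceback_later(interval_seconds, repeat=True, file=file)
    try:
        return func()
    finally:
        # Stop the watchdog once func() has returned or raised.
        faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    # Stand-in workload; in the scenario above this would be the
    # visualizer.fit_and_plot_all(...) call.
    run_with_periodic_tracebacks(lambda: time.sleep(0.5), interval_seconds=0.2)
```

The last stack printed before "Killed" would then point at the call that was running while memory was climbing. It shows Python frames only, so time spent inside compiled code appears as the Python line that called into it.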

LucaCappelletti94 commented 2 years ago

Hi Harry, can we try to iterate on this with the new version?

caufieldjh commented 2 years ago

Sure! With grape-0.1.10 (ensmallen-0.8.8 and embiggen-0.11.18), repeating the same process as above (though with DegreeSPINE and calling visualizer.fit_and_plot_all(embedding)) works perfectly. Thanks!

LucaCappelletti94 commented 2 years ago

Perfect!