rafguns / linkpred

Easy link prediction tool

How to do link prediction on a large graph without 'MemoryError' #24

Open dbvdb opened 5 years ago

dbvdb commented 5 years ago

Hi, I am using this library. What I really want to do is load a large graph and use SimRank for link prediction, but I get the following error:

Traceback (most recent call last):
  File "pred.py", line 12, in <module>
    results = model.predict(c=0.4)
  File "/home/danial/Envs/graph/lib/python3.6/site-packages/linkpred/predictors/base.py", line 64, in predict_and_postprocess
    scoresheet = func(*args, **kwargs)
  File "/home/danial/Envs/graph/lib/python3.6/site-packages/linkpred/predictors/eigenvector.py", line 88, in predict
    sim = simrank(self.G, nodelist, c, num_iterations, weight)
  File "/home/danial/Envs/graph/lib/python3.6/site-packages/linkpred/network/algorithms.py", line 73, in simrank
    M = raw_google_matrix(G, nodelist=nodelist, weight=weight)
  File "/home/danial/Envs/graph/lib/python3.6/site-packages/linkpred/network/algorithms.py", line 87, in raw_google_matrix
    weight=weight)
  File "/home/danial/Envs/graph/lib/python3.6/site-packages/networkx/convert_matrix.py", line 369, in to_numpy_matrix
    M = np.zeros((nlen,nlen), dtype=dtype, order=order) + np.nan
MemoryError

Could you help me?

rafguns commented 5 years ago

What version of networkx is this? Please make sure you have version 2.1 or more recent, and check whether that solves the problem. The current code in networkx seems more memory-efficient.
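One way to double-check which networkx Python is actually importing (the version on `sys.path` can differ from what `pip list` reports in another environment) is something like:

```python
import networkx

# Version and location of the networkx module that this interpreter imports;
# if the path points at an unexpected virtualenv, an old version may be in use.
print(networkx.__version__)
print(networkx.__file__)
```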

What could be done in linkpred is to let raw_google_matrix return a scipy sparse matrix. That should trim down the memory usage of this part of the code. I can imagine, however, that very large graphs will remain hard to process, mainly because linkpred uses networkx data structures for graphs, which are implemented in pure Python.
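A sparse variant might look roughly like the sketch below. This is not linkpred's actual implementation: the function name `raw_google_matrix_sparse` and the details (a plain row-stochastic transition matrix, no damping) are assumptions for illustration. Memory then grows with the number of edges rather than with n², since only nonzero entries are stored.

```python
import networkx as nx
import numpy as np
from scipy import sparse


def raw_google_matrix_sparse(G, nodelist=None, weight="weight"):
    """Sketch: row-stochastic transition matrix as a scipy CSR matrix.

    Requires networkx >= 2.7 for to_scipy_sparse_array; older versions
    had to_scipy_sparse_matrix instead.
    """
    A = nx.to_scipy_sparse_array(G, nodelist=nodelist, weight=weight, format="csr")
    out_degree = np.asarray(A.sum(axis=1)).ravel()
    # Avoid division by zero for dangling nodes (no out-edges);
    # their rows simply stay all-zero.
    out_degree[out_degree == 0] = 1.0
    D_inv = sparse.diags(1.0 / out_degree)
    return D_inv @ A  # each row sums to 1 (or 0 for dangling nodes)


G = nx.karate_club_graph()
M = raw_google_matrix_sparse(G)
print(M.shape)  # (34, 34)
```

For a graph with a million nodes, the dense numpy version would need roughly n² × 8 bytes (terabytes), while the CSR version only stores the edges, which is why the switch matters here.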

dbvdb commented 5 years ago

I used networkx==2.3 and linkpred==0.4.1

rafguns commented 5 years ago

Are you sure about the networkx version? I ask because this line in your stack trace

  File "/home/danial/Envs/graph/lib/python3.6/site-packages/networkx/convert_matrix.py", line 369, in to_numpy_matrix
    M = np.zeros((nlen,nlen), dtype=dtype, order=order) + np.nan

does not occur in current versions of networkx (it changed in networkx/networkx@8de67c08).

But this point is a bit moot because it will probably not fundamentally change the issue of memory usage. See my previous comment for some thoughts on that.