Closed danielverd closed 11 months ago
Thanks for the issue and question @danielverd, and great to hear "the speedup is immense"!
Take one: NodeMap
is a MutableMapping
, so you can use it like a dict
such as clus[key]
and for key, val in clus.items()
. You can also call dict(clus)
to convert it to a dict
.
Remark: NodeMap
also has .vector
attribute that is the underlying GraphBLAS vector, and it has .key_to_id
attribute that may be a dict that maps the vector index to the node labels.
Take two: it looks like we ought to improve the repr of NodeMap
so the keys and values (and dict-like behavior) are more obvious.
Take three: more generally, I guess we ought to improve our documentation including providing examples.
Finally, I'm working very actively on improving dispatching in NetworkX and implementing backends, so please don't be shy about reporting issues, recommending algorithms, and general user-experience suggestions :)
I'll report anything I can find.
I'm currently working on getting a publication finalized, but once that's over I could help out with this project since it's helped save me so much compute time lol.
One final question, do you have certainty that the outputs of square_clustering are identical to networkx? I've gotten a situation where the output becomes negative which shouldn't be possible when averaging out positive numbers (but that could be an issue outside of square_clustering).
One final question, do you have certainty that the outputs of square_clustering are identical to networkx? I've gotten a situation where the output becomes negative which shouldn't be possible when averaging out positive numbers (but that could be an issue outside of square_clustering).
This library uses the networkx test suite to run the tests so it should be identical. If you do end up finding a corner case please do open an issue.
Yes, a reproducible example would help.
Sounds great @danielverd! Good luck on the publication.
Does your graph have self-edges? As MridulS said, we do run against (and pass) networkx tests, but I don't think networkx tests square clustering with self-edges, and this is the only way I could think to get negative results (or maybe integer overflow?). How large is the graph, and what's the largest degree of a node?
yeah, i've been messing around with it. [The issue might be that any zeroes come back as integers.] this doesn't seem to be the issue
I'll keep checking myself because I'm not entirely sure I can share the data I used that led to this issue. Sorry about that, but I'll see what I can find.
How large is the graph, and what's the largest degree of a node?
260587 edges, 132922 nodes, and the highest degree is 42047.
Perfect thanks.
I'm looking at the implementation more closely, and I think we should probably specify the dtype when using e.g. plus_pair
semiring in algorithms. @jim22k was smart enough to do this in k_truss
. Otherwise, the return dtype of plus_pair
is determined by the inputs, which may be e.g. INT8
, which may not behave well. Gonna try this out, and I'll push a bugfix release if necessary.
Fantastic. Are there any workarounds you suppose I could try in the meantime?
Are there any workarounds you suppose I could try in the meantime?
Sure. How are you using graphblas-algorithms
? Do you start with a networkx Graph or a graphblas Matrix, and are you calling from networkx?
G
, calling from networkxnx.square_clustering(G, backend="graphblas") # no workaround needed (I hope)
A
, calling from networkxnx.square_clustering(A.dup(int))
So, I have been able to reproduce negative results using graphblas Matrix objects with dtype of int8
as inputs, so we do need to be more careful about dtypes during calculations. My bad! I got swamped today, but I will try to fix this tomorrow.
Also, I think we get the same results as networkx with the presense of self-loops (or matrix diagonals), so that probably isn't an issue.
Sure. How are you using
graphblas-algorithms
? Do you start with a networkx Graph or a graphblas Matrix, and are you calling from networkx?
I've been using the Dispatch Example from the README, which converts from networkx to a ga.Graph and back after calling the function from networkx.
I just tried the first suggestion with the networkx graph and got this error from algorithms.square_clustering(G, chunk_ids)
A, degrees = G.get_properties("A degrees+")
KeyError: 'degrees+'
I've been using the Dispatch Example from the README, which converts from networkx to a ga.Graph and back after calling the function from networkx.
Cool, thanks. Does the networkx graph have weights with "weight"
key, and are they by chance boolean?
I just tried the first suggestion with the networkx graph and got this error [...]
Oh my! That shouldn't happen. Can you share the full stacktrace please (and a minimal reproducible example if it's not too much hassle)?
Hey @danielverd, think I have a fix for this in https://github.com/python-graphblas/python-graphblas/pull/524
Specifically, these two lines are needed: https://github.com/python-graphblas/python-graphblas/pull/524/files#diff-b4b616572292444660e5f89b1a07a69d0f2d049cecd40c4c41b3dd82638bce53R79-R80
I'd like to release python-graphblas
tomorrow (Wednesday) with this fix.
If you still see negative results after this, then we may need to investigate further into integer overflow when int64
is too small. If this happens, perhaps we can change the calculation to prevent overflow or figure out when it may occur and use float64
. I may play around with some synthetic graphs around the same size as yours.
Note to observers: we would still like to improve the repr of NodeMap
and other returned objects ;)
@danielverd, python-graphblas
version 2023.12.0 was released and is available from PyPI and conda-forge. I believe updating to this version should fix the issue with negative results for square clustering.
In my testing, square_clustering
can safely handle graphs with largest node degree of 250,000 (even if they all have this degree, which is quite large--more than 62 billion edges!). I don't know the upper limit. If the graph has no self-edges (i.e., adjacency matrix has no values on the diagonal), then the results from square clustering should be within [0, 1]
, and if they are not then overflow must have occurred.
I created issues #85 and #86 to capture other tasks from this issue.
Closing, b/c I think all questions have been answered and issues have been fixed or captured. @danielverd, good luck! Please don't be shy about asking for help. Have you found any other fast square clustering algorithms around?
digraphTCC2 = ga.Graph.from_networkx(digraphTCC)
clus = nx.square_clustering(digraphTCC2)
print(clus)
for i in clus:
print(i)
I've just found graphblas-algorithms as a way to speed up computations on a networkx graph, and I've integrated it into my work for finding the square clustering coefficient of a graph. The speedup is immense, but I'm not sure how to interpret the result.
Based on the code snippet above (digraphTCC is a valid networkx graph), the result of the function, clus prints out as
<graphblas_algorithms.classes.nodemap.NodeMap object at 0x00000206BC590550>
.I'm unaware how to interact with a NodeMap object, but when I iterate through it, the values seem to just be node labels. How should I interpret this result, and transform it into the dictionary of {node:clustering} that you would expect from the slow networkx version?
Thank you for your time!