python-graphblas / graphblas-algorithms

Graph algorithms written in GraphBLAS
Apache License 2.0
76 stars 4 forks source link

square_clustering output using networkx api #82

Closed danielverd closed 11 months ago

danielverd commented 12 months ago

digraphTCC2 = ga.Graph.from_networkx(digraphTCC) clus = nx.square_clustering(digraphTCC2) print(clus)

for i in clus: print(i)

I've just found graphblas-algorithms as a way to speed up computations on a networkx graph, and I've integrated it into my work for finding the square clustering coefficient of a graph. The speedup is immense, but I'm not sure how to interpret the result.

Based on the code snippet above (digraphTCC is a valid networkx graph), the result of the function, clus prints out as <graphblas_algorithms.classes.nodemap.NodeMap object at 0x00000206BC590550>.

I'm unaware how to interact with a NodeMap object, but when I iterate through it, the values seem to just be node labels. How should I interpret this result, and transform it into the dictionary of {node:clustering} that you would expect from the slow networkx version?

Thank you for your time!

eriknw commented 12 months ago

Thanks for the issue and question @danielverd, and great to hear "the speedup is immense"!

Take one: NodeMap is a MutableMapping, so you can use it like a dict such as clus[key] and for key, val in clus.items(). You can also call dict(clus) to convert it to a dict.

Remark: NodeMap also has .vector attribute that is the underlying GraphBLAS vector, and it has .key_to_id attribute that may be a dict that maps the vector index to the node labels.

Take two: it looks like we ought to improve the repr of NodeMap so the keys and values (and dict-like behavior) are more obvious.

Take three: more generally, I guess we ought to improve our documentation including providing examples.

Finally, I'm working very actively on improving dispatching in NetworkX and implementing backends, so please don't be shy about reporting issues, recommending algorithms, and general user-experience suggestions :)

danielverd commented 12 months ago

I'll report anything I can find.

I'm currently working on getting a publication finalized, but once that's over I could help out with this project since it's helped save me so much compute time lol.

One final question, do you have certainty that the outputs of square_clustering are identical to networkx? I've gotten a situation where the output becomes negative which shouldn't be possible when averaging out positive numbers (but that could be an issue outside of square_clustering).

MridulS commented 12 months ago

One final question, do you have certainty that the outputs of square_clustering are identical to networkx? I've gotten a situation where the output becomes negative which shouldn't be possible when averaging out positive numbers (but that could be an issue outside of square_clustering).

This library uses the networkx test suite to run the tests so it should be identical. If you do end up finding a corner case please do open an issue.

SultanOrazbayev commented 12 months ago

Yes, a reproducible example would help.

eriknw commented 12 months ago

Sounds great @danielverd! Good luck on the publication.

Does your graph have self-edges? As MridulS said, we do run against (and pass) networkx tests, but I don't think networkx tests square clustering with self-edges, and this is the only way I could think to get negative results (or maybe integer overflow?). How large is the graph, and what's the largest degree of a node?

danielverd commented 12 months ago

yeah, i've been messing around with it. [The issue might be that any zeroes come back as integers.] this doesn't seem to be the issue

I'll keep checking myself because I'm not entirely sure I can share the data I used that led to this issue. Sorry about that, but I'll see what I can find.

danielverd commented 12 months ago

How large is the graph, and what's the largest degree of a node?

260587 edges, 132922 nodes, and the highest degree is 42047.

eriknw commented 12 months ago

Perfect thanks.

I'm looking at the implementation more closely, and I think we should probably specify the dtype when using e.g. plus_pair semiring in algorithms. @jim22k was smart enough to do this in k_truss. Otherwise, the return dtype of plus_pair is determined by the inputs, which may be e.g. INT8, which may not behave well. Gonna try this out, and I'll push a bugfix release if necessary.

danielverd commented 12 months ago

Fantastic. Are there any workarounds you suppose I could try in the meantime?

eriknw commented 12 months ago

Are there any workarounds you suppose I could try in the meantime?

Sure. How are you using graphblas-algorithms? Do you start with a networkx Graph or a graphblas Matrix, and are you calling from networkx?

networkx graph G, calling from networkx

nx.square_clustering(G, backend="graphblas")  # no workaround needed (I hope)

graphblas Matrix A, calling from networkx

nx.square_clustering(A.dup(int))

So, I have been able to reproduce negative results using graphblas Matrix objects with dtype of int8 as inputs, so we do need to be more careful about dtypes during calculations. My bad! I got swamped today, but I will try to fix this tomorrow.

Also, I think we get the same results as networkx with the presense of self-loops (or matrix diagonals), so that probably isn't an issue.

danielverd commented 12 months ago

Sure. How are you using graphblas-algorithms? Do you start with a networkx Graph or a graphblas Matrix, and are you calling from networkx?

I've been using the Dispatch Example from the README, which converts from networkx to a ga.Graph and back after calling the function from networkx.

I just tried the first suggestion with the networkx graph and got this error from algorithms.square_clustering(G, chunk_ids) A, degrees = G.get_properties("A degrees+") KeyError: 'degrees+'

eriknw commented 12 months ago

I've been using the Dispatch Example from the README, which converts from networkx to a ga.Graph and back after calling the function from networkx.

Cool, thanks. Does the networkx graph have weights with "weight" key, and are they by chance boolean?

I just tried the first suggestion with the networkx graph and got this error [...]

Oh my! That shouldn't happen. Can you share the full stacktrace please (and a minimal reproducible example if it's not too much hassle)?

eriknw commented 11 months ago

Hey @danielverd, think I have a fix for this in https://github.com/python-graphblas/python-graphblas/pull/524

Specifically, these two lines are needed: https://github.com/python-graphblas/python-graphblas/pull/524/files#diff-b4b616572292444660e5f89b1a07a69d0f2d049cecd40c4c41b3dd82638bce53R79-R80

I'd like to release python-graphblas tomorrow (Wednesday) with this fix.

If you still see negative results after this, then we may need to investigate further into integer overflow when int64 is too small. If this happens, perhaps we can change the calculation to prevent overflow or figure out when it may occur and use float64. I may play around with some synthetic graphs around the same size as yours.

Note to observers: we would still like to improve the repr of NodeMap and other returned objects ;)

eriknw commented 11 months ago

@danielverd, python-graphblas version 2023.12.0 was released and is available from PyPI and conda-forge. I believe updating to this version should fix the issue with negative results for square clustering.

In my testing, square_clustering can safely handle graphs with largest node degree of 250,000 (even if they all have this degree, which is quite large--more than 62 billion edges!). I don't know the upper limit. If the graph has no self-edges (i.e., adjacency matrix has no values on the diagonal), then the results from square clustering should be within [0, 1], and if they are not then overflow must have occurred.

I created issues #85 and #86 to capture other tasks from this issue.

Closing, b/c I think all questions have been answered and issues have been fixed or captured. @danielverd, good luck! Please don't be shy about asking for help. Have you found any other fast square clustering algorithms around?