ParticularMiner opened this issue 3 years ago
Hey, thanks for leaving a message @ParticularMiner, this is very interesting!
To answer your most pressing questions first: no, `dask-grblas` is not quite ready to be used for this purpose. It does not yet have a working matrix multiply (obviously very important), which is the next thing to work on. I think having a version of distributed, sparse matrix multiply is very doable (but it may not be the best in all circumstances), and I would love to push forward on this.
Other important missing functionality is assignment. `dask.array` recently added support for this, so it would probably be straightforward to add this to `dask-grblas`.
I think `dask-grblas` is a solid beginning, and it demonstrates that this approach can probably work. However, it hasn't been touched in over a year, and `grblas`, SuiteSparse:GraphBLAS, and the GraphBLAS specification have all advanced substantially. So, to push forward, one could first work on matrix multiply using the libraries from 16 months ago, or one could first update `dask-grblas` to use the latest libraries.
And, yes, `dask-grblas` is intended to be used with `grblas`. It is also intended to mirror the API of `grblas`.
I think the next step (regardless of the state of `dask-grblas`) would be to write the connected components algorithm for yourself and run it on small data. Ideally, you would write this algorithm in a high-level language such as Python with GraphBLAS or NumPy. To be honest, the LAGraph implementation of connected components looks gnarly to me, and I don't understand it. But reading the papers for "FastSV" (and "LACC") makes me think it's implementable.
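To make that suggestion concrete, here is a rough NumPy sketch of the hooking-and-shortcutting idea (my own simplification of FastSV, not the LAGraph code; the function name and edge-list representation are mine):

```python
import numpy as np

def connected_components_sv(edges, n):
    """FastSV-flavoured sketch: repeat hooking (pull parents down to the
    minimum parent seen in each neighbourhood) and shortcutting
    (parent[i] = parent[parent[i]]) until a fixed point is reached.
    `edges` is an (m, 2) integer array of undirected edges; `n` is the
    number of vertices."""
    parent = np.arange(n)
    u, v = edges[:, 0], edges[:, 1]
    while True:
        old = parent.copy()
        # mngp[i] = min of parent[j] over neighbours j of i -- the role
        # played by GrB_mxv with the MIN_SECOND semiring in LAGraph
        mngp = parent.copy()
        np.minimum.at(mngp, u, parent[v])
        np.minimum.at(mngp, v, parent[u])
        np.minimum.at(parent, old, mngp)   # hooking onto grandparents
        parent = np.minimum(parent, mngp)  # aggressive hooking
        parent = parent[parent]            # shortcutting
        if np.array_equal(parent, old):
            return parent

# Two components: {0, 1, 2} and {3, 4}
edges = np.array([[0, 1], [1, 2], [3, 4]])
print(connected_components_sv(edges, 5))  # [0 0 0 3 3]
```

This sequential version is only meant to check the logic on small data before worrying about distribution.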
Once we have a working example, we could consider what to do next. If I just need to add a couple of things to `dask-grblas`, such as basic matrix multiply without masks or accumulation, then I'd want to do this. Another possibility would be to write a custom solution using `dask.array` that narrowly does only what we need. I'm pretty good at working with Dask (especially weird, non-standard things!), and I would love to share my Dask knowledge if you're interested, so this option appeals to me too. There are probably other reasonable ways to approach this, but, well, I don't fully grok "FastSV" right now, so it's hard for me to say ;)
Btw, feel free to join and find me on the GraphBLAS slack channel: https://thegraphblas.slack.com
Many thanks for your interest @eriknw! That is better than I could have hoped for.
> To be honest, the LAGraph implementation of connected components looks gnarly to me
It is gnarly, isn’t it? Actually, to begin with, I intend to ignore a large part (≈40%) of that code, namely the ‘sample phase’ (lines 400–692), since I’m not sure what exactly it is for, nor is it referred to in any of the authors’ papers.
```c
396 //--------------------------------------------------------------------------
397 // sample phase
398 //--------------------------------------------------------------------------
399
400     if (sampling)
401     {
402
403         // et cetera
```
At first glance though, it seems to be a way of reducing the number of edges if that number is deemed to be too large (line 349).
```c
349     bool sampling = (n * FASTSV_SAMPLES * 2 < nnz) ;
```
But the crucial part of the code, I believe, is barely 15 lines long and is what they call the “final phase” (lines 694 onwards), which is really what I’m keen to implement using `dask-grblas`.
```c
694 //--------------------------------------------------------------------------
695 // final phase
696 //--------------------------------------------------------------------------
697
698     GrB_TRY (GrB_Matrix_nvals (&nnz, T)) ;
699
700     bool change = true ;
701     while (change && nnz > 0)
702     {
703         // hooking & shortcutting
704         GrB_TRY (GrB_mxv (mngp, NULL, GrB_MIN_UINT32,
705             GrB_MIN_SECOND_SEMIRING_UINT32, T, gp, NULL)) ;
706         // et cetera
```
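For intuition, that `GrB_mxv` call computes, for each vertex, the minimum of `gp` over its structural neighbours, folded with the previous `mngp` via the `GrB_MIN` accumulator. A hedged SciPy/NumPy sketch of just that step (function name and toy data are mine):

```python
import numpy as np
import scipy.sparse as sp

def mxv_min_second(T, gp, mngp):
    """Sketch of GrB_mxv with the MIN_SECOND semiring plus the GrB_MIN
    accumulator: out[i] = min(mngp[i], min over stored T[i, j] of gp[j]).
    SECOND ignores the matrix values, so only the pattern of T matters."""
    T = sp.coo_matrix(T)
    out = mngp.copy()
    np.minimum.at(out, T.row, gp[T.col])
    return out

# Toy pattern: vertex 0 points at 1 and 2; vertex 1 points at 2.
T = sp.coo_matrix((np.ones(3), ([0, 0, 1], [1, 2, 2])), shape=(3, 3))
gp = np.array([5, 3, 4])
mngp = np.array([9, 9, 2])
print(mxv_min_second(T, gp, mngp))  # [3 4 2]
```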
> I think the next step (regardless of the state of dask-grblas) would be to write the connected components algorithm for yourself and run it on small data. Ideally, you would write this algorithm in a high level language such as Python with GraphBLAS or NumPy.
Good idea. I'll try using `grblas` for that. This will be my starting point, then. Would you perhaps want to review it as a pull request contributing an example application to the `grblas` repository?
> Another possibility would be to write a custom solution using dask.array that narrowly does only what we need.
I always find it instructive to discover alternative ways of solving a problem. So yes, of course it would be great if you/we could explore a custom `dask` solution. Then we could compare its performance to that of our “FastSV” implementation. For your information, `scipy.sparse.csgraph` already has a simple serial implementation (written in Cython), which I'm already using for small graphs.
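For reference, a minimal usage sketch of that SciPy routine on toy data of my own:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Toy 5-vertex graph with two components: {0, 1, 2} and {3, 4}
rows = np.array([0, 1, 3])
cols = np.array([1, 2, 4])
A = csr_matrix((np.ones(3), (rows, cols)), shape=(5, 5))

# directed=False treats each stored edge as undirected
n_components, labels = connected_components(A, directed=False)
print(n_components)  # 2
```

It serves as a convenient correctness oracle for any distributed implementation, as long as the graph fits in memory.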
> I'm pretty good with working with Dask (especially, weird, non-standard things!), and I would love to share my Dask knowledge if you're interested, so this option appeals to me too.
Excellent! Admittedly, my knowledge of and skills with `dask.array` are still rudimentary. Still, I can work reasonably well with `dask` arrays consisting of `scipy.sparse` matrix chunks. But, for example, supplanting the usual addition and multiplication operators over integers in `dask.array` with a custom semiring is something I have yet to explore, and it seems to require deeper knowledge, which you clearly have.
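To make the idea concrete, here is a dense, hedged sketch of what I have in mind, using `da.blockwise` to apply a min-second-style combination per block (all names and the toy graph are mine; a real version would use sparse chunks):

```python
import dask.array as da
import numpy as np

def min_second_block(A_blk, v_blk):
    """Within one block: out[i] = min of v_blk[j] over j with
    A_blk[i, j] != 0, i.e. a dense stand-in for the MIN_SECOND semiring.
    Rows with no neighbours would come out as inf in this sketch."""
    masked = np.where(A_blk != 0, v_blk[np.newaxis, :], np.inf)
    return masked.min(axis=1)

# Toy 4-vertex graph; a nonzero A[i, j] means an edge i -> j.
A = da.from_array(np.array([[0, 1, 0, 0],
                            [1, 0, 1, 0],
                            [0, 1, 0, 1],
                            [0, 0, 1, 0]], dtype=np.uint8), chunks=2)
v = da.from_array(np.array([3., 0., 2., 1.]), chunks=2)

# Contract index 'j'; concatenate=True hands each call the full j extent.
mngp = da.blockwise(min_second_block, 'i', A, 'ij', v, 'j',
                    concatenate=True, dtype=float)
print(mngp.compute())  # [0. 2. 0. 2.]
```

`da.blockwise` is the escape hatch that lets us swap the (+, ×) pair for any block-level combination we like, which is exactly the flexibility a GraphBLAS semiring provides.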
> Btw, feel free to join and find me on the GraphBLAS slack channel: https://thegraphblas.slack.com
Sure. I'll be creating an account there soon.
Looking forward to a fruitful collaboration! 😄
Hi @eriknw,

Many thanks for pioneering this and for all your impressive work on `grblas`! Searching for "dask" and "GraphBLAS" together brought me here.

As part of an open-source home project of mine, I'm attempting to write a connected components algorithm for graphs too large to fit into the RAM of a standard laptop. This makes `dask.array` the natural choice of backend. My starting point is "FastSV" (a distributed-memory connected components algorithm), whose C implementation, written using the GraphBLAS API, can be found in LAGraph.

I'm about to peruse the source code of `dask-grblas` to figure out how it works. Still, if you don't mind me asking: in your opinion, is `dask-grblas` already at a stage where it can be used for my purposes? If not, and if you don't mind me possibly contributing in the future, could you share with me roughly what is outstanding?

Also, I'm guessing `dask-grblas` is intended to be used together with `grblas`, right?

If, on the other hand, you feel there are better ways of approaching my problem, please do feel free to advise me.

Thanks!