metagraph-dev / dask-grblas

Distributed GraphBLAS via dask and python-graphblas
Apache License 2.0

Is `dask.grblas` ready for a connected components algorithm? #3

Open ParticularMiner opened 2 years ago

ParticularMiner commented 2 years ago

Hi @eriknw,

Many thanks for pioneering this and for all your impressive work on grblas!

Searching for "dask" and "GraphBLAS" together brought me here.

As part of an open-source home project of mine, I'm attempting to write a connected-components algorithm for graphs too large to fit into the RAM of a standard laptop. This makes dask arrays the natural choice of backend.

My starting point is "FastSV" (a distributed-memory connected-components algorithm), which can be found in LAGraph and is written in C against the GraphBLAS API.

I'm about to peruse the source code of dask-grblas to figure out how it works. Still, if you don't mind me asking: in your opinion, is dask-grblas already at a stage where it can be used for my purposes? If not, and if you don't mind me possibly contributing in the future, could you share roughly what is outstanding?

Also, I'm guessing dask-grblas is intended to be used together with grblas, right?

If, on the other hand, you feel there are better ways of approaching my problem, please feel free to advise me.

Thanks!

eriknw commented 2 years ago

Hey, thanks for leaving a message @ParticularMiner, this is very interesting!

To answer your most pressing questions first: no, dask-grblas is not quite ready to be used for this purpose. It does not yet have a working matrix multiply (obviously very important), which is the next thing to work on. I think having a version of distributed, sparse matrix multiply is very doable (but it may not be the best in all circumstances), and I would love to push forward on this.
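To sketch the idea, a blocked sparse matmul over scipy.sparse chunks can be wired up with dask.delayed in a few lines. This is illustrative only; the block shapes, densities, and helper names here are made up, and it is not how dask-grblas is structured internally:

```python
import operator
from functools import reduce

import dask
import scipy.sparse as sp

@dask.delayed
def matmul(a, b):
    # multiply one pair of sparse blocks
    return a @ b

@dask.delayed
def block_sum(*blocks):
    # sum the partial products contributing to one output block
    return reduce(operator.add, blocks)

def block_matmul(A, B):
    """C[i][j] = sum_k A[i][k] @ B[k][j]; each block is a scipy.sparse matrix."""
    nI, nK, nJ = len(A), len(B), len(B[0])
    return [[block_sum(*(matmul(A[i][k], B[k][j]) for k in range(nK)))
             for j in range(nJ)] for i in range(nI)]

A = [[sp.random(100, 100, density=0.01, format="csr") for _ in range(2)] for _ in range(2)]
B = [[sp.random(100, 100, density=0.01, format="csr") for _ in range(2)] for _ in range(2)]
# compute all four output blocks in parallel
C = dask.compute(*[c for row in block_matmul(A, B) for c in row])
```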

Other important missing functionality is assignment. dask.array recently added support for this, so it would probably be straightforward to add this to dask-grblas.
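For example, with a reasonably recent dask, basic setitem now works on dask arrays (a quick illustration; exactly which index types are supported depends on the dask version):

```python
import dask.array as da

x = da.zeros(8, chunks=4)
x[2:5] = 1          # slice assignment
x[x > 0] = 2        # boolean-mask assignment
print(x.compute())  # [0. 0. 2. 2. 2. 0. 0. 0.]
```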

I think dask-grblas is a solid beginning, and it demonstrates that this approach can probably work. However, it hasn't been touched in over a year, and grblas, SuiteSparse:GraphBLAS, and the GraphBLAS specification have all advanced substantially. So, to push forward, one could first work on matrix multiply using libraries from 16 months ago, or one could first update dask-grblas to use the latest libraries.

And, yes, dask-grblas is intended to be used with grblas. It is also intended to mirror the API of grblas.

I think the next step (regardless of the state of dask-grblas) would be to write the connected components algorithm for yourself and run it on small data. Ideally, you would write this algorithm in a high-level language such as Python with GraphBLAS or NumPy. To be honest, the LAGraph implementation of connected components looks gnarly to me, and I don't understand it. But reading the papers for "FastSV" (and "LACC") makes me think it's implementable.
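For what it's worth, the hook-and-shortcut iteration at the heart of these algorithms fits in about a dozen lines of NumPy on an edge list. This is a sketch in the spirit of Shiloach-Vishkin/FastSV, not a translation of the LAGraph code:

```python
import numpy as np

def connected_components(edges, n):
    """Label each vertex with the smallest vertex id in its component.

    A NumPy sketch of the Shiloach-Vishkin/FastSV hook-and-shortcut loop.
    """
    src, dst = edges[:, 0], edges[:, 1]
    parent = np.arange(n)
    while True:
        old = parent.copy()
        gp = parent[parent]                    # grandparents
        # hooking: pull each endpoint's parent down to the other
        # endpoint's grandparent (ufunc.at is unbuffered, so duplicate
        # targets are reduced with min correctly)
        np.minimum.at(parent, parent[src], gp[dst])
        np.minimum.at(parent, parent[dst], gp[src])
        # shortcutting: jump each pointer to its grandparent
        parent = parent[parent]
        if np.array_equal(parent, old):
            return parent

edges = np.array([[0, 1], [1, 2], [3, 4]])
print(connected_components(edges, 5))  # [0 0 0 3 3]
```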

Once we have a working example, we could consider what to do next. If I just need to add a couple of things to dask-grblas, such as basic matrix multiply without masks or accumulation, then I'd want to do this. Another possibility would be to write a custom solution using dask.array that narrowly does only what we need. I'm pretty good at working with Dask (especially weird, non-standard things!), and I would love to share my Dask knowledge if you're interested, so this option appeals to me too. There are probably other reasonable ways to approach this, but, well, I don't fully grok "FastSV" right now, so it's hard for me to say ;)

Btw, feel free to join and find me on the GraphBLAS slack channel: https://thegraphblas.slack.com

ParticularMiner commented 2 years ago

Many thanks for your interest @eriknw! That is better than I could have hoped for.

> To be honest, the LAGraph implementation of connected components looks gnarly to me

It is gnarly, isn't it? Actually, to begin with, I intend to ignore a large part (≈40%) of that code, namely the “sample phase” (lines 400-692), since I'm not sure what exactly it is for, nor is it referred to in any of the authors' papers.

396    //--------------------------------------------------------------------------
397    // sample phase
398    //--------------------------------------------------------------------------
399
400    if (sampling)
401    {
402
403        // et cetera

At first glance, though, it seems to be a way of reducing the number of edges if that number is deemed too large (line 349).

349    bool sampling = (n * FASTSV_SAMPLES * 2 < nnz) ;

But the crucial part of the code, I believe, is barely 15 lines long and is what they call the “final phase” (lines 694 onward), which is really what I'm keen to implement using dask-grblas.

694    //--------------------------------------------------------------------------
695    // final phase
696    //--------------------------------------------------------------------------
697
698    GrB_TRY (GrB_Matrix_nvals (&nnz, T)) ;
699
700    bool change = true ;
701    while (change && nnz > 0)
702    {
703        // hooking & shortcutting
704        GrB_TRY (GrB_mxv (mngp, NULL, GrB_MIN_UINT32,
705                          GrB_MIN_SECOND_SEMIRING_UINT32, T, gp, NULL)) ;
706        // et cetera
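For reference, that GrB_mxv call maps to grblas fairly directly. A sketch, assuming grblas's operator naming; the size `n` and the variable setup here are placeholders:

```python
from grblas import Matrix, Vector, binary, dtypes, semiring

n = 5                                      # placeholder size
T = Matrix.new(dtypes.UINT32, n, n)        # the (possibly condensed) adjacency matrix
gp = Vector.new(dtypes.UINT32, n)          # grandparent vector
mngp = Vector.new(dtypes.UINT32, n)        # min grandparent over each vertex's neighbors

# accumulate with MIN, multiply with the MIN_SECOND semiring,
# i.e. the GrB_mxv call on lines 704-705 above
mngp(binary.min) << T.mxv(gp, semiring.min_second)
```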

> I think the next step (regardless of the state of dask-grblas) would be to write the connected components algorithm for yourself and run it on small data. Ideally, you would write this algorithm in a high-level language such as Python with GraphBLAS or NumPy.

Good idea. I'll try using grblas for that. This will be my starting point, then. Would you perhaps want to review it as a pull request adding an example application to grblas?

> Another possibility would be to write a custom solution using dask.array that narrowly does only what we need.

I always find it instructive to discover alternative ways of solving a problem. So yes, of course it would be great if you/we could explore a custom dask solution. Then we could compare its performance to that of our “FastSV” implementation. For your information, scipy.sparse.csgraph already has a simple serial implementation (written in Cython), which I use for small graphs.
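For anyone following along, that scipy baseline looks like this:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# a 5-vertex graph with edges (0,1), (1,2), (3,4)
graph = csr_matrix(([1, 1, 1], ([0, 1, 3], [1, 2, 4])), shape=(5, 5))
n_components, labels = connected_components(graph, directed=False)
print(n_components, labels)  # 2 [0 0 0 1 1]
```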

> I'm pretty good at working with Dask (especially weird, non-standard things!), and I would love to share my Dask knowledge if you're interested, so this option appeals to me too.

Excellent! Admittedly, my knowledge/skills in dask.array are still rudimentary. Still, I can work reasonably well with dask arrays consisting of scipy.sparse matrix chunks. But, for example, supplanting the usual addition and multiplication operators over integers in dask.array with a custom semiring is something I have yet to explore, and it seems to require deeper knowledge, which you clearly have.
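To make that concrete, here is roughly how one could apply a MIN_SECOND-style matvec over row-blocked scipy.sparse chunks with dask.delayed. A toy sketch with made-up sizes, not dask-grblas API; a real version would use grblas per chunk or vectorize the row loop:

```python
import dask
import numpy as np
import scipy.sparse as sp

def min_second_matvec(A, v, fill):
    """y[i] = min(v[j] for j in the pattern of row i), i.e. a MIN_SECOND matvec."""
    A = sp.csr_matrix(A)
    y = np.full(A.shape[0], fill, dtype=v.dtype)
    for i in range(A.shape[0]):
        cols = A.indices[A.indptr[i]:A.indptr[i + 1]]
        if cols.size:
            y[i] = v[cols].min()
    # rows with no entries stay at `fill` (GraphBLAS would leave them absent)
    return y

# a row-blocked 4000x4000 matrix, each chunk a scipy.sparse matrix
blocks = [sp.random(1000, 4000, density=0.001, format="csr", random_state=i)
          for i in range(4)]
v = np.arange(4000, dtype=np.int64)
fill = np.iinfo(np.int64).max

# one delayed task per row block, then stitch the pieces back together
parts = [dask.delayed(min_second_matvec)(b, v, fill) for b in blocks]
y = np.concatenate(dask.compute(*parts))
```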

> Btw, feel free to join and find me on the GraphBLAS slack channel: https://thegraphblas.slack.com

Sure. I'll be creating an account there soon.

Looking forward to a fruitful collaboration! 😄