networkx / nx-parallel

A networkx backend that uses joblib to run graph algorithms in parallel.
BSD 3-Clause "New" or "Revised" License

Refactoring and Simplification Suggestions (based on PR #7) #30

Open Schefflera-Arboricola opened 8 months ago

Schefflera-Arboricola commented 8 months ago

I took a look at PR #7 and wanted to share some thoughts for consideration:

  1. Code Organization in external.py and misc.py: I suggest keeping most of the Backend class (in external.py) and moving the optional_package function (from misc.py) into external.py. My reasoning: networkx and joblib will probably be our main dependencies, and other packages will probably only be used as joblib backends, so it seems logical to keep the optional_package function alongside the Backend class. Let me know if you see any potential drawbacks to this approach.

  2. Exclusion of NxReduce in partition.py: I propose excluding the NxReduce class from partition.py. Since most of its functions are one-liners, it might be more intuitive for new contributors to write these directly instead of going through the NxReduce class. This would also improve code readability, especially for those new to the library. I couldn't identify any scenario where we'd prefer NxReduce over a simple reduce(lambda ...). Please share your insights on this.
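     To make the comparison concrete, here's a minimal sketch (all names and data hypothetical, not the actual nx-parallel code) of aggregating per-worker partial results with plain functools.reduce instead of a dedicated reducer class:

     ```python
     from functools import reduce

     # Imagine each parallel worker returned a partial dict of node -> count.
     partial_results = [{"a": 1, "b": 2}, {"b": 3, "c": 4}]

     def merge_counts(acc, part):
         """Merge one worker's partial counts into the accumulator."""
         for node, count in part.items():
             acc[node] = acc.get(node, 0) + count
         return acc

     # One reduce call replaces the class machinery.
     total = reduce(merge_counts, partial_results, {})
     # total == {"a": 1, "b": 5, "c": 4}
     ```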

  3. Refactoring the NxMap class: I think there would be enough places where NxMap would be useful. But it might be better to have two independent functions (excluding __call__, because it is a one-line function) instead of a single NxMap class. For example, if we add the get_chunks parameter (in PR #29) that lets the user define their own node_chunks, then we would not need chunks, but we might still need the create_iterables function independently (depending on the algorithm). Please let me know your thoughts on this.
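     A rough sketch of what the two independent functions could look like (all names hypothetical, not the actual nx-parallel API): one helper that chunks nodes, and one that maps a function over the chunks with joblib. A user-supplied get_chunks could then simply replace the default chunking without touching the mapping step.

     ```python
     from itertools import islice
     from joblib import Parallel, delayed

     def chunks(iterable, n_chunks):
         """Split an iterable into roughly equal-sized chunks."""
         items = list(iterable)
         size = max(1, -(-len(items) // n_chunks))  # ceiling division
         it = iter(items)
         while chunk := list(islice(it, size)):
             yield chunk

     def parallel_map(func, node_chunks, n_jobs=-1):
         """Apply func to each chunk of nodes concurrently."""
         return Parallel(n_jobs=n_jobs)(delayed(func)(c) for c in node_chunks)

     # Default chunking; a user-defined get_chunks could produce node_chunks instead.
     node_chunks = list(chunks(range(10), 3))
     results = parallel_map(sum, node_chunks, n_jobs=2)
     ```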

Thank you :)

dschult commented 8 months ago

Thanks for this summary. It has been a while since I looked at that PR. I think it is worth thinking about options for commonly used code to make it easier to write new functions -- and also easier to read once you know the commonly used parts. I'm sure that we don't want to do all the suggestions in that PR and I'm sure we probably should be doing some or even many of them. Using functions instead of classes is also a fine way to approach it. Classes are often better when there is data that is very close to many functions -- the class avoids having to pass the data around a lot. But functions are often better as building blocks that we can put together to do complicated things where each part is simple.

We should probably try several approaches to identify which is the best -- and we should know that our personal preference will color the results. There probably isn't a "right way". But I'm sure there are non-easy ways. :) Overall your suggestions seem reasonable.

Schefflera-Arboricola commented 6 months ago

Some Side Notes:

Here, Map and Reduce come from the MapReduce architecture (refs. 1, 2), a distributed programming model. It breaks tasks down into map and reduce phases: the map tasks process input data and produce intermediate results, which are then aggregated and processed by reduce tasks to generate the final output.
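A toy illustration of those two phases (word counting, the classic MapReduce example; nothing here is nx-parallel code): map tasks emit intermediate (key, value) pairs, and the reduce phase aggregates them into the final output.

```python
from functools import reduce

lines = ["a b a", "b c"]

# Map phase: each "task" turns one input line into intermediate (word, 1) pairs.
mapped = [[(w, 1) for w in line.split()] for line in lines]

# Reduce phase: aggregate the intermediate pairs into final counts.
def reducer(acc, pair):
    word, count = pair
    acc[word] = acc.get(word, 0) + count
    return acc

counts = reduce(reducer, (p for chunk in mapped for p in chunk), {})
# counts == {"a": 2, "b": 2, "c": 1}
```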

As of now, there are no distributed graph algorithms in the repo; all are parallel. Because joblib.Parallel runs the sub-tasks concurrently within the same computing environment, this is a parallel computing strategy rather than a distributed one (ref. 3) (but I'm not 100% sure about this!)
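A small illustrative check of that point: each joblib.Parallel subtask can report which host and process it ran in. With the default process-based backend the PIDs can differ, but every subtask runs on the same machine, which is what makes this parallel rather than distributed.

```python
import os
import socket
from joblib import Parallel, delayed

def where_am_i(i):
    """Report the host and process a subtask runs in."""
    return socket.gethostname(), os.getpid()

results = Parallel(n_jobs=2)(delayed(where_am_i)(i) for i in range(4))
hosts = {host for host, _pid in results}
# All subtasks ran on a single host.
```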