To view the benchmark results locally:

```
git checkout bench_chunking
cd nx-parallel/benchmarks
```

then copy the `results` and `html` folders from the VM (you will need access to it from @MridulS):

```
scp -rC root@37.27.34.9:nx-parallel/benchmarks/results .
scp -rC root@37.27.34.9:nx-parallel/benchmarks/html .
```

and run:

```
asv preview
```
Let me know if you face any errors.
Right now you will be able to see 3 benchmarks (`betweenness_centrality`, `number_of_isolates`, and `all_pairs_bellman_ford_path`). The `all_pairs_bellman_ford_path` one is not accurate because I just found a bug in its chunking version, so I'll have to re-run it.
Initially, asv was giving a timeout error for the 800-node graphs, so I increased `default_benchmark_timeout` to 1200 seconds. Also, the VM connection kept breaking because the benchmarks took so long to execute. Making a tmux session helped, but it still was not enough to run all the benchmarks, so I started running them one at a time. That worked for the above 3 algorithms, but the other algorithms were taking a lot of time and the VM connection would break in between. I am trying to find a way to run asv benchmarks in parts. But is there a way I could increase the time I stay connected to the VM?
Also, if you want, you can keep this as a separate branch in the repo, but it should not be merged into the main branch, so I'm keeping it as a draft PR.
All benchmarks were run on this machine (`machine.json`):

```json
{
    "arch": "x86_64",
    "cpu": "AMD EPYC-Milan Processor",
    "machine": "nx-benchmarks",
    "num_cpu": "4",
    "os": "Linux 5.15.0-92-generic",
    "ram": "15981980",
    "version": 1
}
```
Graph generation function:

```python
G = nx.fast_gnp_random_graph(num_nodes, edge_prob, seed=42, directed=False)
# directed=True only for all_pairs_bellman_ford_path

num_nodes = [50, 100, 200, 400, 800]
edge_prob = [0.8, 0.6, 0.4, 0.2]

# for tournament functions:
G = nx.tournament.random_tournament(num_nodes, seed=42)
```
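For reference, a rough sketch of how one of these asv benchmarks could be parametrized over those values (the class and method names here are illustrative, not the exact ones in this branch, and the call into nx-parallel is assumed to go through NetworkX's `backend="parallel"` dispatch):

```python
import networkx as nx

class ChunkingBenchmarkSketch:
    # asv runs the cartesian product of these parameter lists
    params = [[50, 100, 200, 400, 800], [0.8, 0.6, 0.4, 0.2]]
    param_names = ["num_nodes", "edge_prob"]
    timeout = 1200  # seconds, matching the adjusted default_benchmark_timeout

    def setup(self, num_nodes, edge_prob):
        self.G = nx.fast_gnp_random_graph(num_nodes, edge_prob, seed=42, directed=False)

    def time_betweenness_centrality(self, num_nodes, edge_prob):
        # assumes nx-parallel is installed and registered as the "parallel" backend
        nx.betweenness_centrality(self.G, backend="parallel")
```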
For `betweenness_centrality`, chunking was always better, and the speedup improved as the number of nodes increased:
For `number_of_isolates` there is not much difference with or without chunking:
Thanks!
Nice work setting this up in a way that runs on the remote server.
There are a few different reasons that the ssh session might disconnect. But if it is an issue of being "idle" (no activity for a certain period), then on your local machine set up your ssh config to use `ServerAliveInterval` and `ServerAliveCountMax`. There are server settings like `ClientAliveInterval` too, if we get those set up. The idea is that ssh will send some traffic across the connection every `ServerAliveInterval` seconds, and it won't give up until it fails to get a response `ServerAliveCountMax` times. Of course, if your connection is actually broken, it will eventually time out.
Set it in `~/.ssh/config` (applies to your user only) or `/etc/ssh/ssh_config` (applies system-wide):

```
Host *
    KeepAlive yes
    ServerAliveInterval 60
    ServerAliveCountMax 120

# optional alias for easier login
Host nickname
    HostName 37.27.34.9
    User root
```
You can change "nickname" to whatever you want to type instead of the ip number.
I'm not sure if the `KeepAlive` line is necessary, but it is in my config.
For `all_pairs_bellman_ford_path` the chunking version is also better than the non-chunking one, but it might be better to keep the non-chunking version because it's unlikely someone would want the whole dict:
[EDIT] Note: The above plot is incorrect. I'm trying to figure out the right way to integrate node_chunking in a function that returns a generator. Earlier I was using `return` inside `_calculate_shortest_paths_subset` and that just returned the shortest paths for the first node in the chunk. And we cannot use `yield` here either, because then we get `TypeError: cannot pickle 'generator' object`. And returning a dictionary with `return_as="generator"` might not be the right way to do it, and it also gives a wrong output:
```python
def _calculate_shortest_paths_subset(nodes):
    d = {}
    for source in nodes:
        d[source] = single_source_bellman_ford_path(G, source, weight=weight)
    return d
```
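If it helps, here is a rough sketch (not the implementation in this PR) of one way around the pickling problem: each job returns a plain dict for its chunk, and only the outer function is a generator, re-yielding the `(source, paths)` pairs lazily. The `node_chunks` parameter and the function name are made up for illustration.

```python
from joblib import Parallel, delayed
import networkx as nx

def all_pairs_bellman_ford_path_chunked(G, node_chunks, weight="weight", n_jobs=-1):
    """Yield (source, paths) pairs, computing each chunk of sources in parallel."""

    def _paths_for_chunk(nodes):
        # A dict pickles fine (unlike a generator), so each job returns one.
        return {
            source: nx.single_source_bellman_ford_path(G, source, weight=weight)
            for source in nodes
        }

    chunk_dicts = Parallel(n_jobs=n_jobs, return_as="generator")(
        delayed(_paths_for_chunk)(chunk) for chunk in node_chunks
    )
    for chunk_dict in chunk_dicts:
        yield from chunk_dict.items()
```

With this, `dict(all_pairs_bellman_ford_path_chunked(G, chunks))` should match the dict you get from the sequential `nx.all_pairs_bellman_ford_path(G)`, up to ordering.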
I see that the joblib context manager can maybe allow easy control of the number of jobs (without backend parameters) and without changing the machine characteristics. From the joblib docs: "The parallel_config() context manager helps selecting a specific backend implementation or setting the default number of jobs."
This example appears in the joblib docs:
```python
from math import sqrt
from joblib import Parallel, delayed, parallel_config

with parallel_config(backend='threading', n_jobs=2):
    Parallel()(delayed(sqrt)(i ** 2) for i in range(10))
# output: [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
```
The text immediately after the example says:
The latter is especially useful when calling a library that uses [joblib.Parallel](https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html#joblib.Parallel) internally without exposing backend selection as part of its public API.
So, maybe we can manage the number of jobs this way...
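For example, something along these lines might work (just a sketch: it assumes nx-parallel registers itself as the `"parallel"` backend with NetworkX's dispatch machinery, and whether the configured `n_jobs` actually reaches nx-parallel's internal `Parallel(...)` calls depends on how those calls are written):

```python
import networkx as nx
from joblib import parallel_config

# Assumes nx-parallel is installed so the "parallel" backend is available.
G = nx.fast_gnp_random_graph(200, 0.4, seed=42)

# Ask joblib to use 2 jobs by default, without touching nx-parallel's
# public API. This only takes effect if nx-parallel defers to the active
# joblib configuration instead of hard-coding n_jobs.
with parallel_config(n_jobs=2):
    bc = nx.betweenness_centrality(G, backend="parallel")
```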
Also, if we use the `threading` backend, then assuming that `n_jobs` is the number of cores is not good (because `n_jobs` is the number of threads here).

What is the practical difference between the number of cpus and the number of threads? If I split my nodes into 4 chunks and start processing each chunk in a separate thread, is that different (from the nx-joblib perspective) from starting each one on a separate cpu?
What are your overall results from this experiment timing node_chunking? Are there any lessons learned that we can use when deciding how to chunk a set of nodes?
Also, do you have a remote computer to run these on still? How is that connection timeout going?
What is the practical difference between number of cpus and number of threads? If I split my nodes into 4 chunks and start processing each chunk in a separate thread, is that different (from the nx-joblib perspective) from starting each one on a separate cpu?
Yes, there is a difference. Threads are smaller units of execution within a process, and we can have shared resources with threading (joblib's `threading` backend) but not with loky (joblib's default backend, which uses processes). Managing threads typically incurs less overhead than managing separate processes because threads share resources, but excessive threading can lead to conflicts or unexpected behavior (like multiple threads attempting to access a shared resource simultaneously). Also, Python has the GIL (Global Interpreter Lock), which protects access to Python objects and prevents multiple threads from executing Python bytecode simultaneously in the same process. This means that even on a multi-core machine, only one thread can execute Python bytecode at a time in a given Python interpreter process. (ref. https://docs.python.org/3/glossary.html#term-global-interpreter-lock)
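A tiny illustration (not from this PR) of that difference for CPU-bound, pure-Python work: with the `threading` backend the four tasks effectively run one at a time because of the GIL, while `loky` runs them in separate processes and can use four cores, so on a multi-core machine the second run should finish noticeably faster.

```python
from time import perf_counter
from joblib import Parallel, delayed

def busy(n=2_000_000):
    # CPU-bound pure-Python loop; it holds the GIL while it runs.
    s = 0
    for i in range(n):
        s += i * i
    return s

for backend in ("threading", "loky"):
    t0 = perf_counter()
    Parallel(n_jobs=4, backend=backend)(delayed(busy)() for _ in range(4))
    print(backend, round(perf_counter() - t0, 2), "seconds")
```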
My concern was that if we want to have a parameter "number_of_cpus" in benchmarking, then we should not treat it as equivalent to "n_jobs" in joblib, firstly because it's not, and secondly because "number_of_cpus" suggests we are changing the hardware, which we are not (we are changing `n_jobs`). We should instead call that parameter "n_jobs" itself, to avoid any confusion. Let me know what you think.
What are your overall results from this experiment timing node_chunking?
I updated the first comment in this PR with benchmark results for 4 algorithms. For the remaining 2:

- `all_pairs_bellman_ford_path`: It's not possible to build a generator of generators in joblib (the values returned from each parallel process/thread get pickled, and you cannot pickle a generator object). So I'm not sure what "chunking" even means for algorithms that return a generator or iterator. Also, joblib does return a "normal" generator object when we set `return_as="generator"`, but I think it evaluates the generator in parallel, because converting the generator returned by the non-chunking parallel implementation into a dict (using `dict(paths)`) takes less time than converting the generator returned by the sequential implementation. So I was trying to understand how joblib produces a generator that it evaluates in parallel, and I thought that might help me understand better what "chunking" would mean in this context.
- `tournament.is_strongly_connected` took a lot of time, more than 20 minutes even for a 100-node graph (I'm trying to figure out why).

Are there any lessons learned that we can use when deciding how to chunk a set of nodes?
I don't think it would be fair to say anything based on just 4-5 algorithms. But I think we can say that chunking could be better than no chunking when we want to merge the results from each chunk afterwards and that merging is not trivial or inexpensive (based on `betweenness_centrality` and `number_of_isolates`). And I think we can discover more things as we add more algorithms. :)
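For reference, the kind of node chunking being discussed is basically just splitting the node set into a few roughly equal pieces before handing them to joblib; a hypothetical helper (not the exact one used in nx-parallel) could look like:

```python
from itertools import islice

def chunk_nodes(nodes, n_chunks):
    """Split an iterable of nodes into n_chunks roughly equal lists."""
    nodes = list(nodes)
    size, extra = divmod(len(nodes), n_chunks)
    it = iter(nodes)
    for i in range(n_chunks):
        # The first `extra` chunks get one extra node each.
        yield list(islice(it, size + (1 if i < extra else 0)))
```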
Also, do you have a remote computer to run these on still? How is that connection timeout going?
I was able to increase the connection timeout by fixing the config as you suggested in this comment. Thank you for that! And I added the `results` and `html` folders in this PR because I thought we would probably not merge this, and it would be nice to have the results here if anyone wants to see them in the future. They are also on the VM.
Thank you :)
Thanks for the description of the difference between threads and jobs and cpus. I agree with you that we should be careful with the names of each concept. `n_jobs` is different from `number_of_cpus`.
About generators, IIUC, there are two ways joblib handles the generators: ordered and unordered. With ordered, the second generator has to wait until the first is done/stops before yielding its results. In the unordered case, objects are yielded in the order the processes complete their yielding with no waiting. Then the order things are yielded from the combined generator can change from run to run even with the same data. But my understanding of this is not from experience, just memories from reading the docs a few weeks ago. ;}
Quoting the doc:
Using return_as='generator' allows to progressively consume the outputs as they arrive and keeps the memory at an acceptable level. In this case, the output of the Parallel call is a generator that yields the results in the order the tasks have been submitted with. If the order of the tasks does not matter (for instance if they are consumed by a commutative aggregation function), then using return_as='generator_unordered' can be even more efficient.
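A minimal sketch of that difference (assuming a recent joblib; both `return_as="generator"` and `return_as="generator_unordered"` are fairly new additions):

```python
from joblib import Parallel, delayed

def square(i):
    return i * i

# Ordered: results are yielded in submission order, so a slow early task
# holds back the later ones.
ordered = Parallel(n_jobs=2, return_as="generator")(
    delayed(square)(i) for i in range(6)
)
print(list(ordered))      # [0, 1, 4, 9, 16, 25]

# Unordered: results are yielded as tasks finish, so the order can differ
# from run to run even with the same input.
unordered = Parallel(n_jobs=2, return_as="generator_unordered")(
    delayed(square)(i) for i in range(6)
)
print(sorted(unordered))  # same values, order of arrival may vary
```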
If you meant to close this, just close it again. But you didn't say you were closing it, so I think it was closed with a button click.