To view the benchmark results locally:

```
git checkout bench_chunking
cd nx-parallel/benchmarks
```

then copy the `results` and `html` folders from the VM (you will need access to it from @MridulS):

```
scp -rC root@37.27.34.9:nx-parallel/benchmarks/results .
scp -rC root@37.27.34.9:nx-parallel/benchmarks/html .
```

and run:

```
asv preview
```
Let me know if you face any errors.
Right now you will be able to see 3 benchmarks (`betweenness_centrality`, `number_of_isolates`, and `all_pairs_bellman_ford_path`). The `all_pairs_bellman_ford_path` one is not accurate because I just found a bug in its chunking version, so I'll have to re-run it.
Initially, asv was giving a timeout error for the 800-node graphs, so I increased `default_benchmark_timeout` to 1200 seconds. Also, the VM connection kept breaking because the benchmarks took so long to execute. Making a tmux session helped, but it still was not enough to run all the benchmarks, so I started running them one at a time. That worked for the above 3 algorithms, but the other algorithms were taking a lot of time and the VM connection would break in between. I am trying to find a way to run asv benchmarks in parts. But is there a way I could increase the time I stay connected to the VM?
Also, if you want, you can keep this as a separate branch in the repo, but it should not be merged into the main branch, so I'm keeping it as a draft PR.
All benchmarks were run on this machine (`machine.json`):

```json
{
    "arch": "x86_64",
    "cpu": "AMD EPYC-Milan Processor",
    "machine": "nx-benchmarks",
    "num_cpu": "4",
    "os": "Linux 5.15.0-92-generic",
    "ram": "15981980",
    "version": 1
}
```
Graph generation function:

```python
G = nx.fast_gnp_random_graph(num_nodes, edge_prob, seed=42, directed=False)
# directed=True only for all_pairs_bellman_ford_path

num_nodes = [50, 100, 200, 400, 800]
edge_prob = [0.8, 0.6, 0.4, 0.2]

# for tournament functions:
G = nx.tournament.random_tournament(num_nodes, seed=42)
```
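For reference, a rough sketch of how one of these asv benchmarks could be parametrized over those values (the class and method names here are illustrative, not the exact ones in this branch, and the call into nx-parallel is assumed to go through NetworkX's `backend="parallel"` dispatch):

```python
import networkx as nx

class ChunkingBenchmarkSketch:
    # asv runs the cartesian product of these parameter lists
    params = [[50, 100, 200, 400, 800], [0.8, 0.6, 0.4, 0.2]]
    param_names = ["num_nodes", "edge_prob"]
    timeout = 1200  # seconds, matching the adjusted default_benchmark_timeout

    def setup(self, num_nodes, edge_prob):
        self.G = nx.fast_gnp_random_graph(num_nodes, edge_prob, seed=42, directed=False)

    def time_betweenness_centrality(self, num_nodes, edge_prob):
        # assumes nx-parallel is installed and registered as the "parallel" backend
        nx.betweenness_centrality(self.G, backend="parallel")
```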
For `betweenness_centrality`, chunking was always better, and the speedup improved as the number of nodes increased:
For `number_of_isolates` there is not much difference with or without chunking:
Thanks!
Nice work setting this up in a way that runs on the remote server.
There are a few different reasons that the ssh session might disconnect. But if it is an issue of being "idle" (no activity for a certain period), then on your local machine set up your ssh config to use `ServerAliveInterval` and `ServerAliveCountMax`. There are server settings like `ClientAliveInterval` too, if we get those set up. The idea is that ssh will send some traffic across the connection every `ServerAliveInterval` seconds, and it won't give up until it fails to get a response `ServerAliveCountMax` times. Of course, if your connection is actually broken, it will eventually time out.
Set it in `~/.ssh/config` (applies to your user only) or `/etc/ssh/ssh_config` (applies system-wide):

```
Host *
    KeepAlive yes
    ServerAliveInterval 60
    ServerAliveCountMax 120

# optional alias for easier login
Host nickname
    HostName 37.27.34.9
    User root
```
You can change "nickname" to whatever you want to type instead of the ip number.
I'm not sure if the `KeepAlive` line is necessary, but it is in my config.
For `all_pairs_bellman_ford_path` the chunking version is also better than the non-chunking one, but it might be better to keep the non-chunking version because it's unlikely someone would want the whole dict:
[EDIT] Note: The above plot is incorrect. I'm trying to figure out the right way to integrate node_chunking in a function that returns a generator. Earlier I was using `return` inside `_calculate_shortest_paths_subset` and that just returned the shortest paths for the first node in the chunk. And we cannot use `yield` here either, because then we get `TypeError: cannot pickle 'generator' object`. And returning a dictionary with `return_as="generator"` might not be the right way to do it, and it also gives a wrong output:
```python
def _calculate_shortest_paths_subset(nodes):
    d = {}
    for source in nodes:
        d[source] = single_source_bellman_ford_path(G, source, weight=weight)
    return d
```
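If it helps, here is a rough sketch (not the implementation in this PR) of one way around the pickling problem: each job returns a plain dict for its chunk, and only the outer function is a generator, re-yielding the `(source, paths)` pairs lazily. The `node_chunks` parameter and the function name are made up for illustration.

```python
from joblib import Parallel, delayed
import networkx as nx

def all_pairs_bellman_ford_path_chunked(G, node_chunks, weight="weight", n_jobs=-1):
    """Yield (source, paths) pairs, computing each chunk of sources in parallel."""

    def _paths_for_chunk(nodes):
        # A dict pickles fine (unlike a generator), so each job returns one.
        return {
            source: nx.single_source_bellman_ford_path(G, source, weight=weight)
            for source in nodes
        }

    chunk_dicts = Parallel(n_jobs=n_jobs, return_as="generator")(
        delayed(_paths_for_chunk)(chunk) for chunk in node_chunks
    )
    for chunk_dict in chunk_dicts:
        yield from chunk_dict.items()
```

With this, `dict(all_pairs_bellman_ford_path_chunked(G, chunks))` should match the dict you get from the sequential `nx.all_pairs_bellman_ford_path(G)`, up to ordering.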
I see that the joblib context manager can maybe allow easy control of the number of jobs (without backend parameters) and without changing the machine characteristics. From the joblib docs: "The parallel_config() context manager helps selecting a specific backend implementation or setting the default number of jobs."
This example appears in the joblib docs:
```python
from math import sqrt
from joblib import Parallel, delayed, parallel_config

with parallel_config(backend='threading', n_jobs=2):
    Parallel()(delayed(sqrt)(i ** 2) for i in range(10))
# output: [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
```
The text immediately after the example says:
The latter is especially useful when calling a library that uses [joblib.Parallel](https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html#joblib.Parallel) internally without exposing backend selection as part of its public API.
So, maybe we can manage the number of jobs this way...
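For example, something along these lines might work (just a sketch: it assumes nx-parallel registers itself as the `"parallel"` backend with NetworkX's dispatch machinery, and whether the configured `n_jobs` actually reaches nx-parallel's internal `Parallel(...)` calls depends on how those calls are written):

```python
import networkx as nx
from joblib import parallel_config

# Assumes nx-parallel is installed so the "parallel" backend is available.
G = nx.fast_gnp_random_graph(200, 0.4, seed=42)

# Ask joblib to use 2 jobs by default, without touching nx-parallel's
# public API. This only takes effect if nx-parallel defers to the active
# joblib configuration instead of hard-coding n_jobs.
with parallel_config(n_jobs=2):
    bc = nx.betweenness_centrality(G, backend="parallel")
```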
Also, if we use the `threading` backend, then assuming that `n_jobs` is the number of cores is not good (because `n_jobs` is the number of threads here).

What is the practical difference between the number of cpus and the number of threads? If I split my nodes into 4 chunks and start processing each chunk in a separate thread, is that different (from the nx-joblib perspective) from starting each one on a separate cpu?
What are your overall results from this experiment timing node_chunking? Are there any lessons learned that we can use when deciding how to chunk a set of nodes?
Also, do you have a remote computer to run these on still? How is that connection timeout going?
What is the practical difference between number of cpus and number of threads? If I split my nodes into 4 chunks and start processing each chunk in a separate thread, is that different (from the nx-joblib perspective) from starting each one on a separate cpu?
Yes, there is a difference. Threads are smaller units of execution within a process, and we can have shared resources with threading (joblib's `threading` backend) but not with loky (joblib's default backend, which uses processes). Managing threads typically incurs less overhead than managing separate processes because threads share resources, but excessive threading can lead to conflicts or unexpected behavior (like multiple threads attempting to access a shared resource simultaneously). Also, Python has the GIL (Global Interpreter Lock), which protects access to Python objects and prevents multiple threads from executing Python bytecode simultaneously in the same process. This means that even on a multi-core machine, only one thread can execute Python bytecode at a time in a given Python interpreter process. (ref. https://docs.python.org/3/glossary.html#term-global-interpreter-lock)
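A tiny illustration (not from this PR) of that difference for CPU-bound, pure-Python work: with the `threading` backend the four tasks effectively run one at a time because of the GIL, while `loky` runs them in separate processes and can use four cores, so on a multi-core machine the second run should finish noticeably faster.

```python
from time import perf_counter
from joblib import Parallel, delayed

def busy(n=2_000_000):
    # CPU-bound pure-Python loop; it holds the GIL while it runs.
    s = 0
    for i in range(n):
        s += i * i
    return s

for backend in ("threading", "loky"):
    t0 = perf_counter()
    Parallel(n_jobs=4, backend=backend)(delayed(busy)() for _ in range(4))
    print(backend, round(perf_counter() - t0, 2), "seconds")
```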
My concern was that if we want to have a parameter "number_of_cpus" in benchmarking, then we should not treat it as equivalent to "n_jobs" in joblib, firstly because it's not, and secondly because "number_of_cpus" suggests we are changing the hardware, which we are not (we are changing `n_jobs`). We should instead call that parameter "n_jobs" itself, to avoid any confusion. Let me know what you think.
What are your overall results from this experiment timing node_chunking?
I updated the first comment in this PR with benchmark results for 4 algorithms. For the remaining 2:

- `all_pairs_bellman_ford_path`: It's not possible to build a generator of generators in joblib (the values returned from each parallel process/thread get pickled, and you cannot pickle a generator object). So I'm not sure what "chunking" even means for algorithms that return a generator or iterator. Also, joblib does return a "normal" generator object when we set `return_as="generator"`, but I think it evaluates the generator in parallel, because converting the generator returned by the non-chunking parallel implementation into a dict (using `dict(paths)`) takes less time than converting the generator returned by the sequential implementation. So I was trying to understand how joblib produces a generator that it evaluates in parallel, and I thought that might help me understand better what "chunking" would mean in this context.
- `tournament.is_strongly_connected` took a lot of time, more than 20 minutes even for a 100-node graph (I'm trying to figure out why).

Are there any lessons learned that we can use when deciding how to chunk a set of nodes?
I don't think it would be fair to say anything based on just 4-5 algorithms. But I think we can say that chunking could be better than no chunking when we want to merge the results from each chunk afterwards and that merging is not trivial or inexpensive (based on `betweenness_centrality` and `number_of_isolates`). And I think we can discover more things as we add more algorithms. :)
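For reference, the kind of node chunking being discussed is basically just splitting the node set into a few roughly equal pieces before handing them to joblib; a hypothetical helper (not the exact one used in nx-parallel) could look like:

```python
from itertools import islice

def chunk_nodes(nodes, n_chunks):
    """Split an iterable of nodes into n_chunks roughly equal lists."""
    nodes = list(nodes)
    size, extra = divmod(len(nodes), n_chunks)
    it = iter(nodes)
    for i in range(n_chunks):
        # The first `extra` chunks get one extra node each.
        yield list(islice(it, size + (1 if i < extra else 0)))
```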
Also, do you have a remote computer to run these on still? How is that connection timeout going?
I was able to increase the connection timeout by fixing the config as you suggested in this comment. Thank you for that! And I added the `results` and `html` folders in this PR because I thought we would probably not merge this, and it would be nice to have the results here if anyone wants to see them in the future. They are also on the VM.
Thank you :)
Thanks for the description of the difference between threads and jobs and cpus. I agree with you that we should be careful with the names of each concept. `n_jobs` is different from `number_of_cpus`.
About generators, IIUC, there are two ways joblib handles the generators: ordered and unordered. With ordered, the second generator has to wait until the first is done/stops before yielding its results. In the unordered case, objects are yielded in the order the processes complete their yielding with no waiting. Then the order things are yielded from the combined generator can change from run to run even with the same data. But my understanding of this is not from experience, just memories from reading the docs a few weeks ago. ;}
Quoting the doc:
Using return_as='generator' allows to progressively consume the outputs as they arrive and keeps the memory at an acceptable level. In this case, the output of the Parallel call is a generator that yields the results in the order the tasks have been submitted with. If the order of the tasks does not matter (for instance if they are consumed by a commutative aggregation function), then using return_as='generator_unordered' can be even more efficient.
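A minimal sketch of that difference (assuming a recent joblib; both `return_as="generator"` and `return_as="generator_unordered"` are fairly new additions):

```python
from joblib import Parallel, delayed

def square(i):
    return i * i

# Ordered: results are yielded in submission order, so a slow early task
# holds back the later ones.
ordered = Parallel(n_jobs=2, return_as="generator")(
    delayed(square)(i) for i in range(6)
)
print(list(ordered))      # [0, 1, 4, 9, 16, 25]

# Unordered: results are yielded as tasks finish, so the order can differ
# from run to run even with the same input.
unordered = Parallel(n_jobs=2, return_as="generator_unordered")(
    delayed(square)(i) for i in range(6)
)
print(sorted(unordered))  # same values, order of arrival may vary
```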
If you meant to close this, just close it again. But you didn't say you were closing it, so I think it was closed with a button click.