Closed Schefflera-Arboricola closed 8 months ago
I wanted some guidance on how we should decide to include more parameters. I was thinking of including the following `params` and `param_names`:

- `graph_type` : ["undirected", "directed", "multigraph"]
- `is_weighted` : ["weighted", "unweighted"]
- `graph_generation` : ["fast_gnp_random_graph", "erdos_renyi_graph", "watts_strogatz_graph", "barabasi_albert_graph"]
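For illustration, a minimal sketch of how these parameters might look in an ASV benchmark class (the class name, graph sizes, and the reduced value lists here are hypothetical placeholders, not the actual nx-parallel layout):

```python
import random

import networkx as nx


class ProposedParams:
    # ASV runs each time_* method once per combination of these values
    params = [
        ["undirected", "directed"],
        ["weighted", "unweighted"],
        ["fast_gnp_random_graph", "barabasi_albert_graph"],
    ]
    param_names = ["graph_type", "is_weighted", "graph_generation"]

    def setup(self, graph_type, is_weighted, graph_generation):
        n, p, seed = 100, 0.3, 42  # assumed sizes for this sketch
        if graph_generation == "fast_gnp_random_graph":
            G = nx.fast_gnp_random_graph(
                n, p, seed=seed, directed=(graph_type == "directed")
            )
        else:
            G = nx.barabasi_albert_graph(n, 3, seed=seed)
        if is_weighted == "weighted":
            rng = random.Random(seed)
            for u, v in G.edges:
                G[u][v]["weight"] = rng.random()
        self.G = G

    def time_betweenness_centrality(self, graph_type, is_weighted, graph_generation):
        # only this body is timed by ASV; setup() runs beforehand
        nx.betweenness_centrality(self.G)
```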
But when should we include a new parameter: only when its timing differs significantly from that of the undirected, unweighted graph made by `nx.fast_gnp_random_graph`? Or should we include different graph types (and other parameters) simply because the algorithm can accept them?
Thank you :)
[EDIT] I tried running benchmarks for `graph_type` in ["directed", "undirected"] and `is_weighted` in ["True", "False"] (I used `random.random()` to assign random weights to edges, and `G = nx.fast_gnp_random_graph(num_nodes, edge_prob, seed=42, directed=True)` to create random directed graphs). There wasn't much variation from the undirected, unweighted graph, at least for `betweenness_centrality`, as seen in the following screenshots:

- negligible variation between directed and undirected graphs
- negligible variation between weighted and unweighted graphs
- all benchmarks
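Concretely, the setup described in this edit can be sketched as follows (the `num_nodes` and `edge_prob` values are placeholders, not the ones used for the screenshots):

```python
import random

import networkx as nx

num_nodes, edge_prob = 200, 0.4  # placeholder values for the sketch

# random directed graph, as described in the edit above
G = nx.fast_gnp_random_graph(num_nodes, edge_prob, seed=42, directed=True)

# random.random() assigns a weight in [0, 1) to every edge
for u, v in G.edges:
    G[u][v]["weight"] = random.random()
```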
Replying to the "Some big picture questions" comment:
- Each `bench_` file corresponds to a folder (or file) in the `networkx/algorithms` directory in NetworkX.
- Each class in a `bench_` file corresponds to a file in that folder (there is just one class if it's a file itself).
- The `x` in `bench_x` corresponds to the folder name (the class name and `x` are the same if it's a file in `networkx/algorithms`).
- Each `time_` function corresponds to a function in the file.
- For `generators`, `classes`, `linalg`, `utils`, etc., we can have different `bench_` files, each with different classes corresponding to the different files in each of these folders.

> should we use a class here? Does asv require it? It seems like we are really just setting up a function (sorry if I am just forgetting what asv needs here...)
I have used classes just to keep it all organised.
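As a concrete (hypothetical) example of this organisation, a `bench_centrality.py` following the mapping above might look like:

```python
# Hypothetical bench_centrality.py: the file maps to the
# networkx/algorithms/centrality/ folder, the class to a file inside it
# (betweenness.py), and each time_* method to a function in that file.
import networkx as nx


class Benchmark:
    """Common base class for all benchmark classes (empty for now)."""
    pass


class Betweenness(Benchmark):
    def setup(self):
        # graph construction is excluded from the timing
        self.G = nx.fast_gnp_random_graph(50, 0.5, seed=42)

    def time_betweenness_centrality(self):
        nx.betweenness_centrality(self.G)
```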
Is there another scientific-python oriented repo that has an established directory structure for benchmarks that we could follow?
I will see if there is a better way of organizing. And, please let me know what you think of the current way.
Thank you :)
@dschult I've made the suggested updates, please let me know if this looks good to you.
@dschult thank you for the detailed and insightful review!
Yeah, it's better if we have graph creation and timing functions separate. I guess I combined them to make the code more compact, thinking that the graph creation time would be included for both of the `algo_types` (but that wouldn't be correct timing for the functions, now that I think about it; thanks for pointing it out!). Also, there might be cases where we need different initial graphs for different functions in the same class; then I think we can have `if` conditions in the setup function.
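The point about keeping graph creation out of the timed function can be seen with a pure-stdlib sketch (no NetworkX involved; the list build stands in for graph construction): timing "construct + run" inflates the measurement compared with timing the run alone.

```python
import timeit


def build():
    # stand-in for graph construction (e.g. fast_gnp_random_graph)
    return list(range(100_000))


data = build()

# "construct + run" vs "run only": ASV puts construction in setup()
# precisely so that only the second kind of number is reported
with_build = timeit.timeit(lambda: sum(build()), number=20)
run_only = timeit.timeit(lambda: sum(data), number=20)
assert with_build > run_only
```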
About the base `Benchmark` class: I was referring to the scipy and numpy libraries, and they have a base class (with just a `pass` statement) for all benchmarks. In pandas, however, there is no common base class. So I decided to have a common base class for all the benchmark classes, in case we would like to add something to it in the future. Please let me know your thoughts on this.
Also, I agree that using `backend="parallel"` is better than `H = nxp.ParallelGraph(G)`, because converting to a `ParallelGraph` does take some extra time (not a lot, but around 0.00000405... seconds). This way we also won't need the `if algo_type == ...` condition, and `timing_func` will become just a one-line function, so we can get rid of it.
Also, I think it would be better to name it `backends` instead of `algo_types`.
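Under that renaming, the benchmark class might look like the sketch below. Here `None` stands for the plain NetworkX implementation; the one-line timing body assumes NetworkX's backend dispatch (NetworkX >= 3.2) accepts the `backend` keyword, and dispatching to `"parallel"` of course only works with nx-parallel installed.

```python
import networkx as nx


class BetweennessCentrality:
    # "backends" instead of "algo_types", as suggested above
    params = [[None, "parallel"]]
    param_names = ["backends"]

    def setup(self, backend):
        self.G = nx.fast_gnp_random_graph(60, 0.5, seed=42)

    def time_betweenness_centrality(self, backend):
        # one line, no if-condition and no ParallelGraph conversion:
        # backend=None runs plain NetworkX, "parallel" dispatches to nx-parallel
        nx.betweenness_centrality(self.G, backend=backend)
```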
Thank you :)
Also, I think we should keep the `get_graph` function, because then the graphs will be saved in the cache (because of `@lru_cache(typed=True)`), so it would take less time when we run all the benchmarks. Please let me know your thoughts on this :)
I expect that we will need many different types of graphs for the algorithms, so `get_graph` may not be usable very often. At least that is what happens with tests and pytest: we create a test class and put in some graphs that could be used by lots of tests, but they end up being used by only a couple of the tests, and the rest construct their own graphs. So we may want to create a different setup function for many of the timing functions. I think we'll have to wait and see what is most useful. Keeping it as a common function is fine with me. Could we name it something like `cached_gnp_random_graph`?
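A sketch of that helper under the suggested name (the signature here is an assumption; the decorator mirrors the `@lru_cache(typed=True)` mentioned above):

```python
from functools import lru_cache

import networkx as nx


@lru_cache(typed=True)
def cached_gnp_random_graph(num_nodes, edge_prob, seed=42, directed=False):
    # identical argument combinations return the same cached graph object
    # instead of rebuilding the graph on every call
    return nx.fast_gnp_random_graph(num_nodes, edge_prob, seed=seed, directed=directed)
```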
Does caching work well with floating point inputs like `p`? Nice binary values like 0.25 should work, but round-off can occur with, e.g., 0.3. I guess if the round-off is the same for each call of the function we're OK.
In general, caching is nice for parts that either take a lot of time or get repeated many times. We'll see, as we get more examples, whether the timing is taking a long time and how much of that time is spent constructing the graphs. Let's leave it in for now.
> Does caching work well with floating point inputs like p? Nice binary values like 0.25 should work, but round-off can occur with 0.3, e.g. I guess if the roundoff is the same for each call of the function we're ok.

If the combination of arguments is the same, then the function is not run again and the graph is returned from the cache; it is independent of the type of the argument.
Ahh.. but are 0.3 and 0.29999999999999 the same? How does the caching software determine that the values are equal? It looks like `lru_cache` doesn't reliably cache floating point arguments (due to round-off) unless the value is created the same way each time. I think this will work for us as long as the value of e.g. `edge_prob` is not recalculated a different way (e.g. by `np.arange(0.1, 1.0, 0.1)`).
```python
import functools

@functools.lru_cache
def f(x):
    print(x)

f.cache_info()
# CacheInfo(hits=0, misses=0, maxsize=128, currsize=0)
f(3/10)
# 0.3
f(1/10*3)
# 0.30000000000000004 (round-off)
f.cache_info()
# CacheInfo(hits=0, misses=2, maxsize=128, currsize=2) <-- 3/10 and 1/10*3 are cached as different results
```
I think it's ready. Just made a few changes in the README :)
Issue: https://github.com/networkx/nx-parallel/issues/12

Here I implemented a basic benchmarking infrastructure using ASV for comparing the `parallel` and `normal` implementations of the `betweenness_centrality` function. I couldn't figure out how to run it for different numbers of nodes and edge probabilities; we would probably have to define our own benchmarking classes for that. The steps to run the benchmarks are in README.md (result screenshot):
Please give your feedback on how to improve it.
Thank you :)
References: ASV docs, networkx benchmarking, numpy benchmarking