stingergraph / StingerGraphs.jl

Julialang bindings to the STINGER graph database
http://www.stingergraph.com

Neighbourhood iterators #19

Closed rohitvarkey closed 7 years ago

rohitvarkey commented 7 years ago

Aimed at #14.

rohitvarkey commented 7 years ago

Performance on master.

julia> bench(10, 16)
INFO: Running BFS benchmark
BenchmarkTools.Trial:
  memory estimate:  138.96 mb
  allocs estimate:  611253
  --------------
  minimum time:     1.024 s (0.92% GC)
  median time:      1.071 s (0.94% GC)
  mean time:        1.076 s (0.94% GC)
  maximum time:     1.128 s (1.08% GC)
  --------------
  samples:          9
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

Performance on this branch.

julia> bench(10, 16)
INFO: Running BFS benchmark
BenchmarkTools.Trial:
  memory estimate:  913.46 mb
  allocs estimate:  18617182
  --------------
  minimum time:     414.427 ms (15.24% GC)
  median time:      432.521 ms (15.22% GC)
  mean time:        434.796 ms (15.21% GC)
  maximum time:     458.556 ms (15.03% GC)
  --------------
  samples:          20
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

We see roughly a 2.5x speedup with the neighborhood iterators over the getsuccessors version (median BFS time drops from about 1.07 s to 0.43 s). However, there are a lot more allocations going on.
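For context, here is a minimal, self-contained sketch of the two access patterns being compared; the toy type and function names below are illustrative only, not the StingerGraphs.jl API.

```julia
# Toy adjacency structure, purely to illustrate the two neighbourhood access
# patterns compared in this PR (not the STINGER data layout or API).
struct ToyGraph
    adj::Vector{Vector{Int64}}
end

# getsuccessors-style access: materializes a fresh Vector of neighbours on
# every call, so a BFS pays at least one allocation per visited vertex.
toy_successors(g::ToyGraph, v::Integer) = copy(g.adj[v])

# Iterator-style access: yields neighbours lazily from the existing storage,
# so the BFS hot loop can avoid building intermediate arrays.
toy_neighbors(g::ToyGraph, v::Integer) = (u for u in g.adj[v])

g = ToyGraph([[2, 3], [1, 3], [1, 2]])
for u in toy_neighbors(g, 1)
    println(u)
end
```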

I also had to comment out the check for whether the source was greater than the number of vertices, as it was causing type instabilities in the bfs function for some reason.
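The instability is the usual early-return-with-a-different-type pattern; a generic sketch (not the actual bfs code from this PR):

```julia
# The early return yields `nothing` while the normal path yields a
# Vector{Int64}, so the inferred return type becomes a Union and callers
# lose type information.
function unstable_levels(nv::Int, source::Int)
    source > nv && return nothing
    return fill(-1, nv)
end

# One type-stable alternative: keep a single return type and report the bad
# input out-of-band (here by throwing).
function stable_levels(nv::Int, source::Int)
    source > nv && throw(ArgumentError("source $source > nv $nv"))
    return fill(-1, nv)
end

# `@code_warntype unstable_levels(10, 1)` highlights the Union return type.
```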

cc @jpfairbanks

jpfairbanks commented 7 years ago

That is a good improvement. I think the increase in allocations comes from the fact that types are allocated on the heap while immutables can live on the stack. Try switching all the stinger structures to immutables and see if that reduces the allocations.
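Roughly, the change looks like this (toy fields, not the real stinger structs; the Julia of this era spelled these `type` and `immutable`, current Julia spells them `mutable struct` and `struct`):

```julia
# Mutable version: each instance is a separate heap object, so constructing
# one per edge inside the BFS loop shows up directly in the allocation count.
mutable struct MutableEdge
    neighbor::Int64
    weight::Int64
end

# Immutable version: a plain isbits value that can live on the stack or in
# registers, so tight loops over edges need not allocate at all.
struct ImmutableEdge
    neighbor::Int64
    weight::Int64
end

function sum_neighbors(n)
    s = 0
    for i in 1:n
        e = ImmutableEdge(i, 1)   # no per-iteration heap allocation
        s += e.neighbor
    end
    return s
end

# `@allocated sum_neighbors(10^6)` should report (near) zero bytes; swapping
# in MutableEdge makes the same loop allocate on every iteration.
```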

rohitvarkey commented 7 years ago

I changed the types to immutables and the times roughly halve again (median 433 ms down to 222 ms), so it is now around 5x faster than the version on master!

julia> bench(10, 16)
INFO: Running BFS benchmark
BenchmarkTools.Trial:
  memory estimate:  34.60 mb
  allocs estimate:  615462
  --------------
  minimum time:     203.119 ms (0.95% GC)
  median time:      221.513 ms (1.79% GC)
  mean time:        224.224 ms (1.58% GC)
  maximum time:     297.731 ms (1.45% GC)
  --------------
  samples:          36
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
rohitvarkey commented 7 years ago

I think this PR is ready to be merged now. Can you take a final look at it, @jpfairbanks?

jpfairbanks commented 7 years ago

That is great! 5x over the initial implementation means we are within 1.5x of native stinger, right? What does that imply for our main research question: showing that it is possible to get high-performance interaction between Julia and a complex C library?

rohitvarkey commented 7 years ago

> That is great! 5x over the initial implementation means we are within 1.5x of native stinger, right? What does that imply for our main research question: showing that it is possible to get high-performance interaction between Julia and a complex C library?

I'm working on running the benchmarks again on PACE so that we get a definitive answer on this.

rohitvarkey commented 7 years ago

I ran the benchmarks on PACE from scale 10 to 16 with the following results. Julia seems to be faster than C in most cases. This could be due to sampling error, as we are only doing 1 run per sample. Also, the Julia library is calling a different stinger library (the current dev branch) than the dynograph C benchmark (custom stinger). Performance improvements in the dev branch that are not incorporated into the dynograph stinger might also be a reason for Julia performing better in some cases.

| Scale | Julia (ms) | C (ms) | Julia/C |
| --- | --- | --- | --- |
| 10 | 347.778 | 308.193111 | 1.12844183593708 |
| 11 | 759.545 | 979.772495 | 0.775225885474566 |
| 12 | 1667 | 2205.979821 | 0.755673276849979 |
| 13 | 4116 | 5455.422772 | 0.754478648497309 |
| 14 | 8425 | 11761.607389 | 0.716313656913888 |
| 15 | 18021 | 23127.020348 | 0.77921840897928 |
| 16 | 35918 | 48674.672364 | 0.737919707633515 |

> What does that imply for our main research question: showing that it is possible to get high-performance interaction between Julia and a complex C library?

This is pretty good for our research question as Julia seems to be performing on the order of C at this point. 😄

cc: @jpfairbanks @ehein6

ehein6 commented 7 years ago

What does "scale" mean? How big is the graph, and how many vertices are actually being touched during the BFS?

Eric



jpfairbanks commented 7 years ago

This is great news!

Make sure you include the PBS script in the repo. Did you set `-l nodes=1:ppn={1,2,4,8,16}` and `OMP_NUM_THREADS=ppn`? Julia being faster than C makes me think that C is configured with OMP_NUM_THREADS>1 but is actually only using 1 thread.

Can you add a column for the master branch, showing Julia going through the collection-of-neighbors interface? Also, testing up to scale 20 will be necessary for publication.

@ehein6, are there changes in dynograph that could account for this performance difference?


rohitvarkey commented 7 years ago

@ehein6 I am using a Kronecker graph generator to generate the input graph. The implementation can be found here. The graph has 2^scale vertices and I'm using an edgefactor of 16. The benchmark runs BFS with each of the first 1000 vertices as the source (same as the Julia benchmark).
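For a rough sense of the sizes involved (assuming the usual Kronecker convention of 2^scale vertices and edgefactor * 2^scale generated edges, before duplicates and self-loops are removed):

```julia
# Graph size implied by a given scale under the convention described above.
scale, edgefactor = 16, 16
nv = 2^scale          # 65_536 vertices
ne = edgefactor * nv  # 1_048_576 generated edges (before dedup/self-loops)
println((nv, ne))
```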

Pasting the entire output from dynograph

{"batch":0,"epoch":0,"num_edges":8031,"num_traversed_edges":[5471851],"num_vertices":1023,"region_name":"bfs","source_vertex":1021,"time_ms":308.193111,"trial":0}
{"batch":2,"epoch":0,"num_edges":23690,"num_traversed_edges":[17345293],"num_vertices":2047,"region_name":"bfs","source_vertex":2046,"time_ms":979.772495,"trial":0}
{"batch":5,"epoch":0,"num_edges":49442,"num_traversed_edges":[35088743],"num_vertices":4095,"region_name":"bfs","source_vertex":4091,"time_ms":2205.979821,"trial":0}
{"batch":12,"epoch":0,"num_edges":110223,"num_traversed_edges":[75862069],"num_vertices":8191,"region_name":"bfs","source_vertex":8186,"time_ms":5455.422772,"trial":0}
{"batch":25,"epoch":0,"num_edges":226923,"num_traversed_edges":[151372146],"num_vertices":16382,"region_name":"bfs","source_vertex":16381,"time_ms":11761.607389,"trial":0}
{"batch":51,"epoch":0,"num_edges":464849,"num_traversed_edges":[282009015],"num_vertices":32766,"region_name":"bfs","source_vertex":32765,"time_ms":23127.020348,"trial":0}
{"batch":103,"epoch":0,"num_edges":948289,"num_traversed_edges":[575443304],"num_vertices":65534,"region_name":"bfs","source_vertex":65526,"time_ms":48674.672364,"trial":0}
rohitvarkey commented 7 years ago

@jpfairbanks I just ran it in an interactive session; I just wanted to get some rough benchmark numbers. I'll work on a PBS script and run it up to scale 20 tonight. I haven't set OMP_NUM_THREADS explicitly, so you're probably right about C running with only one thread.

jpfairbanks commented 7 years ago

OK, when you write the PBS script you need to set the number of cores using the ppn variable. In the script, set the number of OMP threads to match the number of cores requested.

tutorial: https://kb.iu.edu/d/avmy

We should have the numbers all the way to scale 20 by our meeting tomorrow.
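Something along these lines keeps the C benchmark's thread count pinned to the requested cores (the binary name and flags here are placeholders, not the actual dynograph invocation):

```julia
# Sketch: launch the C benchmark with OMP_NUM_THREADS matching the ppn
# requested from PBS, so both implementations use the same core count.
ppn = 4
withenv("OMP_NUM_THREADS" => string(ppn)) do
    run(`./dynograph_benchmark --scale 16`)  # placeholder command/flags
end
```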

ehein6 commented 7 years ago

Some notes about making sure the C benchmark is running at peak performance...

Eric



rohitvarkey commented 7 years ago

Thanks @ehein6. I have a couple more questions regarding the settings.

Do you have a recommendation for the value I should set as batch size? I just used 10000 for now.

> I recently optimized the hooks that are used to calculate num_traversed_edges at runtime so they are more likely to be inlined. If possible, you might try upgrading to the latest version of stinger-dynograph (though I know that will probably be a nasty merge conflict at this point.)

The version that I compiled was the latest stinger-dynograph; I just made the same change I made earlier to the bfs benchmark in the stinger packaged with it. But I wasn't able to use this version of stinger in the Julia wrapper, as it compiled only archive (.a) files instead of .so libraries. Is there a flag I can use to create shared objects so that I can use these from Julia for a fair comparison?

> Make sure you are building in Release mode, set -DCMAKE_BUILD_TYPE=Release when you run cmake to configure the build. Otherwise the compiler does no optimizations.

Does this apply to a general Stinger build also?

ehein6 commented 7 years ago

> Do you have a recommendation for the value I should set as batch size?

Batch size controls how the graph is constructed, which isn't relevant for your experiments. You can pass sort_mode=snapshot to bypass batch loading and load everything at once.

> Is there a flag I can use to create shared objects so that I can use these from Julia for a fair comparison?

There is a cmake variable that controls this, BUILD_SHARED_LIBS. This variable was being ignored in mainline stinger; I have a pull request that fixes this (https://github.com/stingergraph/stinger/pull/216), which is integrated into stinger-dynograph. The old hooks needed libstinger_core to be static to link with. I think the new edge-count hooks should work with dynamic linking, but I haven't tried this yet.

> Does this apply to a general Stinger build also?

Yes, you should also build stinger in release mode.

Eric



rohitvarkey commented 7 years ago

These are the results of the latest benchmarks I ran.

OMP_NUM_THREADS=4, Stinger built in non-release mode

| Scale | Julia (ms) | C (ms) | Julia/C |
| --- | --- | --- | --- |
| 10 | 339.292 | 340.953631 | 0.995126519124825 |
| 11 | 730.734 | 938.893285 | 0.778292923886446 |
| 12 | 1574 | 1834.756013 | 0.857879733788887 |
| 13 | 3792 | 4592.596247 | 0.82567676234919 |
| 14 | 8734 | 8862.142198 | 0.985540494032141 |
| 15 | 18295 | 17323.834421 | 1.05605950480702 |
| 16 | 36559 | 35110.566023 | 1.04125350688027 |
| 17 | 74056 | 65552.473523 | 1.12972090937219 |
| 18 | 159295 | 130430.953979 | 1.22129751520216 |
| 19 | 296228 | 284814.942753 | 1.04007183449254 |
| 20 | 672909 | 610770.537291 | 1.1017378195494 |

OMP_NUM_THREADS=4, Stinger built in release mode

| Scale | Julia (ms) | C (ms) | Julia/C |
| --- | --- | --- | --- |
| 10 | 190.554 | 94.104422 | 2.02492078427515 |
| 11 | 390.257 | 243.868121 | 1.60027886547746 |
| 12 | 787.654 | 458.297184 | 1.71865337056053 |
| 13 | 1578 | 955.802061 | 1.65096944690518 |
| 14 | 3329 | 1935.527004 | 1.71994500367095 |
| 15 | 9766 | 4533.5319 | 2.15417035005312 |
| 16 | 22578 | 9688.3277 | 2.33043314585653 |
| 17 | 44756 | 18595.688007 | 2.40679452049058 |
| 18 | 84104 | 36695.128629 | 2.29196634927539 |
| 19 | 172049 | 77780.002164 | 2.21199531001854 |
| 20 | 358477 | 165515.153103 | 2.16582586717556 |

cc: @jpfairbanks @ehein6

jpfairbanks commented 7 years ago

So the Julia code is single-threaded and within 4x of the multithreaded C implementation? That is pretty awesome! For the paper we should point out that we can get a faster serial implementation.

| Scale | Serial C | Serial Julia / Serial C | Multithreaded C / Serial C | @threads Julia / Serial C |
| --- | --- | --- | --- | --- |
| 16 | . | . | . | . |
| ... | . | . | . | . |
| 22 | . | . | . | . |

We should be beating the serial C stinger implementation but not the parallel C stinger implementation; if serial Julia could beat parallel C, then we would have a bad parallel C implementation.
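For the @threads Julia column, the measurement would presumably parallelize over the benchmark's source vertices, along these lines (`serial_bfs` and `g` are placeholders for the real kernel and graph, not the package API):

```julia
# Run the serial BFS kernel from each source vertex on its own thread.
# Julia must be started with JULIA_NUM_THREADS > 1 for this to use extra cores.
using Base.Threads

function bfs_all_sources_threaded(serial_bfs, g, sources)
    results = Vector{Any}(undef, length(sources))
    @threads for i in eachindex(sources)
        results[i] = serial_bfs(g, sources[i])  # placeholder serial kernel
    end
    return results
end
```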

rohitvarkey commented 7 years ago

Merging this.