nest / nest-simulator

The NEST simulator
http://www.nest-simulator.org
GNU General Public License v2.0

NEST as a SPEC CPU benchmark #3217

Open heshpdx opened 3 weeks ago

heshpdx commented 3 weeks ago

Hello friends,

I’m a CPU architect at Ampere Computing, where I do performance analysis and workload characterization. I also serve on the SPEC CPU committee, working on benchmarks for the next version of SPEC CPU, CPUv8. We try to find computationally intensive workloads in diverse fields to help measure performance across a wide variety of behaviors and application domains. Based on the longevity of nest, its large, active community in biology, and its use in education, I have proposed that the nest neural network model be included in the next set of marquee benchmarks in SPEC CPU.

As part of the effort, we have ported and integrated the nest mainline code into the SPEC CPU harness so that it can be tested on a wide variety of systems in a controlled environment to produce reproducible results. We have even built it on native Windows using MSVC and the Intel compiler for Windows – we are happy to share the changes if someone is interested in testing and integrating it back into the upstream mainline for the benefit of the community.

The piece we need help with is understanding the multithreaded workloads. Right now, we have single-threaded nest command lines which run and produce verifiable output across many compilers (llvm, gcc, icc, aocc, nvhpc, cray), ISAs (aarch64, x86, power), and operating systems (linux, windows, android). We verify a run by checking the .dat files that come out of the simulation to make sure there are no differences in the resulting output. A problem arises when we run with multiple threads, since a different number of files is produced, and I am unfamiliar with how to coalesce them for verification.

First some fundamental questions: Does a nest invocation with 8 threads perform the same amount of work as a run with 16 threads? Or is it that the problem being solved is larger? If it is the same, how can we verify that? Does this answer change based on the .sli script used?

In the example below, I run examples/nest/brunel-2000_newconnect_dc.sli (with a small edit to make it run longer) with 8 threads and with 16. It looks like I am simulating the same number of neurons and synapses. The 8-thread version outputs 16 files and the 16-thread version outputs 24 files. The total line counts across the files are close. Do they fundamentally contain the same information, just at different sample points?
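One option for making runs with different thread counts comparable is to coalesce the per-thread .dat files into a single canonical, sorted event list before diffing. Here is a minimal sketch, assuming each data line holds a sender id and a spike time (an assumption; adjust the parser to the actual recorder format):

```python
import glob

def coalesce(pattern):
    """Merge per-thread spike files into one time-sorted event list."""
    events = []
    for path in glob.glob(pattern):
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue  # skip blank lines and header comments
                sender, t = line.split()[:2]
                events.append((float(t), int(sender)))
    # Sorting by (time, sender) removes any thread-dependent ordering.
    return sorted(events)
```

Two runs could then be compared via coalesce("brunel-2-in-threaded-95002-*") regardless of how many files each produced.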

$ ./nest_s_base.O3-64 --userargs=threads=8 brunel-2000_newconnect_dc_LONG.sli
NEST 3.5.0-post0.dev0 (C) 2004 The NEST Initiative
Configuring neuron parameters.
Creating the network.
Configuring neuron parameters.
Creating excitatory spike recorder.
Creating inhibitory spike recorder.
Connecting excitatory neurons.
Connecting inhibitory population.
Connecting spike recorders.
Starting simulation.

Brunel Network Simulation
Number of Threads : 8
Number of Neurons : 95000
Number of Synapses: 902500100
       Excitatory : 722000000
       Inhibitory : 180500000
Excitatory rate   : 6.99 Hz
Inhibitory rate   : 6.825 Hz
Building time     : 23.76 s
Simulation time   : 448.07 s
$ ls -l brunel-2-in-threaded-95002-* | wc -l
16

$ wc -l brunel-2-in-threaded-95002-* | tail -1
 2811 total
$ ./nest_s_base.O3-64 --userargs=threads=16 brunel-2000_newconnect_dc_LONG.sli
NEST 3.5.0-post0.dev0 (C) 2004 The NEST Initiative
Configuring neuron parameters.
Creating the network.
Configuring neuron parameters.
Creating excitatory spike recorder.
Creating inhibitory spike recorder.
Connecting excitatory neurons.
Connecting inhibitory population.
Connecting spike recorders.
Starting simulation.

Brunel Network Simulation
Number of Threads : 16
Number of Neurons : 95000
Number of Synapses: 902500100
       Excitatory : 722000000
       Inhibitory : 180500000
Excitatory rate   : 6.985 Hz
Inhibitory rate   : 7.36 Hz
Building time     : 35.39 s
Simulation time   : 332.74 s
$ ls -l brunel-2-in-threaded-95002-* | wc -l
24

$ wc -l brunel-2-in-threaded-95002-* | tail -1
 2909 total

Overall, the goal is to be able to verify that the same amount of work was completed between these two command lines, and verify that they calculated the same result. This allows a benchmark to run on systems with a varying number of hardware cores, so we can measure CPU performance between them. We are allowed to provide some tolerance, in case there is floating point rounding error.
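The tolerance check described above can be sketched in a few lines; this is an illustrative helper, not part of the SPEC harness:

```python
# Sketch of a fuzzy compare: accept two runs as "the same result" if each
# numeric output agrees with the reference within a relative tolerance,
# which absorbs floating-point rounding differences across compilers.
def within_tolerance(ref, test, rel_tol=0.01):
    """True if every test value lies within rel_tol of its reference value."""
    if len(ref) != len(test):
        return False
    return all(
        abs(r - t) <= rel_tol * abs(r) if r != 0 else t == 0
        for r, t in zip(ref, test)
    )
```

A harness would feed this the numeric values parsed from the two runs' outputs.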

For the multithreaded benchmark, I am exercising the scripts below. The goal is to showcase scalable threading performance, as well as cover a variety of behaviors in the nest simulator.

brunel-2000_newconnect_dc_LONG.sli
brunel_ps_LONG.sli
hpc_benchmark.sli
microcircuit.sli

If you have feedback on which are more or less useful as multithreaded benchmarks, please share your thoughts!

Thank you!

heplesser commented 2 weeks ago

Hello Mahesh!

We are very excited that you have proposed NEST for inclusion in the SPEC CPUv8 benchmark suite! We would be happy to work with you to make this happen. I will answer your more specific questions below.

One important point about NEST (and other neuronal network simulators) is that you can run a wide range of neuronal network models on the NEST simulator, constructing the networks either through SLI or through Python scripts. And if networks are defined in the PyNN specification language, they can be executed on a range of neuronal network simulators, including some neuromorphic systems. Therefore, there isn't "the nest neural network model". The advantage of this is that one can configure networks that are suitable for benchmarking.

In our own work, we have mainly used the hpc_benchmark.sli and the microcircuit.sli benchmarks. The latter has become a reference benchmark also for neuromorphic and GPU-based simulators (I will post references later).

Concerning the specific Brunel-benchmark you used: Increasing the number of threads will distribute the same workload across more threads, i.e., strong scaling.

I am a bit confused by the number of output files you report. When running with eight threads, I would expect eight brunel-2-ex-threaded-... and eight brunel-2-in-threaded-... files, and with 16 threads correspondingly 16 ex and 16 in files. Could you double-check that?

I noticed that you used a NEST 3.5.0-post0.dev0 version from GitHub. We made some substantial improvements in threaded scaling in NEST 3.6, so I would strongly suggest updating to at least NEST 3.6, ideally NEST 3.7. With NEST 3.6, we see excellent scaling for the "microcircuit" benchmark up to 128 threads on dual AMD Epyc Rome systems.

Best, Hans Ekkehard

heshpdx commented 2 weeks ago

Thanks for responding, Hans!

I had a feeling you would ask me to rebase, so I attempted that three-way merge right after I posted above. The SPEC CPU harness builds applications in a totally different manner, so the process of adding a benchmark requires taking humpty-dumpty apart and putting him back together. Something may have gotten lost in that process, because after my merge, I get this runtime error immediately:

$ ./nest_r_base.O3-64 

sli-init Fatal []: 
    While executing module initializer: {(nest-init) run}

load Error []: UndefinedName

Start Error []: 
    Something went wrong during initialization of NEST or one of its modules. 
    Probably there is a bug in the startup scripts. Please report the output 
    of NEST at https://github.com/nest/nest-simulator/issues . You can try to 
    find the bug by starting NEST with the option --debug

Start Error []: 
    Something went wrong during initialization of NEST or one of its modules. 
    Probably there is a bug in the startup scripts. Please report the output 
    of NEST at https://github.com/nest/nest-simulator/issues . You can try to 
    find the bug by starting NEST with the option --debug

Here is the output from ./nest --debug: debug.log

Can you provide some hints as to what I should look at? I am building all the modules as well as models/models.cpp and models/modelsmodule.cpp. Is there a config or other flag I should be aware of?

heshpdx commented 2 weeks ago

On the topic of workloads, thank you for the feedback on the benchmark scripts.

When running with eight threads, I would expect eight brunel-2-ex-threaded-... and eight brunel-2-in-threaded-... files, and with 16 threads correspondingly 16 ex and 16 in files. Could you double-check that?

Indeed, you are correct! I counted wrong, sorry about that. So the question is, how can I use those files to prove the same work occurred between the two runs, and check that they both came to the "correct answer"? SPEC CPU benchmarks need to scale from 1 thread to N threads, and verify the results are the same using any N. This is the challenge in front of us.

For the single-threaded benchmark, I am using these scripts which seem to work well. Are they ok?

cuba_stdp.sli
structural_plasticity_benchmark.sli
ArtificialSynchrony.sli
heshpdx commented 2 weeks ago

Ahoy! Sometimes crafting a detailed message to the author is all that is required to solve one's problem. In making the debug.log and looking at the first line of the output, I realized that I had forgotten to update the nest/sli/sli-init.sli and other runtime files to your latest mainline. I did that and the program was able to run again! I will work on testing all my cmdlines with my 3.7 rebase. Thanks!

heshpdx commented 2 weeks ago

Regarding microcircuit.sli, we noticed that there is a bug when we try to run with more than 63 threads. It has something to do with a file that cannot be created. We worked around the issue by capping the max thread count in that script. Is this a known limitation?

diff -u Potjans_2014/microcircuit.sli ~/867.nest_s/data/refspeed/input/microcircuit.sli
--- Potjans_2014/microcircuit.sli     2024-06-07 16:31:55.464111605 +0000
+++ ~/867.nest_s/data/refspeed/input/microcircuit.sli   2024-05-10 16:28:22.329645268 +0000
@@ -137,6 +137,14 @@

 } def

+% This model only supports up to 63-way concurrency
+/maxthreads 63 def
+/Truncate63
+{
+  dup maxthreads gt {
+     pop maxthreads
+  } if
+} def

 % PrepareSimulation - Set kernel parameters and RNG seeds based on the
 % settings in sim_params.sli.
@@ -148,9 +156,9 @@
     % set global kernel parameters
     <<
        /resolution dt
-       /total_num_virtual_procs n_vp
+       /local_num_threads local_num_threads Truncate63
+       /total_num_virtual_procs n_vp Truncate63
        /overwrite_files overwrite_existing_files
        /rng_seed rng_seed
        output_path (.) neq {
            /data_path output_path
        } if
heplesser commented 2 weeks ago

@heshpdx NEST creates a separate spike recorder instance for each thread to allow non-blocking recording and each of these instances opens a separate file. Thus, the brunel....sli script running on 64 threads would open 128 output files. Could it be that you hit an operating system limit on the number of open files per process?

I presume that with SPEC CPU you want to test CPU performance rather than I/O performance. If that is the case, I would suggest dropping the spike-time recording to file, which would solve the problem with the limited number of files.

If SPEC rules allow you to do one run with some form of spike recording and another run without, one could use the following approach: Perform a run with spike recorders in which the spikes are recorded to memory. After the simulation time is up, extract the spike data from the spike recorder in the SLI script and write it to file. This requires only one file per process instead of one per thread. Also read out the spike counter in NEST. Then turn off spike recording completely and re-run the simulation. Read out the spike counter (it is always active) and you should get exactly the same number of spikes as when recording. I can send you code to do this.

When simulating the same network with different numbers of threads, you will obtain results that differ in detail, since the random number sequences in NEST are thread-specific. We have developed several measures to verify that simulations do produce statistically consistent results, see the paper by Gutzen et al below. If you positively need identical results independent of the number of threads used, we can develop a work-around for that.

For strong-scaling experiments, I would encourage you to use the microcircuit model, as it is currently the most widely used network model for benchmarking; see the paper by Kurth et al below. That paper also gives a fairly recent description of the state of the art, although NEST 3.6 and later now outperform the NEST 2.14 that was used in that paper.

Concerning a single-thread benchmark, what are the constraints on running a single benchmark? On my MacBook with Core i5, hpc_benchmark.sli takes in total about two minutes for a single run. If that is acceptable, I would suggest using hpc_benchmark.

heplesser commented 2 weeks ago

And here the promised references:

heshpdx commented 2 weeks ago

You are correct we want to be CPU bound and not IO bound. But even with the current code, outputting a file per thread, we are quite CPU bound.

Your spike-recorder-in-memory idea is very good. If you can coalesce that using SLI directives at the end of the simulation, that would solve our problem. The one limitation we have is that the run should fit within 64 GB of virtual memory (for the parallel runs). I imagine that is big enough.

Please do work on that when you get a chance. I recognize that the NEST conference is coming up and you will be busy with that!

The thread count limitation was only seen in microcircuit.sli. I'll see if I can dig up why. Most OSs have a default open-file-descriptor limit of 4096, so I think the issue may be more esoteric.

I'll take a look at hpc_benchmark for single-threaded runs. I wanted to have a wide variety of code coverage and CPU behavior across three or four workloads. The time limit for single-threaded runs is 3-5 minutes on a "modern CPU", and each must fit within 1.8 GB of memory.

heshpdx commented 2 weeks ago

Regarding exact answers - we do allow some tolerance based on floating point rounding. If there is a lot of randomness, we can try to reduce that via more deterministic randomness. I've seen that in many benchmark candidates, less (or zero) randomness still provides valid answers and doesn't detract from creating a benchmark representative of the application behavior in the field.

With my latest rebase I still get issues with running Potjans_2014/microcircuit.sli with more than 63 threads (in the SPEC CPU harness):

$ ./nest_s_base.O3-64 --userargs=threads=80 microcircuit.sli
NEST 3.7.0-post0.dev0 (C) 2004 The NEST Initiative

SimulationManager::set_status Info []: 
    Temporal resolution changed from 0.1 to 0.1 ms.

NodeManager::prepare_nodes Info []: 
    Preparing 79729 nodes for simulation.

RecordingBackendASCII::prepare() Error []: 
    I/O error while opening file 'spikes_3_0-74246-63.dat'.

Simulate_d Error []: IOError

I have enough file descriptors:

$ ulimit -Sn
1024
$ ulimit -Hn
1048576

I played around with the mainline nest build and realized the issue is not with threads but with /total_num_virtual_procs. I was able to recreate the problem on the mainline by increasing Potjans_2014/microcircuit.sli's /total_num_virtual_procs to 80, and it fails the same way as above. Can you share the correct way to scale up the parallelism in microcircuit.sli? What is your cmdline, and how should I set the parallelism to 32, and to 80?

heplesser commented 2 weeks ago

I had another look at microcircuit.sli and it creates a lot of files. The model consists of four cortical layers with two populations each, i.e., a total of eight populations. For each population, the script creates one spike recorder and one voltmeter, thus a total of 16 recording devices (implemented like this because it makes later analysis easier, as one does not need to split the data into populations). Each of these devices is replicated once per thread, so for 64 threads you would get 1024 files, and for 80 threads 1280, thus hitting your soft limit for file descriptors. To turn this off, open sim_params.sli and set the three /save_... variables to false (around line 92). Spikes will then be stored in memory and no files should be opened.
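The file-count arithmetic above is easy to check with a trivial sketch (the device count follows the explanation above):

```python
# microcircuit.sli opens one file per recording device per thread:
# 8 populations x (spike recorder + voltmeter) = 16 devices per thread.
DEVICES_PER_THREAD = 16

def files_opened(threads):
    return DEVICES_PER_THREAD * threads

assert files_opened(64) == 1024   # already at a 1024 soft limit
assert files_opened(80) == 1280   # exceeds it, matching the reported failure
```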

Concerning setting the number of threads, I am surprised that --userargs=threads=80 had an effect. Did you make some changes to microcircuit.sli to make this work? Since NEST combines MPI- and thread-based parallelization, we have

N_VP = M x T

where M is the number of MPI processes and T the number of threads per process. M is defined via mpirun. Inside a NEST script you can either set T through the parameter local_num_threads or N_VP through the parameter total_num_virtual_procs, which must be divisible by M. Since you work with M=1 only, both approaches are equivalent and you can stick with setting total_num_virtual_procs as is done in microcircuit.sli. NEST guarantees that one gets identical results for fixed N_VP, no matter how it is split between M and T.

The set of files in Potjans_2014 assumes that parameters are modified in sim_params.sli and network_params.sli, respectively, but I assume that for your purposes, you prefer to set parameters on the command line?

One more question: We are mostly using Python scripts in our benchmarking now. Would that be feasible for your benchmarks as well, or does that introduce too many confounding factors so that you would want to avoid Python?

heshpdx commented 2 weeks ago

Oh yes, I did change microcircuit.sli. In the SPEC CPU harness, a variable is passed down indicating how many threads to use. I need to employ that on the command line, and that is the amount of parallelism that will be requested. I didn't know the difference between threads per rank and virtual processors. For us, we get one invocation of the command line, and that one process can only spawn up to N threads. MPI is not used, and Python is not used. This is all C++ and cmdline, which is why the SLI scripts work so well. OpenMP is also allowed, but I opted to pass the threads variable along the cmdline and modify the scripts to support it, or so I thought!

Getting Python inside the harness is difficult or impossible, so I would prefer not to change now; we already have things working very well with the SLI. One thing you can help with is to craft the Potjans scripts in the proper way, so they can run in the harness with N threads specified on the cmdline. Is that possible? I can use that solution to make sure I am doing the right thing for hpc_benchmark. Thanks!

heplesser commented 2 weeks ago

Thanks for the explanations. I will create a version of microcircuit.sli that will allow you to set the number of threads on the command line and also to choose whether to record spikes or not, as well as the size of the network simulated. You could then use a downscaled version (10–20% of the full model) for the single-threaded case and the full model for scaling.

Two more questions concerning the SPEC CPU setup: Can you use thread-aware allocators such as jemalloc, and can you pin threads to cores? We have seen significant benefits from both in our benchmarks. In one case, we even found that we needed to change BIOS settings to ensure consistent pinning.

heshpdx commented 2 weeks ago

Thank you! Yes these are the two main issues. Outputting all the data into a fixed number of files will solve the problem.

Thread affinity is allowed, but that is done at a higher level in the harness, so the user must choose the same pinning scheme for all benchmarks.

Custom memory allocators are also allowed. Some benchmarks show gains when linking with jemalloc or tcmalloc, others don't.

heplesser commented 2 weeks ago

Hi @heshpdx!

I have now created a modified version of microcircuit.sli which you can find here:

https://github.com/heplesser/nest-simulator/blob/fix-3217-spec/examples/nest/Potjans_2014/microcircuit_spec.sli

It is monolithic, i.e., you need only this one SLI file. Parameters can be passed on the command line like this (all optional):

nest --userargs=threads=8:scale=0.4:seed=12345 microcircuit_spec.sli

It does not write any spikes to files, but writes a short report at the end to stdout:

-----------------------------
Simulation report
-----------------------------
Number of threads: 8
Scale            : 0.4
Network size     : 30882
Num connections  : 119605898
Simulated time   : 1000 ms
Total spikes     : 109493
Population spikes: [[10910 8421] [38636 14079] [16251 3848] [7539 9809]], sum: 109493

Time create nodes: 0.011213 s
Time connect     : 49.2629 s
Time prepare     : 4.53553 s
Time simulate    : 23.8674 s
-----------------------------

I hope this is suitable for SPEC use. With scale 0.4, it should be reasonable for a single thread, with scale 1.0 for larger numbers of threads. As I mentioned earlier, NEST benefits from jemalloc/tcmalloc and similar, and from thread pinning.

heshpdx commented 2 weeks ago

Thanks! I will try this out. For the verification to be scalable, we will change the output to look like the following (which might require commenting out some of the source):

-----------------------------
Simulation report
-----------------------------
Scale            : 0.4
Network size     : 30882
Num connections  : 119605898
Simulated time   : 1000 ms
Total spikes     : 109493
Population spikes: [[10910 8421] [38636 14079] [16251 3848] [7539 9809]], sum: 109493
-----------------------------

Basically, cut out any run-specific information such as thread count and time. Then we can diff this output while applying a tolerance to the spike numbers (if we need to).
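A sketch of that masking step, assuming the report format shown above (the prefix list is illustrative):

```python
# Drop run-specific report lines (thread count, timings) so reports from
# runs with different thread counts can be diffed directly.
VOLATILE_PREFIXES = ("Number of threads", "Time ")

def stable_lines(report):
    return [
        line for line in report.splitlines()
        if line and not any(line.startswith(p) for p in VOLATILE_PREFIXES)
    ]
```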

Would it be easy to do this for hpc_benchmark too? I will take a look tomorrow.

heplesser commented 2 weeks ago

To change the output, you only need to comment out or remove a few lines at the end of the script. The "Population spikes" line is only there as a double-check; "Total spikes" should suffice to check that the simulation is running correctly. "Network size" and "Num connections" depend on scale, but not on the number of threads or the random seed. For actual benchmark runs, you may want to pass "record=false" as a userarg. Then no spike recorders are created and you will only get the "Total spikes", not the "Population spikes". I just pushed another update to the script so that, if recording, it always records from all neurons, removing a potential source of confusion.

It should be rather straightforward to apply the same user-argument handling (at the beginning of the script) and reporting (at the end) to hpc_benchmark. The main difference is that hpc_benchmark only has two neuron populations, E(xcitatory) and I(nhibitory). Let me know if you run into problems!

heshpdx commented 2 weeks ago

Thank you, this looks good. If I run it multiple times with the same cmdline, I get the same results. When I run it with a varying number of threads, the Total spikes and Population spikes aren't exact, but they are close. What kind of delta would you expect? Here is some data using --userargs=threads=N:scale=.3:seed=108:record=true microcircuit_spec.sli and varying N. I am only showing the numbers that change.

N=13, binary built with GCC-12 -O3
    Number of local nodes: 23355
Total spikes     : 83873
Population spikes: [[8740 6585] [29131 10682] [12872 2916] [5495 7452]], sum: 83873
N=21, with GCC-12 -O3
    Number of local nodes: 23483
Total spikes     : 83266
Population spikes: [[8634 6305] [29095 10586] [12701 2920] [5607 7418]], sum: 83266
N=21, with LLVM-16 -O3
    Number of local nodes: 23483
Total spikes     : 82910
Population spikes: [[8269 6292] [29086 10566] [12745 2918] [5640 7394]], sum: 82910
N=28, with GCC-12 -O3
    Number of local nodes: 23483
Total spikes     : 82059
Population spikes: [[8689 6498] [29025 10433] [11816 2892] [5361 7345]], sum: 82059
N=58, with GCC-12 -O3
    Number of local nodes: 24075
Total spikes     : 84633
Population spikes: [[9307 6772] [28882 10516] [13149 2942] [5525 7540]], sum: 84633
N=58, with LLVM-16 -O3
    Number of local nodes: 24075
Total spikes     : 83191
Population spikes: [[8575 6375] [29072 10464] [12682 2925] [5609 7489]], sum: 83191
N=160, with GCC-12 -O3
    Number of local nodes: 25707
Total spikes     : 82472
Population spikes: [[9192 6471] [28472 10390] [12376 2929] [5269 7373]], sum: 82472
N=160, with LLVM-16 -O3
    Number of local nodes: 25707
Total spikes     : 82667
Population spikes: [[8948 6371] [28917 10423] [12360 2917] [5340 7391]], sum: 82667

It appears that the number of local nodes grows as I add more threading. My machine maxes out at 160 hardware cores, and that run shows the largest number of local nodes, although the "network size" is invariant across all these runs at 23163. Is this ok? Are any of the output values "out of bounds" in your opinion? It looks like we need a tolerance of about 3% for the spike counts and sum.
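As a sanity check on that estimate, the spread of the "Total spikes" values listed above can be computed directly:

```python
# Relative spread of the "Total spikes" values from the runs above.
totals = [83873, 83266, 82910, 82059, 84633, 83191, 82472, 82667]
mean = sum(totals) / len(totals)
spread = max(abs(t - mean) / mean for t in totals)
print(f"max deviation from mean: {spread:.1%}")  # about 1.8%, under the 3% tolerance
```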

Is there a way to output just the "Simulation Report" to a separate file instead of stdout? That way I can also avoid lines like Number of OpenMP threads: 13 which I would have to mask out. If not, I can hack the source to remove that stanza.

Also, I diffed this script against the microcircuit.sli in your mainline. There are a lot of changes! I'm wondering, could you apply the same kind of changes to hpc_benchmark? Humbly speaking, it would be a lot faster for you to do it correctly than for me to learn it.

Finally, I have been using BrodyHopfield.sli and brunel_ps.sli as the "test" and "train" workloads which are meant to be very small and kind of small, respectively. Can you help print the same kind of simulation report at the end of those tests to facilitate verification? I tried copying the stanzas from microcircuit.sli, but I am not sure which variables these simulations know about. (alternatively I can scale down microcircuit.sli to also be the test workload, but I would like to have some more code coverage, and not have the "training" workload be exactly the same as the reference benchmark).

Thank you!

heplesser commented 1 week ago

@heshpdx I have just pushed a new version of microcircuit_spec.sli which writes the simulation report to microcircuit_spec.rpt. I will look at the other examples later this week. I would suggest not to use BrodyHopfield or brunel_ps. To get different computational loads than with microcircuit, I think hpc_benchmark makes sense (fewer synchronization points than microcircuit due to longer delays) and a model with plastic synapses (more complex memory access patterns due to need to inspect spike history for spike-time dependent plasticity). I will look for a suitable model.

heshpdx commented 1 week ago

I pulled your code and started playing around with the scripts. I also see hpc_benchmark and cuba_stdp have spec versions, thank you. I had to make this small change to hpc to avoid a syntax error.

diff --git a/examples/nest/hpc_benchmark_spec.sli b/examples/nest/hpc_benchmark_spec.sli
index 68b85e2b3..f2b7b23f7 100644
--- a/examples/nest/hpc_benchmark_spec.sli
+++ b/examples/nest/hpc_benchmark_spec.sli
@@ -342,15 +342,11 @@ RunSimulation
   (Scale            : ) <- scale <- endl
   (Network size     : ) <- ks /network_size get <- endl
   (Num connections  : ) <- ks /num_connections get <- endl
-  (Simulated time   : ) <- t_sim add <- ( ms) <- endl
+  (Simulated time   : ) <- t_sim <- ( ms) <- endl
   (Total spikes     : ) <- ks /local_spike_counter get <- endl
   (Average rate     : ) <- ks /local_spike_counter get cvd NE NI add cvd div t_sim div 1000 mul <- ( sp/s) <- endl

I'll try out some values for scale so that the simulation runtime fits within our time and memory budget!

heshpdx commented 1 week ago

I played around with microcircuit, cuba_stdp, and hpc_benchmark. The first two are resilient to various system changes and can verify within 3% delta. hpc_benchmark gives a drastically different answer between GCC and LLVM.

I ran the models with 80 threads and a Scale that requires about 15 GB of memory. About 3 minutes of runtime gives:

$ cat gcc-12/hpc_benchmark_spec.rpt
Scale            : 3.2
Network size     : 36001
Num connections  : 405036000
Simulated time   : 4000 ms
Total spikes     : 15322083
Average rate     : 106.403 sp/s
$  diff gcc-12/hpc_benchmark_spec.rpt llvm-16/hpc_benchmark_spec.rpt
5,6c5,6
< Total spikes     : 15322083
< Average rate     : 106.403 sp/s
---
> Total spikes     : 12307858
> Average rate     : 85.4712 sp/s

Do you have insight into why LLVM/clang is so different? We have gone through the source and removed all unstable algorithms and other sources of differences we know of. For example, we replaced std::sort with std::stable_sort, replaced the uniform_distribution/poisson_distribution calls with the LLVM versions as compiled code, and replaced all calls to std::rand with our own "specrand" generator. Usually this does the trick, but there is something else afoot. I would love to get hpc_benchmark working since you mentioned above it is the best one. But I am glad that we have cuba_stdp and microcircuit working at least.

heshpdx commented 1 week ago

Some more datapoints: the first from gcc-12 with -Ofast -fno-finite-math-only, and the second from the same gcc-12 binary, both running 40 threads. This is good because it tells me it isn't just a GCC vs Clang issue. It turns out both of these verify correctly with 80 threads, but when run with 40 threads there are fewer spikes. This points to a workload issue.

$ diff gcc-12/hpc_benchmark_spec.rpt gcc-12-Ofast/hpc_benchmark_spec.rpt
5,6c5,6
< Total spikes     : 15322083
< Average rate     : 106.403 sp/s
---
> Total spikes     : 9233948
> Average rate     : 64.1246 sp/s
$ diff gcc-12/hpc_benchmark_spec.rpt gcc-12-40t/hpc_benchmark_spec.rpt
5,6c5,6
< Total spikes     : 15322083
< Average rate     : 106.403 sp/s
---
> Total spikes     : 13967592
> Average rate     : 96.9972 sp/s
heplesser commented 1 week ago

Due to the plasticity in the network, simulations of hpc_benchmark can take very different trajectories, especially if simulated for more than 300 ms. This can lead to wildly differing firing rates, which generally are far higher than biologically plausible rates (around 10 sp/s).

To avoid this problem, I have just pushed a version of hpc_benchmark in which the learning rate is set to zero by default, so that synaptic weights remain constant. The simulation still goes through the mechanics of the spike-time dependent plasticity mechanism, but the dynamics will now stay much more stable. Simulation times should also come down noticeably, since firing rates remain reasonable. Furthermore, variations for different seeds/thread numbers/compilers should now be only a few percent.

The variation in rates which you observed between GCC and Clang and between different thread numbers is most likely due to different random number sequences generated (poisson random variates). To test whether differences you observe between different thread numbers/compilers are reasonable, you should run simulations with five different random seeds and note the variation in spike numbers you get. Changing the number of threads or the compiler should lead to variations in spike numbers comparable to what you see for different seeds.
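The acceptance procedure suggested above can be sketched as follows; the counts and the widening factor are illustrative, not measured data:

```python
# Treat the spike-count spread across several seeds as the natural
# variation, then accept a run with a different thread count or compiler
# only if its count falls within a (slightly widened) band of that spread.
def acceptance_band(seed_counts, widen=1.5):
    lo, hi = min(seed_counts), max(seed_counts)
    margin = widen * (hi - lo) / 2
    return lo - margin, hi + margin

def consistent(count, seed_counts):
    lo, hi = acceptance_band(seed_counts)
    return lo <= count <= hi
```

For example, if five seed runs gave counts spanning 100 to 104, a run producing 106 would be accepted while one producing 90 would be flagged.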

heshpdx commented 1 week ago

Great, thank you. I had increased the simulation time because it seemed that the majority of the runtime was in the setup phase when the network is being constructed (especially for larger Scale networks), and I wanted to make sure we are benchmarking the simulation as well.

heshpdx commented 1 week ago

The new hpc script works well. I had to increase tolerance from 3% to 4% to empirically allow results from 160-thread runs that looked ok otherwise. I did notice that the rng_seed value printed in the report is always 1. Maybe it is printing a bool instead of the value?

heplesser commented 1 week ago

4% tolerance sounds perfectly fine. I am surprised that you always get 1 for the rng_seed. In my tests, I see the seed value that I pass as seed=123 in the userargs.

heshpdx commented 1 week ago

My apologies, the seed issue is my own fault! I had changed the source last year when I ingested the code into the harness, to peg all runs with the same seed. Let me think if I want to open that up again, or keep it fixed for all time.

void
nest::RandomManager::set_status( const DictionaryDatum& d )
{
  nest_long_t rng_seed;
  bool rng_seed_updated = updateValue< nest_long_t >( d, names::rng_seed, rng_seed );

#ifdef SPEC
  // Peg all SPEC runs to the same seed.
  base_seed_ = 1;
#else
  if ( rng_seed_updated )
  {
    if ( not( 0 < rng_seed and rng_seed < ( 1L << 32 ) ) )
    {
      throw BadProperty( "RNG seed must be in (0, 2^32-1)." );
    }

    base_seed_ = static_cast< std::uint32_t >( rng_seed );
  }
#endif
}
heplesser commented 1 week ago

I see your point of fixing the seed for consistency in benchmarks. But NEST has a fixed default seed already, see

The advantage of a configurable seed is that you can check for the range of variation in simulation results for different random number sequences.

heshpdx commented 1 week ago

Yes, I agree. I reverted my change so as to gain the flexibility. Things are running well now. I will let you know how it goes!

heplesser commented 2 days ago

@heshpdx Could you contact me by email (hans.ekkehard.plesser@nmbu.no)?