neuronsimulator / nrn

NEURON Simulator
http://nrn.readthedocs.io

Hines/thread padding #2951

Open nrnhines opened 6 days ago

nrnhines commented 6 days ago

An experiment to see if performance improves when destructive (write) cache-line sharing is avoided (cache lines are assumed to be 64 bytes).
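
For readers less familiar with the issue, here is a minimal standalone C++ sketch (not NEURON code) of what destructive sharing looks like and how padding each thread's data to the assumed 64-byte cache-line size avoids it:

```cpp
// Minimal sketch (not NEURON code, assumes a 64-byte cache line):
// several threads each repeatedly write their own counter.  If the counters
// are packed together they share cache lines, so every write by one thread
// invalidates the line held by the others; alignas(64) gives each counter
// its own line.
#include <cstdint>
#include <thread>
#include <vector>

constexpr std::size_t cacheline = 64;  // assumed cache-line size in bytes

struct Unpadded {
    std::int64_t value{};  // adjacent counters share a 64-byte line
};

struct alignas(cacheline) Padded {
    std::int64_t value{};  // each counter occupies its own line
};

template <typename Counter>
void bump(std::vector<Counter>& counters, std::size_t iters) {
    std::vector<std::thread> workers;
    for (auto& c : counters) {
        workers.emplace_back([&c, iters] {
            for (std::size_t i = 0; i < iters; ++i) {
                ++c.value;  // heavy write traffic; false sharing if unpadded
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
}

int main() {
    std::vector<Unpadded> packed(8);  // all eight counters fit in one line
    std::vector<Padded> padded(8);    // one line per counter
    bump(packed, 1'000'000);          // typically much slower on multi-core
    bump(padded, 1'000'000);
}
```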

Using the model of #2787, the improvement is small or nonexistent on an Apple M1. Timing results for 8 threads are:

| nthread = 8 | 8.2.4 | master | padding (this PR) |
|-------------|-------|--------|-------------------|
| run         | 16.69 | 17.57  | 17.38             |
| solve       | 15.41 | 15.56  | 15.49             |

Perhaps there is a data handle overhead for plotting that could be addressed by caching.

azure-pipelines[bot] commented 6 days ago

✔️ 7492f82f4f73066b38c62245c113c203c4513cee -> Azure artifacts URL

bbpbuildbot commented 6 days ago

Logfiles from GitLab pipeline #219500 (:no_entry:) have been uploaded here!

Status and direct links:

sonarcloud[bot] commented 2 days ago

Quality Gate passed

Issues
- 33 New issues
- 0 Accepted issues

Measures
- 0 Security Hotspots
- 0.0% Coverage on New Code
- 0.0% Duplication on New Code

See analysis details on SonarCloud

azure-pipelines[bot] commented 2 days ago

✔️ 7ab6c4e3caeb78f28e0ab86c984088ff717d8ef8 -> Azure artifacts URL

bbpbuildbot commented 2 days ago

Logfiles from GitLab pipeline #220360 (:no_entry:) have been uploaded here!

Status and direct links:

mgeplf commented 2 days ago

> An experiment to see if performance improves when destructive (write) cache-line sharing is avoided (cache lines are assumed to be 64 bytes).

Do you have a sense of how often there is false sharing? I.e., how often are the shared cache lines written to? I'm not familiar enough with the code, but rather than padding, most of the false-sharing issues I've seen could be fixed by giving each thread its own data area and doing only a single write to the potentially shared cache lines once the work is done.
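
As an illustration of that alternative (purely hypothetical code, not taken from or proposed for the nrn sources), each thread accumulates into a private local and touches the potentially shared cache line exactly once at the end:

```cpp
// Hypothetical sketch of the "private accumulation, single final write"
// pattern; names and layout are illustrative only.
#include <cstddef>
#include <thread>
#include <vector>

void accumulate(const std::vector<double>& input,
                std::vector<double>& per_thread_sum,  // results may share cache lines
                std::size_t nthread) {
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < nthread; ++t) {
        workers.emplace_back([&, t] {
            double local = 0.0;  // thread-private; no sharing during the loop
            for (std::size_t i = t; i < input.size(); i += nthread) {
                local += input[i];
            }
            per_thread_sum[t] = local;  // the only write to a shared line
        });
    }
    for (auto& w : workers) {
        w.join();
    }
}

int main() {
    std::vector<double> input(1u << 20, 1.0);
    std::vector<double> sums(8);
    accumulate(input, sums, sums.size());
}
```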

nrnhines commented 1 day ago

> Do you have a sense of how often there is false sharing? I.e., how often are the shared cache lines written to?

Not really. We've been circling around the 8.2 vs 9.0 performance issue #2787 for quite a while and have made a lot of progress (mostly by caching pointers to doubles such as diam and area). With perf I got close to measuring false sharing but never got to the point where I could focus on a specific section of code. As you can see from this PR, the code changes for the padding experiment were quite simple, and I'm delighted that the SoA permutation support allowed it to be carried out.

> I'm not familiar enough with the code, but rather than padding, most of the false-sharing issues I've seen could be fixed by giving each thread its own data area and doing only a single write to the potentially shared cache lines once the work is done.

You are exactly correct. Ironically, that was how it was done in long-ago precursors to 8.2. In 9.0, threads amount to merely a permutation of the data into thread partitions, where (with this PR) each partition of each SoA variable begins on a cache-line boundary. It is actually quite beautiful to me to see the tremendous simplification of threads in 9.0 into a fairly trivial low-level permutation. The implementation of threads in 8.2 and earlier was a far-reaching and complex change involving a great deal of memory management and copying between representations.
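
To make the padding idea concrete, here is a rough sketch of the offset arithmetic only; the names are illustrative and do not correspond to the actual nrn SoA containers. Each thread's element count is rounded up so that the next thread's partition of each variable starts on a 64-byte boundary:

```cpp
// Illustrative sketch of padding partition boundaries to cache lines
// (hypothetical names; not the actual nrn SoA container code).
#include <cstddef>
#include <vector>

constexpr std::size_t cacheline_bytes = 64;

// Round an element count up so that the following partition starts on a
// cache-line boundary (e.g. up to the next multiple of 8 for doubles).
std::size_t pad_to_cacheline(std::size_t count, std::size_t elem_bytes) {
    std::size_t per_line = cacheline_bytes / elem_bytes;
    return (count + per_line - 1) / per_line * per_line;
}

// Given the number of elements owned by each thread, compute padded start
// offsets so every thread's partition of an SoA array is cache-line aligned
// (assuming the array itself starts on a cache-line boundary).
std::vector<std::size_t> partition_offsets(const std::vector<std::size_t>& counts,
                                           std::size_t elem_bytes) {
    std::vector<std::size_t> offsets(counts.size() + 1, 0);
    for (std::size_t t = 0; t < counts.size(); ++t) {
        offsets[t + 1] = offsets[t] + pad_to_cacheline(counts[t], elem_bytes);
    }
    return offsets;  // offsets[t] is where thread t's partition begins
}
```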

At the moment, even with this PR, 8.2 has a slight performance edge over 9.0 on the model from #2787 when built with gcc. That edge disappears on the Apple M1 and is much smaller with the Intel compiler on x86_64.

The bottom line so far is that destructive cache-line sharing does not appear to be a significant performance issue for #2787. But pinning that down with confidence requires a bit more testing.