Open nrnhines opened 6 days ago
✔️ 7492f82f4f73066b38c62245c113c203c4513cee -> Azure artifacts URL
Logfiles from GitLab pipeline #219500 (:no_entry:) have been uploaded here!
Status and direct links:
Issues
33 New issues
0 Accepted issues
Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code
✔️ 7ab6c4e3caeb78f28e0ab86c984088ff717d8ef8 -> Azure artifacts URL
Logfiles from GitLab pipeline #220360 (:no_entry:) have been uploaded here!
An experiment to see if performance improves when destructive (write) cache-line sharing is avoided (cache lines are assumed to be 64 bytes).
Do you have a sense of how often there is false sharing? I.e., how often are they written to? I'm not familiar enough with the code, but rather than padding, most of the false-sharing issues I've seen can be fixed by each thread having its own data area and only doing a single write to potentially shared cache lines once it's done the work.
> Do you have a sense of how often there is false sharing? I.e., how often are they written to?
Not really. We've been circling around the 8.2 vs 9.0 performance issue #2787 for quite a while and have made a lot of progress (mostly by caching pointers to doubles such as diam and area). With perf I got close to measuring false sharing but didn't get to the point where I could focus on a specific section of code. As you can see from this PR, the code changes for the padding experiment were quite simple, and I'm delighted that the SoA permutation support allowed it to be carried out.
> I'm not familiar enough with the code, but rather than padding, most of the false-sharing issues I've seen can be fixed by each thread having its own data area and only doing a single write to potentially shared cache lines once it's done the work.
You are exactly correct. Ironically, that is how it was done in the long-ago precursors to 8.2. In 9.0, threads amount to merely a permutation of the data into thread partitions, where (with this PR) each partition of each SoA variable begins on a cache-line boundary. It is actually quite beautiful to me to see the tremendous simplification of threads in 9.0 into a fairly trivial low-level permutation. The implementation of threads in 8.2 and before was a far-reaching and complex change involving a great deal of memory management and copying between representations.
At the moment, even with this PR, 8.2 has a slight performance edge over 9.0 on the model from #2787 with gcc. That edge disappears on the Apple M1 and is much smaller with the Intel compiler on x86_64.
The bottom line so far is that destructive cache-line sharing does not appear to be a significant performance issue for #2787. But pinning that down with confidence requires a bit more testing.
> An experiment to see if performance improves when destructive (write) cache-line sharing is avoided (cache lines are assumed to be 64 bytes).
Using the model from #2787, the improvement is small or nonexistent on an Apple M1. Timing results for 8 threads are:
Perhaps there is a data handle overhead for plotting that could be solved by caching.