CAS Atomic is not NUMA aware

SoilRos commented 11 months ago

I just measured my machine (EPYC Milan) and got results very similar to https://github.com/nviennot/core-to-core-latency#dual-amd-epyc-7r13-48-cores-milan-3rd-gen-2021-q1.

However, I was not happy with this asymmetric result in the dual-socket case, so I re-implemented the CAS benchmark (in C++, since I don't know Rust). I found out that the strange behavior is just an artifact of the atomic variable being stored in one NUMA domain (first-touch policy) but used in another. The solution is to create a new atomic variable on every new cycle, or to move the page containing the atomic to the NUMA domain of the ping/pong threads.
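A minimal sketch of the first option, assuming Linux with its default first-touch policy. This is not the actual reimplementation mentioned above; `pin_to_core` and `ping_pong_ns` are hypothetical names chosen for illustration. A fresh anonymous page is mapped for every measurement and first written by the ping thread after it has been pinned, so the kernel should back the page with memory local to the ping core.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <new>
#include <thread>
#include <pthread.h>
#include <sched.h>
#include <sys/mman.h>

// Hypothetical helper: best-effort pinning of the calling thread (Linux only).
inline bool pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

// Hypothetical helper: approximate one-way CAS latency in nanoseconds between
// two cores. Returns a negative value on invalid input or mapping failure.
inline double ping_pong_ns(int ping_core, int pong_core, int rounds) {
    if (rounds <= 0) return -1.0;
    constexpr std::size_t kPage = 4096;  // assumed page size
    // Anonymous mappings get physical memory lazily, on the first write.
    void* page = mmap(nullptr, kPage, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) return -1.0;

    std::atomic<std::atomic<int>*> flag_ptr{nullptr};
    std::chrono::steady_clock::duration elapsed{};

    std::thread ping([&] {
        pin_to_core(ping_core);
        // First touch happens here, after pinning, on the ping thread.
        auto* flag = new (page) std::atomic<int>(0);
        flag_ptr.store(flag, std::memory_order_release);
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < rounds; ++i) {
            int expected = 0;
            // Ping: wait for 0, then set 1. The yield only keeps this sketch
            // from stalling if both threads share a core; a real benchmark
            // would spin without yielding.
            while (!flag->compare_exchange_weak(expected, 1,
                                                std::memory_order_acq_rel,
                                                std::memory_order_relaxed)) {
                expected = 0;
                std::this_thread::yield();
            }
        }
        elapsed = std::chrono::steady_clock::now() - start;
    });

    std::thread pong([&] {
        pin_to_core(pong_core);
        std::atomic<int>* flag = nullptr;
        while ((flag = flag_ptr.load(std::memory_order_acquire)) == nullptr) {
            std::this_thread::yield();
        }
        for (int i = 0; i < rounds; ++i) {
            int expected = 1;
            // Pong: wait for 1, then set 0.
            while (!flag->compare_exchange_weak(expected, 0,
                                                std::memory_order_acq_rel,
                                                std::memory_order_relaxed)) {
                expected = 1;
                std::this_thread::yield();
            }
        }
    });

    ping.join();
    pong.join();
    munmap(page, kPage);

    double ns = std::chrono::duration<double, std::nano>(elapsed).count();
    return ns / (2.0 * rounds);  // roughly two one-way hops per round
}
```

Calling `ping_pong_ns(a, b, rounds)` for every core pair `(a, b)` would fill the latency matrix with the flag always local to core `a`. First touch is only the default policy; explicit placement (for example via libnuma) would be needed for stronger guarantees.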

nviennot commented 11 months ago

It's actually a feature, not a bug!

Granted, the location of the variable is completely arbitrary, so it would make sense to sample a bunch of locations.

But if we did sample all of them, how would we show the results in a meaningful way? I'm not sure how to do this correctly. We could easily lose the visual representation that the greater the distance from that variable, the higher the latency.

I'm open to suggestions. What do you think would make things better?

At the very least, we could provide an offset (or seed) to move the location of the variable around and see the effect.

SoilRos commented 11 months ago

A result that depends on how the OS gods woke up today seems like a bug to me ;-) For instance, my picture was blue in the second socket because the process seems to have started there. Tomorrow when I restart the machine or install another OS I may get a different picture.

Now, your alternative sounds interesting if one wanted to measure the latency to the farthest main memory. But the name of this repository implies that we want to measure how much time it takes a core A to send/receive a message to/from a core B. I would then (perhaps naively) assume that the message is owned by one of the two cores A or B, and not by another arbitrary core C (which in this case happened to have started the program in an arbitrary position).

Naturally, this is my point of view: I am used to fine-tuning the placement of variables in memory because it yields more performance. But I understand that on other OSs like Darwin (macOS) you cannot even do this, as they will force you to interleave memory or not even allow you to pin threads.

SoilRos commented 11 months ago

Dual Socket AMD EPYC 7713

Arbitrary placement

[Image: core-to-core-latency-milan(1)]

Core A (ping) placement

[Image: core-to-core-latency-milan(2)]
