Open bsless opened 3 years ago
Thanks. Interesting data. In retrospect, I am looking at alternative approaches for the numeric work example. Curious if there's a way to pin a thread to a CPU via affinity from the JVM. Also curious if invoking mapv is causing anything funny to happen.
> Also curious if invoking mapv is causing anything funny to happen
Wondering about that as well. The next item on my list was to wrap each task in a CompletableFuture and join on all of them at the end. The idea is to avoid a situation where an object is shared between multiple threads. Also, all the extra cache misses are from Thread parking.
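For reference, the CompletableFuture idea could look roughly like the sketch below. It is not the benchmark code from this thread: `JoinAll`, `numericWork`, and `runTasks` are hypothetical names, and the loop body is only a stand-in for the real numeric work. Each task touches nothing but its own locals, and the results are collected only after every future has completed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class JoinAll {
    // Hypothetical stand-in for the numeric work: arithmetic on locals only,
    // so no object is shared between worker threads.
    static long numericWork(long n) {
        long acc = 0;
        for (long i = 0; i < n; i++) {
            acc += (i * i) % 7;
        }
        return acc;
    }

    // Submit `tasks` independent futures, wait for all of them, then collect results.
    static List<Long> runTasks(int tasks, long n) {
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        try {
            List<CompletableFuture<Long>> futures = new ArrayList<>();
            for (int t = 0; t < tasks; t++) {
                futures.add(CompletableFuture.supplyAsync(() -> numericWork(n), pool));
            }
            // Block until every future has completed.
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
            List<Long> results = new ArrayList<>();
            for (CompletableFuture<Long> f : futures) {
                results.add(f.join()); // already complete, returns immediately
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runTasks(8, 1_000_000L));
    }
}
```

A dedicated fixed-size pool is used here so that the number of worker threads matches the number of tasks; whether that is the right choice for the benchmark is an open question.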
Yeah, during early discussions I went down the thread-local state rabbit hole (thread-local allocation buffers, TLAB, etc.). Could be another thing. I am not well versed enough in JVM internals to control the low-level thread locality details, but I am definitely interested.
Not necessarily related to TLAB. I was using this link as a reference: https://www.baeldung.com/java-false-sharing-contended
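To make the false-sharing concern concrete, here is a minimal sketch of one common workaround: giving each thread its own slot in an array, spaced far enough apart that two slots should not land on the same cache line. This is not taken from the linked article (which, as the URL suggests, is about `@Contended`). `StridedCounters` and `count` are hypothetical names, and the 64-byte cache line is an assumption about the hardware. The JIT may also keep the counter in a register, so this shows the layout idea rather than a reliable benchmark.

```java
final class StridedCounters {
    // Assumption: 64-byte cache lines, so 8 longs per line.
    static final int STRIDE = 8;

    // Each thread increments only its own slot; slots are STRIDE longs apart.
    static long[] count(int threads, long iterations) {
        long[] slots = new long[threads * STRIDE];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int idx = t * STRIDE;
            workers[t] = new Thread(() -> {
                for (long i = 0; i < iterations; i++) {
                    slots[idx]++;
                }
            });
            workers[t].start();
        }
        try {
            // join() makes each worker's writes visible to this thread.
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        // Compact the strided slots into one value per thread.
        long[] out = new long[threads];
        for (int t = 0; t < threads; t++) {
            out[t] = slots[t * STRIDE];
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(count(8, 1_000_000L)));
    }
}
```

Setting `STRIDE` to 1 packs all the slots next to each other, which gives a layout to compare against under perf.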
In trying to understand the differences in results, I narrowed it down to one specific case, numeric work with 1 task vs 8 tasks, profiled using perf:
We can easily see lower throughput, many more cache loads and more stalls. I hoped cache misses might explain the discrepancy but that doesn't seem to be the case. Anything you see here which might explain the results?