add a simple counter aligned to 64 B (the size of a cache line) and use it instead of consecutive 8 B uint64s that are touched by different threads
the impact is larger when the processing itself is fast; otherwise the bottleneck is somewhere else. still, not invalidating other threads' cache lines on every increment is nice
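a rough sketch of the idea, not the code from this change: it assumes C++17 (so over-aligned types are allocated correctly by std::vector) and a 64 B cache line, and the names PaddedCounter, kCacheLine and count_in_parallel are made up for illustration. each thread increments only its own counter, and because every counter occupies a full cache line, neighbouring counters never share one.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumption: 64 bytes is the cache-line size (typical on x86-64, not universal).
constexpr std::size_t kCacheLine = 64;

// Hypothetical type: alignas pads and aligns the struct to one full cache line,
// so two counters in an array can never sit on the same line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "counter should fill one cache line");
static_assert(alignof(PaddedCounter) == kCacheLine, "counter should start on a cache-line boundary");

// Each worker thread bumps only its own counter; the sum is returned after all joins.
std::uint64_t count_in_parallel(std::size_t threads, std::uint64_t iterations) {
    std::vector<PaddedCounter> counters(threads);  // C++17: uses aligned allocation

    std::vector<std::thread> workers;
    workers.reserve(threads);
    for (std::size_t t = 0; t < threads; ++t) {
        workers.emplace_back([&counters, t, iterations] {
            for (std::uint64_t i = 0; i < iterations; ++i) {
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }

    std::uint64_t total = 0;
    for (const auto& c : counters) {
        total += c.value.load(std::memory_order_relaxed);
    }
    return total;
}
```

with a plain array of 8 B uint64s, eight counters fit in a single 64 B line, so every increment by one thread would force the other threads to reload that line; the padded version trades 56 B per counter for avoiding that traffic.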