Can't go faster than ~2 million calls / s on a 2x12 core Xeon

There seems to be contention somewhere that limits the performance of the empty benchmark to about 1.5-2 million calls per second. This is probably not an issue in most real benchmarks, where Cassandra is a lot slower anyway. However, the performance bar for this tool is set very high, so this needs to be solved.

The issue doesn't seem to be visible on single-core processors, so I guess it could be caused by false sharing or shared atomic updates, both of which are inherently costly on multiprocessor machines.
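To make the suspicion concrete, here is a minimal sketch of the two patterns in question. It is not taken from the tool's code: `PaddedCounter`, `count_shared`, and `count_sharded` are hypothetical names, and the 128-byte alignment is an assumption chosen to cover typical 64-byte cache lines plus adjacent-line prefetching. The first function has every thread update one shared atomic; the second gives each thread its own padded counter and sums them at the end.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical per-thread counter, aligned (and therefore padded) to 128 bytes
/// so that two counters never land on the same cache line.
#[repr(align(128))]
#[derive(Default)]
struct PaddedCounter(AtomicU64);

/// Contended variant: all threads increment the same atomic, so the cache line
/// holding it has to move between cores on every update.
fn count_shared(threads: usize, calls_per_thread: u64) -> u64 {
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..calls_per_thread {
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}

/// Sharded variant: each thread increments only its own padded counter;
/// the shards are summed once after all threads have finished.
fn count_sharded(threads: usize, calls_per_thread: u64) -> u64 {
    let shards: Arc<Vec<PaddedCounter>> =
        Arc::new((0..threads).map(|_| PaddedCounter::default()).collect());
    let handles: Vec<_> = (0..threads)
        .map(|i| {
            let shards = Arc::clone(&shards);
            thread::spawn(move || {
                for _ in 0..calls_per_thread {
                    shards[i].0.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    shards.iter().map(|c| c.0.load(Ordering::Relaxed)).sum()
}
```

Both functions return the same total. Timing them with a large `calls_per_thread` on the multi-socket machine, for example with `std::time::Instant`, would show whether a shared atomic update alone is enough to explain a ceiling of this order.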
