google / benchmark

A microbenchmark support library
Apache License 2.0

[FR] ability to create artificial hotspots #1154

Open aerosayan opened 3 years ago

aerosayan commented 3 years ago

Intro

I was trying to measure the performance impact of a single cache miss while reading the middle element of an array containing 1,000,000 integers. According to benchmark's results it took 0.437 ns. However, I don't know for sure if the result is being cached on repeated iterations over s, since we're only accessing a single element and nothing else.

It would be bad if the data were already cached, making the benchmark meaningless.

This is the section of the code I was analyzing.

    for(auto _ : s)
    {
        int duh = x[500000]+x[500000];  // read the middle element twice
        benchmark::DoNotOptimize(duh);  // keep the read from being optimized away
    }
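
For context, the loop presumably sits inside a benchmark like the following; this is a hedged reconstruction, and the function name and vector setup are my assumptions, not the original code:

    #include <benchmark/benchmark.h>
    #include <vector>

    // Assumed setup: a 1,000,000-element vector, with only the middle
    // element read inside the timed loop.
    static void BM_ReadMiddleElement(benchmark::State& s)
    {
        std::vector<int> x(1000000, 1);
        for (auto _ : s)
        {
            int duh = x[500000] + x[500000];
            benchmark::DoNotOptimize(duh);
        }
    }
    BENCHMARK(BM_ReadMiddleElement);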

There is a real need for functions that trash the data cache and the instruction cache, and that cause other anomalies: methods that artificially trigger branch mispredictions, L1/L2/L3 cache misses, etc.

It would be very helpful to be able to simulate a noisy system, and to measure the performance of code in its first few cycles, when the cache isn't hot yet. This is an important case for many industries.

Such conditions occur in very low latency applications like High Frequency Trading, game engine development, rendering engine development, and numerical simulation codes. Being able to create such artificial hotspots with a single function or macro call would be extremely helpful.

Describe the solution you'd like

There could be other artificial hotspots that interest people in other domains, but the few above would suit the needs of those interested in numerical computations, and many others.

Describe alternatives you've considered

I've artificially created and used hotspots on my own, but that slows down the fast, ready-to-go nature of using benchmark.

I'm already able to do all of this with custom code, but it would be helpful if at least some of it were available in benchmark itself, as something as simple as benchmark::TrashL1Cache() or benchmark::BranchMispredictionEveryNIterations(1234), where 1234 is the number of iterations after which a branch misprediction would happen.

Some of these, like branch misprediction, are not difficult to code at all, but it would be helpful if they were available in stock benchmark, just a function call away.
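
As a rough illustration (not an existing benchmark API; the name and the whole approach are my assumptions), such a helper could take a random branch on every n-th call, which the predictor cannot learn. One caveat: compilers sometimes lower small unpredictable branches to conditional moves, so a real implementation would need to check the generated code:

    #include <benchmark/benchmark.h>
    #include <cstdint>
    #include <random>

    // Hypothetical helper: every n-th call takes a 50/50 random branch,
    // which is mispredicted roughly half of the time by construction.
    inline void BranchMispredictionEveryNIterations(std::int64_t n)
    {
        static thread_local std::int64_t count = 0;
        static thread_local std::mt19937 rng{12345};
        if (++count % n != 0) return;
        int sink;
        if (rng() & 1u)  // direction is random, so the predictor can't learn it
            sink = 1;
        else
            sink = 3;
        benchmark::DoNotOptimize(sink);
    }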

Thanks

dmah42 commented 3 years ago

these sound pretty good to me. are there situations where you think we wouldn't want to call these between benchmark runs?

also, have you tried the new --benchmark_enable_random_interleaving flag? it tries to run the benchmarks in the same executable in a way that would disrupt any caching.

aerosayan commented 3 years ago

Hi Dominic,

I didn't know about the random interleaving flag; I will try it out. The 0.437 ns CPU time I got seems very suspicious to me. Most likely the result is misleading because the data was already cached.

The assembly generated for duh = ... is a MOV instruction that loads the data into the EAX register, followed by an ADD of EAX with itself. So the code wasn't optimized away completely, but something else happened; a sub-nanosecond load is consistent with an L1 hit on every iteration after the first.

Regarding when we wouldn't want to call these between benchmark runs: I will have to think about cases from a numerical computation perspective and let you know later. However, my use cases will definitely not be the same as those of someone working in High Frequency Trading. Allowing the user to selectively activate/deactivate certain hotspots would probably be the most general solution, giving full control to the user without holding them back.

Currently I can't tell how it might best be implemented so that users are satisfied with it.

There are some small steps that could be taken, and hopefully more community members are interested in this feature and can help come up with ways to make the artificial hotspots more user friendly and useful.

Thanks

aerosayan commented 3 years ago

Also, it would be necessary to deactivate the performance measurement timers while the artificial hotspots are running.

    TrashL1Cache()
    {
        disable_timer();

        // code to create hotspot ...

        enable_timer();
    }

    for(auto _ : s)
    {
        benchmark::TrashL1Cache();
        code_i_want_to_measure = x[500000]+x[500000];
    }
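
For what it's worth, something close to this is already expressible with benchmark's existing state.PauseTiming()/state.ResumeTiming() API. The sketch below is an assumption-laden approximation (the 64 MiB buffer size is a guess at "larger than L3"), and the pause/resume overhead is large relative to a single load, so the resulting numbers still need care:

    #include <benchmark/benchmark.h>
    #include <cstddef>
    #include <vector>

    static void BM_ReadMiddleElementColdCache(benchmark::State& s)
    {
        std::vector<int> x(1000000, 1);
        // Assumed to be larger than the whole cache hierarchy on most CPUs.
        std::vector<char> trash(64 * 1024 * 1024);
        for (auto _ : s)
        {
            s.PauseTiming();
            // Touch one byte per 64-byte cache line to evict x from the caches.
            for (std::size_t i = 0; i < trash.size(); i += 64)
                benchmark::DoNotOptimize(trash[i] += 1);
            s.ResumeTiming();
            int duh = x[500000] + x[500000];
            benchmark::DoNotOptimize(duh);
        }
    }
    BENCHMARK(BM_ReadMiddleElementColdCache);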

Thanks

aerosayan commented 3 years ago

For the developers who might implement features to trash the caches:

Due to the "synchronization" that happens between the different levels of caches, it might be difficult to trash only the L1 cache without also trashing the L2 and L3 caches. This applies to trashing any level of cache, but I'm explaining it only with respect to the L1 cache.

A naive implementation might create a large array and read from it directly. This would definitely trash the L1 cache, but it would most likely also trash portions of the L2 and L3 caches, since the data has to travel from RAM to the L3 cache, to the L2 cache, and finally to the L1 cache. When trashing the L1 cache this might not be an issue, as the L2 and L3 caches are significantly larger, and the user may be okay with it.

However, many users may want to pollute only the L1 cache.

In such cases, a more advanced and correct implementation would ask the user to pass a pointer to data that is most likely already present in the L2 cache, and use it to pollute the L1 cache. This would ensure that only the L1 cache is polluted, while the data in the L2 and L3 caches remains the same.
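
For illustration, a hedged sketch of that idea (the function name matches the example below, but the exact signature, the 64-byte line size, and the ~128 KiB buffer reasoning are all my assumptions, not an existing benchmark API):

    #include <benchmark/benchmark.h>
    #include <cstddef>

    // Hypothetical implementation: the caller passes a buffer believed to be
    // resident in L2 (e.g. ~128 KiB: larger than a typical 32 KiB L1d, smaller
    // than L2). Streaming through it pulls lines from L2 into L1d, evicting
    // the old L1d contents while leaving L2/L3 largely untouched.
    inline void TrashL1CacheUsingDataInL2Cache(const char* data, std::size_t size)
    {
        long sink = 0;
        for (std::size_t i = 0; i < size; i += 64)  // one read per cache line
            sink += data[i];
        benchmark::DoNotOptimize(sink);
    }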

We need community input to come up with consistent ways to do these things.

Something like:

    for(auto _ : s)
    {
        for(int i=0; i<n; ++i)
        {
            benchmark::TrashL1CacheUsingDataInL2Cache(&data_probably_in_l2_cache[0], i);
            int x = my_test();
            benchmark::DoNotOptimize(x);  // keep my_test()'s result live
        }
    }

As you can see, this becomes complicated, and we need more community help to do this consistently.

Thanks

dmah42 commented 3 years ago

if it's a runtime call, as you've suggested, why would it need to be a compile-time flag? you could just create two benchmarks: one calls the new method and one doesn't.

aerosayan commented 3 years ago

@dominichamon I agree. That's a much more effective solution.

HFTrader commented 1 year ago

@aerosayan You could start with a small array of ints, say 4 elements, and create a sequence of indices that lets you iterate randomly through the entire array. As the array size grows, the cache misses will grow as well. If you grow the array size exponentially, you will be able to pinpoint the L1/L2/L3 cache boundaries and their latencies. I have a test for this on my blog.
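
For readers who want to reproduce this, here is a minimal sketch of that experiment under stated assumptions: benchmark's Range supplies the exponentially growing sizes, and Sattolo's algorithm builds a single random cycle so the chase visits every element before repeating. The latency per access steps up when the array no longer fits in L1, L2, or L3:

    #include <benchmark/benchmark.h>
    #include <cstddef>
    #include <numeric>
    #include <random>
    #include <vector>

    static void BM_RandomChase(benchmark::State& s)
    {
        const std::size_t n = static_cast<std::size_t>(s.range(0));
        std::vector<int> next(n);
        std::iota(next.begin(), next.end(), 0);
        // Sattolo's algorithm: a random permutation that is one single cycle.
        std::mt19937 rng{42};
        for (std::size_t k = n - 1; k > 0; --k)
        {
            std::uniform_int_distribution<std::size_t> d(0, k - 1);
            std::swap(next[k], next[d(rng)]);
        }
        int i = 0;
        for (auto _ : s)
        {
            i = next[i];  // each load depends on the previous one
            benchmark::DoNotOptimize(i);
        }
    }
    // 4 ints (16 B) up to 16M ints (64 MiB): latency steps mark L1/L2/L3.
    BENCHMARK(BM_RandomChase)->RangeMultiplier(4)->Range(4, 1 << 24);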