google / benchmark

A microbenchmark support library
Apache License 2.0

[FR] ability to create artificial hotspots #1154

Open aerosayan opened 3 years ago

aerosayan commented 3 years ago

Intro

I was trying to measure the performance impact of a single cache miss while reading the middle element of an array containing 1,000,000 integers. According to benchmark's results it took 0.437 ns. However, I don't know for sure if the result is being cached on repeated iterations over s, since we're only accessing a single element and nothing else.

It would be bad if the data were already cached, making the benchmark meaningless.

This is the section of the code I was analyzing.

    for(auto _ : s)
    {
        int duh = x[500000]+x[500000];  // read the middle element twice
        benchmark::DoNotOptimize(duh);  // keep the read from being optimized away
    }
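
For context, the loop presumably sits inside a benchmark like the following; this is a hedged reconstruction, and the function name and vector setup are my assumptions, not the original code:

    #include <benchmark/benchmark.h>
    #include <vector>

    // Assumed setup: a 1,000,000-element vector, with only the middle
    // element read inside the timed loop.
    static void BM_ReadMiddleElement(benchmark::State& s)
    {
        std::vector<int> x(1000000, 1);
        for (auto _ : s)
        {
            int duh = x[500000] + x[500000];
            benchmark::DoNotOptimize(duh);
        }
    }
    BENCHMARK(BM_ReadMiddleElement);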

There is a real need for functions that trash the data cache and the instruction cache, and that cause other anomalies: methods that artificially trigger branch mispredictions, L1/L2/L3 cache misses, etc.

It would be very helpful to be able to simulate a noisy system, and to measure the performance of code in its first few cycles, when the cache isn't hot yet. This is an important case for many industries.

Such conditions occur in very low latency applications like High Frequency Trading, game engine development, rendering engine development, and numerical simulation codes. Being able to create such artificial hotspots with a single function or macro call would be extremely helpful.

Describe the solution you'd like

There could be other artificial hotspots that interest people in other domains, but the few above would suit the needs of those interested in numerical computations, and many others.

Describe alternatives you've considered

I've artificially created and used hotspots on my own, but that slows down the fast, ready-to-go nature of using benchmark.

I'm already able to do all of this with custom code, but it would be helpful if at least some of it were available in benchmark itself, as something as simple as benchmark::TrashL1Cache() or benchmark::BranchMispredictionEveryNIterations(1234), where 1234 is the number of iterations after which a branch misprediction would happen.

Some of these, like branch misprediction, are not difficult to code at all, but it would be helpful if they were available in stock benchmark, just a function call away.
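
As a rough illustration (not an existing benchmark API; the name and the whole approach are my assumptions), such a helper could take a random branch on every n-th call, which the predictor cannot learn. One caveat: compilers sometimes lower small unpredictable branches to conditional moves, so a real implementation would need to check the generated code:

    #include <benchmark/benchmark.h>
    #include <cstdint>
    #include <random>

    // Hypothetical helper: every n-th call takes a 50/50 random branch,
    // which is mispredicted roughly half of the time by construction.
    inline void BranchMispredictionEveryNIterations(std::int64_t n)
    {
        static thread_local std::int64_t count = 0;
        static thread_local std::mt19937 rng{12345};
        if (++count % n != 0) return;
        int sink;
        if (rng() & 1u)  // direction is random, so the predictor can't learn it
            sink = 1;
        else
            sink = 3;
        benchmark::DoNotOptimize(sink);
    }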

Thanks

dmah42 commented 3 years ago

these sound pretty good to me. are there situations where you think we wouldn't want to call these between benchmark runs?

also, have you tried the new --benchmark_enable_random_interleaving flag? it tries to run the benchmarks in the same executable in a way that would disrupt any caching.

aerosayan commented 3 years ago

Hi Dominic,

I didn't know about the random interleaving flag; I will try it out. The 0.437 ns CPU time I got seems very suspicious to me. Most likely the result is misleading because the data was already cached.

The assembly generated for duh = ... is a MOV instruction that loads the data into the EAX register, followed by an ADD of EAX with itself. So the code wasn't optimized away completely, but something else happened; a sub-nanosecond load is consistent with an L1 hit on every iteration after the first.

Regarding when we wouldn't want to call these between benchmark runs: I will have to think about cases from a numerical computation perspective and let you know later. However, my use cases will definitely not be the same as those of someone working in High Frequency Trading. Allowing the user to selectively activate/deactivate certain hotspots would probably be the most general solution, giving full control to the user without holding them back.

Currently I can't tell how it might best be implemented so that users are satisfied with it.

There are some small steps that could be taken, and hopefully more community members are interested in this feature and can help come up with ways to make the artificial hotspots more user friendly and useful.

Thanks

aerosayan commented 3 years ago

Also, it would be necessary to deactivate the performance measurement timers while the artificial hotspots are running.

    TrashL1Cache()
    {
        disable_timer();

        // code to create hotspot ...

        enable_timer();
    }

    for(auto _ : s)
    {
        benchmark::TrashL1Cache();
        code_i_want_to_measure = x[500000]+x[500000];
    }
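
For what it's worth, something close to this is already expressible with benchmark's existing state.PauseTiming()/state.ResumeTiming() API. The sketch below is an assumption-laden approximation (the 64 MiB buffer size is a guess at "larger than L3"), and the pause/resume overhead is large relative to a single load, so the resulting numbers still need care:

    #include <benchmark/benchmark.h>
    #include <cstddef>
    #include <vector>

    static void BM_ReadMiddleElementColdCache(benchmark::State& s)
    {
        std::vector<int> x(1000000, 1);
        // Assumed to be larger than the whole cache hierarchy on most CPUs.
        std::vector<char> trash(64 * 1024 * 1024);
        for (auto _ : s)
        {
            s.PauseTiming();
            // Touch one byte per 64-byte cache line to evict x from the caches.
            for (std::size_t i = 0; i < trash.size(); i += 64)
                benchmark::DoNotOptimize(trash[i] += 1);
            s.ResumeTiming();
            int duh = x[500000] + x[500000];
            benchmark::DoNotOptimize(duh);
        }
    }
    BENCHMARK(BM_ReadMiddleElementColdCache);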

Thanks

aerosayan commented 3 years ago

For the developers who might implement features to trash the caches:

Due to the "synchronization" that happens between the different levels of caches, it might be difficult to trash only the L1 cache without also trashing the L2 and L3 caches. This applies to trashing any level of cache, but I'm explaining it only with respect to the L1 cache.

A naive implementation might create a large array and read from it directly. This would definitely trash the L1 cache, but it would most likely also trash portions of the L2 and L3 caches, since the data has to travel from RAM to the L3 cache, to the L2 cache, and finally to the L1 cache. When trashing the L1 cache this might not be an issue, as the L2 and L3 caches are significantly larger, and the user may be okay with it.

However, many users may want to pollute only the L1 cache.

In such cases, a more advanced and correct implementation would ask the user to pass a pointer to data that is most likely already present in the L2 cache, and use it to pollute the L1 cache. This would ensure that only the L1 cache is polluted, while the data in the L2 and L3 caches remains the same.
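
For illustration, a hedged sketch of that idea (the function name matches the example below, but the exact signature, the 64-byte line size, and the ~128 KiB buffer reasoning are all my assumptions, not an existing benchmark API):

    #include <benchmark/benchmark.h>
    #include <cstddef>

    // Hypothetical implementation: the caller passes a buffer believed to be
    // resident in L2 (e.g. ~128 KiB: larger than a typical 32 KiB L1d, smaller
    // than L2). Streaming through it pulls lines from L2 into L1d, evicting
    // the old L1d contents while leaving L2/L3 largely untouched.
    inline void TrashL1CacheUsingDataInL2Cache(const char* data, std::size_t size)
    {
        long sink = 0;
        for (std::size_t i = 0; i < size; i += 64)  // one read per cache line
            sink += data[i];
        benchmark::DoNotOptimize(sink);
    }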

We need community input to come up with consistent ways to do these things.

Something like:

    for(auto _ : s)
    {
        for(int i=0; i<n; ++i)
        {
            benchmark::TrashL1CacheUsingDataInL2Cache(&data_probably_in_l2_cache[0], i);
            int x = my_test();
            benchmark::DoNotOptimize(x);  // keep my_test()'s result live
        }
    }

As you can see, this becomes complicated, and we need more community help to do this consistently.

Thanks

dmah42 commented 3 years ago

if it's a runtime call, as you've suggested, why would it need to be a compile-time flag? you could just create two benchmarks: one calls the new method and one doesn't.

aerosayan commented 3 years ago

@dominichamon I agree. That's a much more effective solution.

HFTrader commented 1 year ago

@aerosayan You could start with a small array of ints, say 4 elements, and create a sequence of indices that lets you iterate randomly through the entire array. As the array size grows, the cache misses will grow as well. If you grow the array size exponentially, you will be able to pinpoint the L1/L2/L3 cache boundaries and their latencies. I have a test for this on my blog.
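
For readers who want to reproduce this, here is a minimal sketch of that experiment under stated assumptions: benchmark's Range supplies the exponentially growing sizes, and Sattolo's algorithm builds a single random cycle so the chase visits every element before repeating. The latency per access steps up when the array no longer fits in L1, L2, or L3:

    #include <benchmark/benchmark.h>
    #include <cstddef>
    #include <numeric>
    #include <random>
    #include <vector>

    static void BM_RandomChase(benchmark::State& s)
    {
        const std::size_t n = static_cast<std::size_t>(s.range(0));
        std::vector<int> next(n);
        std::iota(next.begin(), next.end(), 0);
        // Sattolo's algorithm: a random permutation that is one single cycle.
        std::mt19937 rng{42};
        for (std::size_t k = n - 1; k > 0; --k)
        {
            std::uniform_int_distribution<std::size_t> d(0, k - 1);
            std::swap(next[k], next[d(rng)]);
        }
        int i = 0;
        for (auto _ : s)
        {
            i = next[i];  // each load depends on the previous one
            benchmark::DoNotOptimize(i);
        }
    }
    // 4 ints (16 B) up to 16M ints (64 MiB): latency steps mark L1/L2/L3.
    BENCHMARK(BM_RandomChase)->RangeMultiplier(4)->Range(4, 1 << 24);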