google / benchmark

A microbenchmark support library
Apache License 2.0
9.02k stars 1.63k forks source link

Growing virtual memory usage on multithreaded benchmark #806

Open jmftrindade opened 5 years ago

jmftrindade commented 5 years ago

Hi, there! It looks like on multithreaded benchmarks virtual memory keeps growing and is never freed. Valgrind only shows a leak of 240 bytes due to https://github.com/google/benchmark/issues/782 so not sure why virtual mem only keeps growing. Also, I only see this on multithreaded benchmarks.

For example, on this bench below, which just adds numbers to a vector, virtual mem quickly gets to upwards of 60GB and never seems to get freed in between benchmark runs:

#include <vector>

#include "benchmark/benchmark.h"

namespace bench {

const int kBMRangeMultiplier = 10;
const int kBMRangeLower = 1000;
const int kBMRangeUpper = 500000000;

static void BM_MemGrowthMultithreaded(benchmark::State& state) {
  std::vector<uint64_t> numbers;
  numbers.reserve(state.range(0));
  for (int i = 0; i < state.range(0); ++i) {
    numbers.push_back(i);
  }

  for (auto _ : state) {
    size_t size = state.range(0);
    std::vector<uint64_t> results;
    results.reserve(size);

    // It doesn't matter what we're adding, just creating some load.
    for (size_t i = 0; i < size; i++) {
      results.push_back(i);
    }

    state.counters["matches"] =
        benchmark::Counter(results.size(), benchmark::Counter::kAvgThreads);
  }

  state.SetComplexityN(state.range(0));
}
// Single-threaded.
BENCHMARK(BM_MemGrowthMultithreaded)
    ->Threads(1)
    ->UseRealTime()
    ->RangeMultiplier(kBMRangeMultiplier)
    ->Range(kBMRangeLower, kBMRangeUpper)
    ->Unit(benchmark::kMillisecond)
    ->Complexity(benchmark::oN);

// 2 threads.
BENCHMARK(BM_MemGrowthMultithreaded)
    ->Threads(2)
    ->UseRealTime()
    ->RangeMultiplier(kBMRangeMultiplier)
    ->Range(kBMRangeLower, kBMRangeUpper)
    ->Unit(benchmark::kMillisecond)
    ->Complexity(benchmark::oN);

// 4 threads.
BENCHMARK(BM_MemGrowthMultithreaded)
    ->Threads(4)
    ->UseRealTime()
    ->RangeMultiplier(kBMRangeMultiplier)
    ->Range(kBMRangeLower, kBMRangeUpper)
    ->Unit(benchmark::kMillisecond)
    ->Complexity(benchmark::oN);

// 8 threads.
BENCHMARK(BM_MemGrowthMultithreaded)
    ->Threads(8)
    ->UseRealTime()
    ->RangeMultiplier(kBMRangeMultiplier)
    ->Range(kBMRangeLower, kBMRangeUpper)
    ->Unit(benchmark::kMillisecond)
    ->Complexity(benchmark::oN);

}  // namespace bench

BENCHMARK_MAIN();

Is this expected behavior? Thanks! --Joana

LebedevRI commented 5 years ago

I doubt this is in any way benchmark-library-specific. Does the actual memory get freed, or is it too only growing?

dmah42 commented 5 years ago

I suspect your problem is that you're adding the user-defined counter (state.counters) inside the state loop. This will be adding thousands to millions of entries into the map.

If you see the example tests https://github.com/google/benchmark/blob/478eafa36bb8763e04c61d88bb2b8e9fa3440b82/test/user_counters_test.cc#L25 that use user counters, they set them outside the state loop.

Dominic Hamon | Google There are no bad ideas; only good ideas that go horribly wrong.

On Mon, Apr 29, 2019 at 4:32 PM Roman Lebedev notifications@github.com wrote:

I doubt this is in any way benchmark-library-specific. Does the actual memory get freed, or is it too only growing?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/google/benchmark/issues/806#issuecomment-487627675, or mute the thread https://github.com/notifications/unsubscribe-auth/AAD4QMTBQTMDEJNLOXI5MALPS4IHDANCNFSM4HI2BZIQ .

LebedevRI commented 5 years ago

Oh, i missed that, it indeed looks very suspicious.

jmftrindade commented 5 years ago

Thanks for looking into this. Memory growth behavior ends up being pretty much the same, however, whether counters are set outside the loop (or even disabled) or not.

Moving the setting of the counter outside of the state loop is ok in our case -- since it is tracking a deterministic measure that doesn't change across different runs for the same range -- so I did that for our actual benchmarks. I imagine there could be valid use cases for updating a user counter within the state loop though. For example, when the metric of interest changes across iterations (in one of our benchmarks we have a non-deterministic result that we like to correlate with wall time) and the user would like to keep track of a summary stat (e.g., avg) for values across iterations of the same benchmark range. Are there other ways to have user counters keep track of values across iterations for the same value of state.range(0), aside from setting the user counter within the loop?

Also, since state.counters is a map<std::string,Counter>, updating the same key with a different value more frequently should keep memory usage roughly the same (because it's just creating more instances of Counter's constructor, but overriding the same key in state.counters). It's an oversimplification, but in the snippet below, mem usage is the same regardless of upper bound on the for loop:

#include <map>
#include "benchmark/benchmark.h"

int main() {
  std::map<std::string, benchmark::Counter> m;
  size_t upper = 1000000000;

  for (size_t i = 0; i < upper; i++) {
    m["counter_name"] = benchmark::Counter(i);
  }
}

As I reported above, I only see this high mem usage in our multithreaded benchmarks. And the only change that seems to reduce memory footprint is to reduce the number of threads. In fact, if you take the original snippet and change Threads(2), Threads(4), and Threads(8) all to Threads(1), virt mem stays the same at 9.37G for the entire run of the 4 benchmarks.

For multithreaded, virt mem grows by over 7G per additional thread (9.37G with 1 thread, 18.3G for 2, 32.8G for 4, and 61.1G for 8), and peak actual mem usage is not too much smaller than virtual. Could it be because each additional benchmark thread registers a separate benchmark instance? And does that imply that each thread has an entire copy of the benchmark? That would also explain why the user counter has the sum of values for all threads unless set as Counter::kAvgThreads, and why mem footprint increases so much as a function of the number of threads.

BTW, only virtual memory doesn't get freed, actual memory does. I'm running this on an Ubuntu box with only 64G of RAM, and fairly conservative swappiness setting, hence why virtual memory is not getting freed. I was still puzzled at what I thought was thread local storage was so high, but it sounds like multithreaded benchmarks are entire runs of the same benchmark, each run as a unit in parallel. Is that an accurate description?

LebedevRI commented 5 years ago

Are there other ways to have user counters keep track of values across iterations for the same value of state.range(0), aside from setting the user counter within the loop?

You should consider setting value outside of the for (auto _ : state), so it does not affect the measurements.

Could it be because each additional benchmark thread registers a separate benchmark instance? And does that imply that each thread has an entire copy of the benchmark? That would also explain why the user counter has the sum of values for all threads unless set as Counter::kAvgThreads, and why mem footprint increases so much as a function of the number of threads. I was still puzzled at what I thought was thread local storage was so high, but it sounds like multithreaded benchmarks are entire runs of the same benchmark, each run as a unit in parallel. Is that an accurate description?

Yes, every thread is unique; that is as much as i hoped is stated in the docs:

https://github.com/google/benchmark#multithreaded-benchmarks

Multithreaded Benchmarks
In a multithreaded test (benchmark invoked by multiple threads simultaneously),
it is guaranteed that none of the threads will start until all have reached the start of
the benchmark loop, and all will have finished before any thread exits the benchmark loop.
(This behavior is also provided by the KeepRunning() API) 
jmftrindade commented 5 years ago

Ok, thanks. Now I also understand why with UseRealTime I get smaller runtimes as I increase the number of threads -- even though these threads are not sharing any benchmark state, the "real time" is just total CPU time over all separate runs of the benchmark, divided by number of threads.

I was pleasantly surprised that I didn't have to add my own thread synchronization, and was getting a beautiful linear scaling with number of threads for the code we're benchmarking, but now I get why :-)

What's the primary use case for these "multithreaded" benchmarks? The name is a bit deceiving, as it implies a single benchmark being run by multiple threads sharing state, and likely partitioning the number of iterations amongst themselves evenly.

LebedevRI commented 5 years ago

Now I also understand why with UseRealTime I get smaller runtimes as I increase the number of threads -- even though these threads are not sharing any benchmark state, the "real time" is just total CPU time over all separate runs of the benchmark, divided by number of threads.

I'm not sure that assessment is accurate, but that being said i'm not sure about which timer you are talking about.

There are two timers - Time and CPU. Time is wall time, CPU is time of the main thread only. By default CPU timer is used to decide for how long to run the benchmark. UseRealTime tells it to use wall clock to decide for how long to run. Thread count shouldn't change the wall clock (Time), not sure about CPU. References: #671, #769, #763.

What's the primary use case for these "multithreaded" benchmarks?

Running the exact same code in parallel, and seeing how it scales with the number of threads, i guess? I personally have never used it.

jmftrindade commented 5 years ago

Now I also understand why with UseRealTime I get smaller runtimes as I increase the number of threads -- even though these threads are not sharing any benchmark state, the "real time" is just total CPU time over all separate runs of the benchmark, divided by number of threads.

I'm not sure that assessment is accurate, but that being said i'm not sure about which timer you are talking about.

There are two timers - Time and CPU. Time is wall time, CPU is time of the main thread only.

Ok, what I'm saying is that the reported Time equals CPU divided by number of threads because that's what I'm seeing in the results for the toy multithreaded bench I posted above, which uses real time (UseRealTime()). That is, the reported Time is always matching CPU time divided by the number of threads (e.g., 598 ms = 1196 ms / 2, 333 = 1332 / 4, etc), as we can see below:

-----------------------------------------------------------------------------------------------
Benchmark                                                        Time           CPU Iterations
-----------------------------------------------------------------------------------------------
BM_MemGrowthMultithreaded/1000/real_time/threads:1               0 ms          0 ms    1229603
BM_MemGrowthMultithreaded/10000/real_time/threads:1              0 ms          0 ms      27941
BM_MemGrowthMultithreaded/100000/real_time/threads:1             0 ms          0 ms       3453
BM_MemGrowthMultithreaded/1000000/real_time/threads:1            2 ms          2 ms        326
BM_MemGrowthMultithreaded/10000000/real_time/threads:1          24 ms         24 ms         30
BM_MemGrowthMultithreaded/100000000/real_time/threads:1        227 ms        227 ms          3
BM_MemGrowthMultithreaded/500000000/real_time/threads:1       1096 ms       1096 ms          1
BM_MemGrowthMultithreaded_BigO                                2.19 N       2.19 N
BM_MemGrowthMultithreaded_RMS                                    2 %          2 %
BM_MemGrowthMultithreaded/1000/real_time/threads:2               0 ms          0 ms    2557944
BM_MemGrowthMultithreaded/10000/real_time/threads:2              0 ms          0 ms      87592
BM_MemGrowthMultithreaded/100000/real_time/threads:2             0 ms          0 ms       6408
BM_MemGrowthMultithreaded/1000000/real_time/threads:2            1 ms          3 ms        556
BM_MemGrowthMultithreaded/10000000/real_time/threads:2          13 ms         26 ms         56
BM_MemGrowthMultithreaded/100000000/real_time/threads:2        124 ms        249 ms          6
BM_MemGrowthMultithreaded/500000000/real_time/threads:2        598 ms       1196 ms          2
BM_MemGrowthMultithreaded_BigO                                0.60 N       1.20 N
BM_MemGrowthMultithreaded_RMS                                    2 %          2 %
BM_MemGrowthMultithreaded/1000/real_time/threads:4               0 ms          0 ms    4918924
BM_MemGrowthMultithreaded/10000/real_time/threads:4              0 ms          0 ms      83564
BM_MemGrowthMultithreaded/100000/real_time/threads:4             0 ms          0 ms      11708
BM_MemGrowthMultithreaded/1000000/real_time/threads:4            1 ms          3 ms        976
BM_MemGrowthMultithreaded/10000000/real_time/threads:4           7 ms         29 ms        100
BM_MemGrowthMultithreaded/100000000/real_time/threads:4         68 ms        272 ms          8
BM_MemGrowthMultithreaded/500000000/real_time/threads:4        333 ms       1332 ms          4
BM_MemGrowthMultithreaded_BigO                                0.17 N       0.67 N
BM_MemGrowthMultithreaded_RMS                                    1 %          1 %
BM_MemGrowthMultithreaded/1000/real_time/threads:8               0 ms          0 ms    4990984
BM_MemGrowthMultithreaded/10000/real_time/threads:8              0 ms          0 ms      85288
BM_MemGrowthMultithreaded/100000/real_time/threads:8             0 ms          0 ms      15728
BM_MemGrowthMultithreaded/1000000/real_time/threads:8            1 ms          4 ms       1272
BM_MemGrowthMultithreaded/10000000/real_time/threads:8           5 ms         41 ms        136
BM_MemGrowthMultithreaded/100000000/real_time/threads:8         51 ms        393 ms         16
BM_MemGrowthMultithreaded_BigO                                0.06 N       0.49 N
BM_MemGrowthMultithreaded_RMS                                    1 %          1 %

By default CPU timer is used to decide for how long to run the benchmark. UseRealTime tells it to use wall clock to decide for how long to run. Thread count shouldn't change the wall clock (Time), not sure about CPU. References: #671, #769, #763.

Ha, loved the name on #671 ! :-)

What's the primary use case for these "multithreaded" benchmarks?

Running the exact same code in parallel, and seeing how it scales with the number of threads, i guess?

Yeah, that's what I thought too. But if in the "multithreaded" benchmarks we always have an entire copy of the benchmark code per thread, and the work (=total #iterations?) is not divided amongst the different threads of computation, we're not measuring how it scales with the number of threads. I'm not sure exactly what we're measuring here, as it seems equivalent to running separate processes in parallel, and the final reported wall times just seem to be the time spent on main thread divided by number of threads.

I personally have never used it.

LebedevRI commented 5 years ago

Ok, what I'm saying is that the reported Time equals CPU divided by number of threads because that's what I'm seeing in the results for the toy multithreaded bench I posted above, which uses real time (UseRealTime()). That is, the reported Time is always matching CPU time divided by the number of threads (e.g., 598 ms = 1196 ms / 2, 333 = 1332 / 4, etc), as we can see below:

Aha. The CPU is how much the main thread has spent working. That measurement is correct here. All the threads total have spent roughly threadcountxCPU time. (well, assuming they did the exact same thing; again, you might be interested in MeasureProcessCPUTime()) But the Time is divided by thread count. Dunno why, it just is. The fact that the reportedTimeis matchingCPUtime divided by the number of threads is just a coincidence here.

So yes, that looks broken to me, although i suspect this might be intentional; sounds like #769.

I'm not sure exactly what we're measuring here, as it seems equivalent to running separate processes in parallel, and the final reported wall times just seem to be the time spent on main thread divided by number of threads.

Yes. I do suppose the name "multi-threaded" is misleading, not sure what would be better.

jmftrindade commented 5 years ago

Aha. The CPU is how much the main thread has spent working. That measurement is correct here. All the threads total have spent roughly threadcountxCPU time. (well, assuming they did the exact same thing; again, you might be interested in MeasureProcessCPUTime())

Ok, it looks like the newly introduced MeasureProcessCPUTime could be useful for measuring speedup gained from our own internal parallel implementation if we run with Threads(1), so I'll take a look. But MeasureProcessCPUTime is giving me almost the same as UseRealTime for all configs, so not sure it's making that much of a difference: https://pastebin.com/HB5MdXEc

But the Time is divided by thread count. Dunno why, it just is. The fact that the reportedTimeis matchingCPUtime divided by the number of threads is just a coincidence here.

Did you mean that it's a coincidence because both CPU and Time are divided by b.threads in benchmark_runner.cc? Then I understand what you mean, but as per #769, those adjustments are a bit weird: both Time and CPU reported will be almost exactly the same if we comment out the lines below in in benchmark_runner.cc:

  i.results.real_time_used /= b.threads;
  i.results.manual_time_used /= b.threads;

(https://pastebin.com/hiNrYSLi has timings with these lines commented out)

So yes, that looks broken to me, although i suspect this might be intentional; sounds like #769.

👍

I do suppose the name "multi-threaded" is misleading, not sure what would be better.

Yeah, me neither ¯\(ツ)

Let me know if I should close this issue, as mem usage seems to be working as expected for the current implementation of "multi-threaded" benchmarks.

LebedevRI commented 5 years ago

Ok, it looks like the newly introduced MeasureProcessCPUTime could be useful for measuring speedup gained from our own internal parallel implementation if we run with Threads(1), so I'll take a look. But MeasureProcessCPUTime is giving me almost the same as UseRealTime for all configs, so not sure it's making that much of a difference: https://pastebin.com/HB5MdXEc

IMHO MeasureProcessCPUTime() only makes sense when

  1. you don't use benchmark-level threading, i.e. only when Threads(1)
  2. AND you use threading within your benchmark, to do the same amount of work with less walltime

If there is no threading within the benchmark snippet, then naturally there is just one thread doing work, and naturally the entire cpu will only spend as much time as the main thread has spent.

Did you mean that it's a coincidence because both CPU and Time are divided by b.threads in benchmark_runner.cc?

I'm saying it's a coincidence that

https://github.com/google/benchmark/issues/806#issuecomment-488382817 That is, the reported Time is always matching CPU time divided by the number of threads

It happens to be the case here, because every thread is doing the exact same work. It wouldn't be so if one (or all) thread would just sleep with no work.

I do suppose the name "multi-threaded" is misleading, not sure what would be better.

Yeah, me neither ¯_(ツ)_/¯

Concurrent might be a bit better? Guess you should open a new issue for that.

Let me know if I should close this issue, as mem usage seems to be working as expected for the current implementation of "multi-threaded" benchmarks.

Would it have been better if any of this would be in docs / is there anything brought up in this issue appears to be omitted in docs?

jmftrindade commented 5 years ago

IMHO MeasureProcessCPUTime() only makes sense when

  1. you don't use benchmark-level threading, i.e. only when Threads(1)
  2. AND you use threading within your benchmark, to do the same amount of work with less walltime

Yeah, and #2 is the case for us. The snippet I posted is a toy bench to replicate the mem footprint "issue" I observed on my first attempt at using "multi-threaded" google benchmarks. My original goal in the actual bench was to try and measure the speedup of an additional level of parallelism for our parallel data structure. Namely, where it is also read by multiple threads. I thought we could use "multi-threaded" googlebenchmark for that, but I think it's not the case given its current semantics.

Concurrent might be a bit better? Guess you should open a new issue for that.

People tend to associate concurrency with shared state / traditional multi-threading. In this sense, the current implementation is acting more like "parallel", since they're not sharing state:

https://joearms.github.io/published/2013-04-05-concurrent-and-parallel-programming.html

But I still wouldn't use "parallel" as it reminds users of parallel programming. That's not what the benchmark threads are doing here, work is not partitioned amongst them for the same benchmark to make it go faster.

So I struggle with thinking of a good use case for the current "multi-threaded" benchmark semantics. Will open a new issue for that, as you suggested, and move these comments there. Existing use cases for it may help inform the decision here.

Would it have been better if any of this would be in docs / is there anything brought up in this issue appears to be omitted in docs?

I think so, yes. Perhaps a sentence or two on the README regarding how the "multi-threaded" implementation spawns self-contained benchmark instances, and users should expect memory footprint for the benchmark program to grow linearly in the number of threads.

LebedevRI commented 5 years ago

My original goal in the actual bench was to try and measure the speedup of an additional level of parallelism for our parallel data structure. Namely, where it is also read by multiple threads. I thought we could use "multi-threaded" googlebenchmark for that, but I think it's not the case given its current semantics.

Makes sense, and is indeed not what "multi-threaded" googlebenchmark is about. I also had this issue, and fixed it by not using Threads(?), but manually adding parametrized benchmark entries, passing the increasing thread count to each, and using the specified thread count to set the number of threads to use. https://github.com/darktable-org/rawspeed/blob/5f6e9eb327fb8ab9d9b66b30523d7ffea110866c/src/utilities/rsbench/main.cpp#L174-L182 https://github.com/darktable-org/rawspeed/blob/5f6e9eb327fb8ab9d9b66b30523d7ffea110866c/src/utilities/rsbench/main.cpp#L212-L215 https://github.com/darktable-org/rawspeed/blob/5f6e9eb327fb8ab9d9b66b30523d7ffea110866c/src/utilities/rsbench/main.cpp#L94-L100 rawspeed_get_number_of_processor_cores() being the function that i specify in every num_threads() clause of omp parallel directive. https://github.com/darktable-org/rawspeed/blob/5f6e9eb327fb8ab9d9b66b30523d7ffea110866c/src/librawspeed/decompressors/VC5Decompressor.cpp#L721

...

813