jmftrindade opened this issue 5 years ago
I doubt this is in any way benchmark-library-specific. Does the actual memory get freed, or is it too only growing?
I suspect your problem is that you're adding the user-defined counter (state.counters) inside the state loop. This will be adding thousands to millions of entries into the map.
If you look at the example tests that use user counters, https://github.com/google/benchmark/blob/478eafa36bb8763e04c61d88bb2b8e9fa3440b82/test/user_counters_test.cc#L25, they set them outside the state loop.
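For instance, a minimal sketch in that spirit (hypothetical benchmark, not from this thread): the counter is assigned once, after the measurement loop, so state.counters holds a single entry no matter how many iterations ran.

```c++
#include <cstdint>

#include "benchmark/benchmark.h"

static void BM_SetCounterOnce(benchmark::State& state) {
  int64_t items = 0;
  for (auto _ : state) {
    items += state.range(0);  // stand-in for the measured work
  }
  // One assignment, outside the loop: the map keeps exactly one entry.
  state.counters["items"] = static_cast<double>(items);
}
BENCHMARK(BM_SetCounterOnce)->Arg(1000);
BENCHMARK_MAIN();
```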
Oh, i missed that, it indeed looks very suspicious.
Thanks for looking into this. Memory growth behavior ends up being pretty much the same, however, whether counters are set outside the loop (or even disabled) or not.
Moving the setting of the counter outside of the state loop is ok in our case -- since it is tracking a deterministic measure that doesn't change across different runs for the same range -- so I did that for our actual benchmarks. I imagine there could be valid use cases for updating a user counter within the state loop though. For example, when the metric of interest changes across iterations (in one of our benchmarks we have a non-deterministic result that we'd like to correlate with wall time) and the user would like to keep track of a summary stat (e.g., avg) for values across iterations of the same benchmark range. Are there other ways to have user counters keep track of values across iterations for the same value of state.range(0), aside from setting the user counter within the loop?
Also, since state.counters is a map<std::string, Counter>, updating the same key with a different value more frequently should keep memory usage roughly the same (it just invokes Counter's constructor more often, but overwrites the same key in state.counters). It's an oversimplification, but in the snippet below, mem usage is the same regardless of the upper bound on the for loop:
```c++
#include <map>
#include <string>

#include "benchmark/benchmark.h"

int main() {
  std::map<std::string, benchmark::Counter> m;
  size_t upper = 1000000000;
  for (size_t i = 0; i < upper; ++i) {
    // Overwrites the same key every iteration, so the map never grows.
    m["counter_name"] = benchmark::Counter(i);
  }
}
```
As I reported above, I only see this high mem usage in our multithreaded benchmarks. And the only change that seems to reduce memory footprint is to reduce the number of threads. In fact, if you take the original snippet and change Threads(2), Threads(4), and Threads(8) all to Threads(1), virt mem stays the same at 9.37G for the entire run of the 4 benchmarks.
For multithreaded, virt mem grows by over 7G per additional thread (9.37G with 1 thread, 18.3G for 2, 32.8G for 4, and 61.1G for 8), and peak actual mem usage is not too much smaller than virtual. Could it be because each additional benchmark thread registers a separate benchmark instance? And does that imply that each thread has an entire copy of the benchmark? That would also explain why the user counter has the sum of values for all threads unless set as Counter::kAvgThreads, and why mem footprint increases so much as a function of the number of threads.
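A minimal sketch of that aggregation behavior (hypothetical benchmark, not from this thread): each thread instance sets the counter once, and without flags the report shows the sum over threads, while Counter::kAvgThreads averages it instead.

```c++
#include "benchmark/benchmark.h"

static void BM_PerThreadCounter(benchmark::State& state) {
  for (auto _ : state) {
    benchmark::DoNotOptimize(state.range(0));
  }
  // Each of the 4 thread instances sets this once; without kAvgThreads the
  // report would show 4 * 42, with it the per-thread value 42.
  state.counters["value"] =
      benchmark::Counter(42.0, benchmark::Counter::kAvgThreads);
}
BENCHMARK(BM_PerThreadCounter)->Arg(1000)->Threads(4);
BENCHMARK_MAIN();
```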
BTW, only virtual memory doesn't get freed; actual memory does. I'm running this on an Ubuntu box with only 64G of RAM and a fairly conservative swappiness setting, hence why virtual memory is not getting freed. I was still puzzled that what I thought was thread-local storage was so high, but it sounds like multithreaded benchmarks are entire runs of the same benchmark, each run as a unit in parallel. Is that an accurate description?
Are there other ways to have user counters keep track of values across iterations for the same value of state.range(0), aside from setting the user counter within the loop?
You should consider setting the value outside of the for (auto _ : state) loop, so it does not affect the measurements.
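One way to do that without touching state.counters inside the loop (a sketch, assuming the Counter::kAvgIterations flag, which divides the reported total by the iteration count): accumulate the per-iteration metric in a local variable and set the counter once afterwards.

```c++
#include "benchmark/benchmark.h"

static void BM_AvgMetricPerIteration(benchmark::State& state) {
  double total_metric = 0.0;
  for (auto _ : state) {
    // Hypothetical non-deterministic per-iteration metric of interest.
    double metric = static_cast<double>(state.range(0));
    benchmark::DoNotOptimize(metric);
    total_metric += metric;
  }
  // Set once, outside the loop; kAvgIterations reports total / iterations.
  state.counters["metric"] =
      benchmark::Counter(total_metric, benchmark::Counter::kAvgIterations);
}
BENCHMARK(BM_AvgMetricPerIteration)->Arg(1000);
BENCHMARK_MAIN();
```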
Could it be because each additional benchmark thread registers a separate benchmark instance? And does that imply that each thread has an entire copy of the benchmark? That would also explain why the user counter has the sum of values for all threads unless set as Counter::kAvgThreads, and why mem footprint increases so much as a function of the number of threads. I was still puzzled that what I thought was thread-local storage was so high, but it sounds like multithreaded benchmarks are entire runs of the same benchmark, each run as a unit in parallel. Is that an accurate description?
Yes, every thread is unique; that much is, i hope, stated in the docs:
https://github.com/google/benchmark#multithreaded-benchmarks
Multithreaded Benchmarks
In a multithreaded test (benchmark invoked by multiple threads simultaneously),
it is guaranteed that none of the threads will start until all have reached the start of
the benchmark loop, and all will have finished before any thread exits the benchmark loop.
(This behavior is also provided by the KeepRunning() API)
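Concretely, a sketch of what "every thread is unique" means here (hypothetical benchmark; state.thread_index is a plain member of State in library versions from around this thread's time): with ->Threads(4), the whole function below runs once in each of the 4 threads, and each thread executes its full iteration count.

```c++
#include "benchmark/benchmark.h"

static void BM_WholeBodyPerThread(benchmark::State& state) {
  if (state.thread_index == 0) {
    // Optional one-time setup done by the first thread.
  }
  for (auto _ : state) {
    benchmark::DoNotOptimize(state.thread_index);  // every thread runs this
  }
}
// 4 independent copies of the body, synchronized at the loop boundaries.
BENCHMARK(BM_WholeBodyPerThread)->Threads(4);
BENCHMARK_MAIN();
```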
Ok, thanks. Now I also understand why with UseRealTime I get smaller runtimes as I increase the number of threads -- even though these threads are not sharing any benchmark state, the "real time" is just total CPU time over all separate runs of the benchmark, divided by number of threads.
I was pleasantly surprised that I didn't have to add my own thread synchronization, and was getting a beautiful linear scaling with number of threads for the code we're benchmarking, but now I get why :-)
What's the primary use case for these "multithreaded" benchmarks? The name is a bit deceiving, as it implies a single benchmark being run by multiple threads sharing state, and likely partitioning the number of iterations amongst themselves evenly.
Now I also understand why with UseRealTime I get smaller runtimes as I increase the number of threads -- even though these threads are not sharing any benchmark state, the "real time" is just total CPU time over all separate runs of the benchmark, divided by number of threads.
I'm not sure that assessment is accurate, but that being said i'm not sure about which timer you are talking about.
There are two timers - Time and CPU. Time is wall time, CPU is time of the main thread only.
By default the CPU timer is used to decide for how long to run the benchmark. UseRealTime tells it to use the wall clock to decide for how long to run.
Thread count shouldn't change the wall clock (Time), not sure about CPU.
References: #671, #769, #763.
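For reference, a sketch of where UseRealTime() fits in the registration (the body is an assumption based on the "adds numbers to a vector" description earlier in the thread; only the modifiers matter here):

```c++
#include <cstdint>
#include <vector>

#include "benchmark/benchmark.h"

static void BM_MemGrowthMultithreaded(benchmark::State& state) {
  for (auto _ : state) {
    std::vector<int64_t> v;
    for (int64_t i = 0; i < state.range(0); ++i) v.push_back(i);  // assumed body
    benchmark::DoNotOptimize(v.data());
  }
  state.SetComplexityN(state.range(0));
}
BENCHMARK(BM_MemGrowthMultithreaded)
    ->RangeMultiplier(10)
    ->Range(1000, 500000000)
    ->Complexity(benchmark::oN)
    ->UseRealTime()  // the wall clock decides when enough iterations have run
    ->Threads(2);
BENCHMARK_MAIN();
```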
What's the primary use case for these "multithreaded" benchmarks?
Running the exact same code in parallel, and seeing how it scales with the number of threads, i guess? I personally have never used it.
Now I also understand why with UseRealTime I get smaller runtimes as I increase the number of threads -- even though these threads are not sharing any benchmark state, the "real time" is just total CPU time over all separate runs of the benchmark, divided by number of threads.

I'm not sure that assessment is accurate, but that being said i'm not sure about which timer you are talking about.
There are two timers - Time and CPU. Time is wall time, CPU is time of the main thread only.
Ok, what I'm saying is that the reported Time equals CPU divided by number of threads, because that's what I'm seeing in the results for the toy multithreaded bench I posted above, which uses real time (UseRealTime()). That is, the reported Time always matches CPU time divided by the number of threads (e.g., 598 ms = 1196 ms / 2, 333 = 1332 / 4, etc.), as we can see below:
```
-----------------------------------------------------------------------------------------------
Benchmark                                                    Time        CPU   Iterations
-----------------------------------------------------------------------------------------------
BM_MemGrowthMultithreaded/1000/real_time/threads:1           0 ms       0 ms      1229603
BM_MemGrowthMultithreaded/10000/real_time/threads:1          0 ms       0 ms        27941
BM_MemGrowthMultithreaded/100000/real_time/threads:1         0 ms       0 ms         3453
BM_MemGrowthMultithreaded/1000000/real_time/threads:1        2 ms       2 ms          326
BM_MemGrowthMultithreaded/10000000/real_time/threads:1      24 ms      24 ms           30
BM_MemGrowthMultithreaded/100000000/real_time/threads:1    227 ms     227 ms            3
BM_MemGrowthMultithreaded/500000000/real_time/threads:1   1096 ms    1096 ms            1
BM_MemGrowthMultithreaded_BigO                             2.19 N     2.19 N
BM_MemGrowthMultithreaded_RMS                                 2 %        2 %
BM_MemGrowthMultithreaded/1000/real_time/threads:2           0 ms       0 ms      2557944
BM_MemGrowthMultithreaded/10000/real_time/threads:2          0 ms       0 ms        87592
BM_MemGrowthMultithreaded/100000/real_time/threads:2         0 ms       0 ms         6408
BM_MemGrowthMultithreaded/1000000/real_time/threads:2        1 ms       3 ms          556
BM_MemGrowthMultithreaded/10000000/real_time/threads:2      13 ms      26 ms           56
BM_MemGrowthMultithreaded/100000000/real_time/threads:2    124 ms     249 ms            6
BM_MemGrowthMultithreaded/500000000/real_time/threads:2    598 ms    1196 ms            2
BM_MemGrowthMultithreaded_BigO                             0.60 N     1.20 N
BM_MemGrowthMultithreaded_RMS                                 2 %        2 %
BM_MemGrowthMultithreaded/1000/real_time/threads:4           0 ms       0 ms      4918924
BM_MemGrowthMultithreaded/10000/real_time/threads:4          0 ms       0 ms        83564
BM_MemGrowthMultithreaded/100000/real_time/threads:4         0 ms       0 ms        11708
BM_MemGrowthMultithreaded/1000000/real_time/threads:4        1 ms       3 ms          976
BM_MemGrowthMultithreaded/10000000/real_time/threads:4       7 ms      29 ms          100
BM_MemGrowthMultithreaded/100000000/real_time/threads:4     68 ms     272 ms            8
BM_MemGrowthMultithreaded/500000000/real_time/threads:4    333 ms    1332 ms            4
BM_MemGrowthMultithreaded_BigO                             0.17 N     0.67 N
BM_MemGrowthMultithreaded_RMS                                 1 %        1 %
BM_MemGrowthMultithreaded/1000/real_time/threads:8           0 ms       0 ms      4990984
BM_MemGrowthMultithreaded/10000/real_time/threads:8          0 ms       0 ms        85288
BM_MemGrowthMultithreaded/100000/real_time/threads:8         0 ms       0 ms        15728
BM_MemGrowthMultithreaded/1000000/real_time/threads:8        1 ms       4 ms         1272
BM_MemGrowthMultithreaded/10000000/real_time/threads:8       5 ms      41 ms          136
BM_MemGrowthMultithreaded/100000000/real_time/threads:8     51 ms     393 ms           16
BM_MemGrowthMultithreaded_BigO                             0.06 N     0.49 N
BM_MemGrowthMultithreaded_RMS                                 1 %        1 %
```
By default the CPU timer is used to decide for how long to run the benchmark. UseRealTime tells it to use the wall clock to decide for how long to run. Thread count shouldn't change the wall clock (Time), not sure about CPU. References: #671, #769, #763.
Ha, loved the name on #671! :-)
What's the primary use case for these "multithreaded" benchmarks?
Running the exact same code in parallel, and seeing how it scales with the number of threads, i guess?
Yeah, that's what I thought too. But if in the "multithreaded" benchmarks we always have an entire copy of the benchmark code per thread, and the work (=total #iterations?) is not divided amongst the different threads of computation, we're not measuring how it scales with the number of threads. I'm not sure exactly what we're measuring here, as it seems equivalent to running separate processes in parallel, and the final reported wall times just seem to be the time spent on main thread divided by number of threads.
I personally have never used it.
Ok, what I'm saying is that the reported Time equals CPU divided by number of threads, because that's what I'm seeing in the results for the toy multithreaded bench I posted above, which uses real time (UseRealTime()). That is, the reported Time always matches CPU time divided by the number of threads (e.g., 598 ms = 1196 ms / 2, 333 = 1332 / 4, etc.), as we can see below:
Aha.
The CPU is how much the main thread has spent working. That measurement is correct here.
All the threads total have spent roughly threadcount x CPU time. (well, assuming they did the exact same thing; again, you might be interested in MeasureProcessCPUTime())
But the Time is divided by thread count. Dunno why, it just is.
The fact that the reported Time is matching CPU time divided by the number of threads is just a coincidence here.
So yes, that looks broken to me, although i suspect this might be intentional; sounds like #769.
I'm not sure exactly what we're measuring here, as it seems equivalent to running separate processes in parallel, and the final reported wall times just seem to be the time spent on main thread divided by number of threads.
Yes. I do suppose the name "multi-threaded" is misleading, not sure what would be better.
Aha. The CPU is how much the main thread has spent working. That measurement is correct here. All the threads total have spent roughly threadcount x CPU time. (well, assuming they did the exact same thing; again, you might be interested in MeasureProcessCPUTime())
Ok, it looks like the newly introduced MeasureProcessCPUTime could be useful for measuring the speedup gained from our own internal parallel implementation if we run with Threads(1), so I'll take a look. But MeasureProcessCPUTime is giving me almost the same as UseRealTime for all configs, so I'm not sure it's making that much of a difference: https://pastebin.com/HB5MdXEc
But the Time is divided by thread count. Dunno why, it just is. The fact that the reported Time is matching CPU time divided by the number of threads is just a coincidence here.
Did you mean that it's a coincidence because both CPU and Time are divided by b.threads in benchmark_runner.cc? Then I understand what you mean, but as per #769, those adjustments are a bit weird: both the reported Time and CPU will be almost exactly the same if we comment out the lines below in benchmark_runner.cc:

```c++
i.results.real_time_used /= b.threads;
i.results.manual_time_used /= b.threads;
```

(https://pastebin.com/hiNrYSLi has timings with these lines commented out)
So yes, that looks broken to me, although i suspect this might be intentional; sounds like #769.
👍
I do suppose the name "multi-threaded" is misleading, not sure what would be better.
Yeah, me neither ¯\\_(ツ)_/¯
Let me know if I should close this issue, as mem usage seems to be working as expected for the current implementation of "multi-threaded" benchmarks.
Ok, it looks like the newly introduced MeasureProcessCPUTime could be useful for measuring the speedup gained from our own internal parallel implementation if we run with Threads(1), so I'll take a look. But MeasureProcessCPUTime is giving me almost the same as UseRealTime for all configs, so I'm not sure it's making that much of a difference: https://pastebin.com/HB5MdXEc
IMHO MeasureProcessCPUTime() only makes sense when
- you don't use benchmark-level threading, i.e. only when Threads(1)
- AND you use threading within your benchmark, to do the same amount of work with less walltime

If there is no threading within the benchmark snippet, then naturally there is just one thread doing work, and naturally the entire CPU will only spend as much time as the main thread has spent.
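A sketch of that configuration (hypothetical benchmark; the internal workers stand in for the benchmarked code's own threading): with Threads(1) and MeasureProcessCPUTime(), the CPU column covers all of the process's threads rather than just the main one.

```c++
#include <thread>
#include <vector>

#include "benchmark/benchmark.h"

static void BM_InternalThreading(benchmark::State& state) {
  for (auto _ : state) {
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
      workers.emplace_back([] {
        volatile double x = 0;
        for (int j = 0; j < 1000000; ++j) x += j;  // busy work per worker
      });
    }
    for (auto& w : workers) w.join();
  }
}
BENCHMARK(BM_InternalThreading)
    ->MeasureProcessCPUTime()  // CPU = whole-process CPU time
    ->UseRealTime()
    ->Threads(1);  // no benchmark-level threading
BENCHMARK_MAIN();
```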
Did you mean that it's a coincidence because both CPU and Time are divided by b.threads in benchmark_runner.cc?
I'm saying it's a coincidence that https://github.com/google/benchmark/issues/806#issuecomment-488382817

That is, the reported Time is always matching CPU time divided by the number of threads

It happens to be the case here, because every thread is doing the exact same work. It wouldn't be so if one (or all) of the threads would just sleep with no work.
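A quick sketch of that counterexample (hypothetical benchmark): every thread just sleeps, so wall time accumulates while CPU time barely does, and Time would no longer equal CPU divided by the thread count.

```c++
#include <chrono>
#include <thread>

#include "benchmark/benchmark.h"

static void BM_SleepOnly(benchmark::State& state) {
  for (auto _ : state) {
    // No CPU work at all; only wall time passes.
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
}
BENCHMARK(BM_SleepOnly)->UseRealTime()->Threads(4);
BENCHMARK_MAIN();
```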
I do suppose the name "multi-threaded" is misleading, not sure what would be better.
Yeah, me neither ¯\\_(ツ)_/¯
Concurrent might be a bit better? Guess you should open a new issue for that.
Let me know if I should close this issue, as mem usage seems to be working as expected for the current implementation of "multi-threaded" benchmarks.
Would it have been better if any of this were in the docs / is there anything brought up in this issue that appears to be omitted from the docs?
IMHO MeasureProcessCPUTime() only makes sense when
- you don't use benchmark-level threading, i.e. only when Threads(1)
- AND you use threading within your benchmark, to do the same amount of work with less walltime
Yeah, and #2 is the case for us. The snippet I posted is a toy bench to replicate the mem footprint "issue" I observed on my first attempt at using "multi-threaded" google benchmarks. My original goal in the actual bench was to try and measure the speedup of an additional level of parallelism for our parallel data structure. Namely, where it is also read by multiple threads. I thought we could use "multi-threaded" googlebenchmark for that, but I think it's not the case given its current semantics.
Concurrent might be a bit better? Guess you should open a new issue for that.
People tend to associate concurrency with shared state / traditional multi-threading. In this sense, the current implementation is acting more like "parallel", since they're not sharing state:
https://joearms.github.io/published/2013-04-05-concurrent-and-parallel-programming.html
But I still wouldn't use "parallel", as it reminds users of parallel programming. That's not what the benchmark threads are doing here: work is not partitioned amongst them to make the same benchmark go faster.
So I struggle with thinking of a good use case for the current "multi-threaded" benchmark semantics. Will open a new issue for that, as you suggested, and move these comments there. Existing use cases for it may help inform the decision here.
Would it have been better if any of this would be in docs / is there anything brought up in this issue appears to be omitted in docs?
I think so, yes. Perhaps a sentence or two on the README regarding how the "multi-threaded" implementation spawns self-contained benchmark instances, and users should expect memory footprint for the benchmark program to grow linearly in the number of threads.
My original goal in the actual bench was to try and measure the speedup of an additional level of parallelism for our parallel data structure. Namely, where it is also read by multiple threads. I thought we could use "multi-threaded" googlebenchmark for that, but I think it's not the case given its current semantics.
Makes sense, and is indeed not what "multi-threaded" googlebenchmark is about.
I also had this issue, and fixed it by not using Threads(?), but manually adding parametrized benchmark entries, passing the increasing thread count to each, and using the specified thread count to set the number of threads to use.
https://github.com/darktable-org/rawspeed/blob/5f6e9eb327fb8ab9d9b66b30523d7ffea110866c/src/utilities/rsbench/main.cpp#L174-L182
https://github.com/darktable-org/rawspeed/blob/5f6e9eb327fb8ab9d9b66b30523d7ffea110866c/src/utilities/rsbench/main.cpp#L212-L215
https://github.com/darktable-org/rawspeed/blob/5f6e9eb327fb8ab9d9b66b30523d7ffea110866c/src/utilities/rsbench/main.cpp#L94-L100
rawspeed_get_number_of_processor_cores() being the function that i specify in every num_threads() clause of the omp parallel directive.
https://github.com/darktable-org/rawspeed/blob/5f6e9eb327fb8ab9d9b66b30523d7ffea110866c/src/librawspeed/decompressors/VC5Decompressor.cpp#L721
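A condensed sketch of that pattern (hypothetical names; DoWork stands in for the parallel code under test): register one entry per thread count via Arg() instead of Threads(), and size the internal workers from the argument.

```c++
#include <thread>
#include <vector>

#include "benchmark/benchmark.h"

// Hypothetical stand-in for code that parallelizes internally.
static void DoWork(int num_threads) {
  std::vector<std::thread> workers;
  for (int i = 0; i < num_threads; ++i) {
    workers.emplace_back([] { /* one slice of the work */ });
  }
  for (auto& w : workers) w.join();
}

static void BM_Scaling(benchmark::State& state) {
  const int num_threads = static_cast<int>(state.range(0));
  for (auto _ : state) {
    DoWork(num_threads);  // the benchmark itself stays single-threaded
  }
}
// One entry per thread count, instead of ->Threads(n).
BENCHMARK(BM_Scaling)->UseRealTime()->Arg(1)->Arg(2)->Arg(4)->Arg(8);
BENCHMARK_MAIN();
```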
...
Hi there! It looks like on multithreaded benchmarks virtual memory keeps growing and is never freed. Valgrind only shows a leak of 240 bytes due to https://github.com/google/benchmark/issues/782, so I'm not sure why virtual mem only keeps growing. Also, I only see this on multithreaded benchmarks.
For example, on the bench below, which just adds numbers to a vector, virtual mem quickly gets to upwards of 60GB and never seems to get freed in between benchmark runs:
Is this expected behavior? Thanks! --Joana