sharkdp opened this issue 6 years ago
I'd be interested in putting this functionality into the tool. I guess we'd take it in as an option and have to change how the execution state is currently being tracked. Unless you were of the opinion that this would be the new default way to run timings.
I would like to discuss this first before we put this into action.
In principle, the rationale behind this feature sounds sensible. However, we also have a few features that already work in a similar direction:
Adding a `--interleave-benchmarks` option would have a few disadvantages (when used):

- If the benchmarked commands use some form of caching (the typical reason for using `--warmup`), the interleaving could actually interfere if the two (or more) commands do not use the same cache.

Also, simplicity is a desirable goal for me. If there are command line options that we don't necessarily need, I would tend to leave them out.
That's true, the benefits are minimal and may not be worth the downsides/problems presented. While an interesting idea in general, it may not be the right approach for this tool. If there are already enough other facilities available in the existing feature set, this should probably be closed and we can move on to the other issues.
I'm benchmarking some code that interacts heavily with the filesystem and OS, and sorely missing this functionality. Instantaneous load spikes are one thing, since they can be removed by outlier detection or at least detected by a high variance. But there are also many ways in which the state of the system can vary slowly over time. For example:
Thermal throttling, as the OP mentioned. On my MacBook Pro this kicks in after only a few seconds of high load (saturating all CPUs), but it accumulates over time, and recovery also takes time. (Of course, desktops will often have enough thermal headroom to never throttle.) If you happen to be on macOS you can watch the current throttling level by running `pmset -g thermlog`; I don't know the equivalent for other OSes.
OS background activity: long-running processes may spin up and down over time. Of course, ideally you would disable all such processes before running benchmarks, but it's easy to miss some.
The kernel page cache will gradually discard (if clean) or swap out (if dirty) memory that hasn't been accessed recently. Similarly, the vnode/inode cache (as it's called on macOS and Linux respectively) will throw out files that haven't been accessed recently. Admittedly, in most cases this shouldn't have much effect on a benchmark which mostly keeps performing the same operations – and thus touching the same pages and files – over and over. But it can have some effect.
SSDs often have their own DRAM caches which can fill over time before gradually draining to the actual flash storage.
When system state changes slowly enough, it won't show up in the form of outlier runs; it will just bias the results in favor of whichever parameters were tested under more favorable conditions.
While not a panacea, interleaving would greatly mitigate this effect.
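To make the drift effect concrete, here is a small, self-contained simulation sketch (all numbers are made up for illustration): two equally fast commands are benchmarked under a slow linear slowdown, once back-to-back and once interleaved.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200                       # runs per command
true_time = 1.00              # both commands are equally fast (seconds)
drift = 0.002                 # slowdown added per run (e.g. thermal throttling)

def run(start, step):
    # simulated wall-clock times: true time + slow drift + measurement noise
    idx = start + step * np.arange(n)
    return true_time + drift * idx + rng.normal(0, 0.01, n)

# back-to-back: all runs of A first, then all runs of B
seq_a, seq_b = run(0, 1), run(n, 1)
# interleaved: A on even iterations, B on odd iterations
int_a, int_b = run(0, 2), run(1, 2)

print(f"back-to-back: mean(A) - mean(B) = {seq_a.mean() - seq_b.mean():+.3f} s")
print(f"interleaved:  mean(A) - mean(B) = {int_a.mean() - int_b.mean():+.3f} s")
```

With these made-up numbers, the back-to-back comparison reports a spurious difference of roughly 0.4 s, while the interleaved comparison stays close to zero.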
Thank you very much for the detailed explanation. All of this sounds reasonable.
The strategy for how this could be used in hyperfine is not completely clear to me (see "disadvantages" above), but I am opening this for further discussion.
I find myself in the thermal throttling on a laptop camp as well. I also have other cases that don't make sense — for example, a recent test run I did was:
with change: 3.551s
without change: 3.439s
with change: 3.311s
My "workaround" is to try and scale up the warmup and test iterations in a (futile?) attempt to try to heat-saturate the processor so that the before and after binaries run under similar throttled conditions.
Another potential solution for this particular case would be to run the tests in both orders (A then B, then B then A) and ensure that the A->B and B->A runs report "close enough" to each other.
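A rough sketch of how such a reversed-order sanity check could be scripted (the command names `./before` and `./after` and the run count are placeholders; hyperfine's `--export-json` output is assumed to contain a `results` list with `command` and `mean` fields):

```python
import json
import subprocess

# placeholder commands; substitute the real before/after binaries
cmds = ["./before", "./after"]

def bench(order, path):
    # one hyperfine invocation with the commands in the given order
    subprocess.run(
        ["hyperfine", "--runs", "25", "--export-json", path, *order],
        check=True,
    )
    with open(path) as f:
        return {r["command"]: r["mean"] for r in json.load(f)["results"]}

forward = bench(cmds, "a_then_b.json")                   # A -> B
backward = bench(list(reversed(cmds)), "b_then_a.json")  # B -> A

for cmd in cmds:
    print(f"{cmd}: {forward[cmd]:.3f}s vs {backward[cmd]:.3f}s "
          f"(order effect: {abs(forward[cmd] - backward[cmd]):.3f}s)")
```

If the two orderings disagree by more than the reported noise, the results are likely being biased by slowly changing system state.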
> I find myself in the thermal throttling on a laptop camp as well.
I think we should try to find a solution for that and document it here in the repository. I'm not yet convinced that the proposal in this ticket is the best solution (it might be).
Alternative solutions that come to my mind right now:
- Setting the CPU frequency scaling governor to `performance`. The C++ micro benchmark library gbenchmark actually prints a warning if `/sys/devices/system/cpu/*/cpufreq/scaling_governor` is set to anything other than `performance`.
- Using `--prepare "sleep 10"` (with a sufficiently large time) to make sure that everything has cooled down before each timing run.

> I also have other cases that don't make sense — for example, a recent test run I did was: with change: 3.551s, without change: 3.439s, with change: 3.311s
Could you go into more detail? These numbers are pretty much meaningless without the standard deviation. Could it be just noise, or is this reproducible? Are you sure that all caching effects are eliminated (either via warmup runs or by resetting the caches with `--prepare`)?
Could you go into more detail?
I'd rather not... but I will 😢
In that specific case, I was neglecting to build my binary correctly. I was doing `cargo build --release` instead of `cargo build --release -p my-binary`. That specific case was pure user error. 🤦
That being said, that's an example of some of the variance I get between three test runs of 10 or 25 iterations (I forget which at this point).
> That being said, that's an example of some of the variance I get between three test runs of 10 or 25 iterations (I forget which at this point).
Ok, but do you consider this to be a problem? That's like a ~5% variation in runtime. What did `hyperfine` actually report as standard deviation? How IO-heavy is your process (IO-heavy processes tend to be much noisier when benchmarked)?
Not much to add here, but just to say I'd also be happy to have this option. I've been benchmarking a very IO-heavy program (reading & writing GBs of data) and the benchmarks are very unreliable from one run to the next (even though I'm clearing disk caches on every iteration). Which version of the program is run first influences the results a lot.
> Not much to add here, but just to say I'd also be happy to have this option. I've been benchmarking a very IO-heavy program (reading & writing GBs of data) and the benchmarks are very unreliable from one run to the next (even though I'm clearing disk caches on every iteration). Which version of the program is run first influences the results a lot.
If this is a consistent result, I assume that some other form of caching is going on somewhere.
True. But if it's, say, an SSD's internal SLC cache, I don't think there's any way to clear it other than just waiting.
Even if you do wait, both the filesystem's and the SSD's internal data structures are going to end up in different states than they started in, which might affect performance, albeit probably not by much.
Hi, just realized that I accidentally made a duplicate of this issue and wanted to post some of my context here.
I have run into issues with benchmarks that are very sensitive to minor condition changes, but have a clear speed difference when run correctly. For example, I have done some benchmarking of compiling rather small files and also doing linking. That does a solid amount of file IO and will always have a decent amount of deviation. I have found that in some of these cases, if the benchmarks are run non-interleaved, I have to do something like `-w 100 -m 1000` to get accurate results about which process is faster. This will get accurate results but will take a while to run. Since I am only interested in which case is faster, I don't really care about getting exact results (just clear performance indications). As a result, on multiple occasions I have hacked together Python scripts that time the results with interleaving. Even with a relatively small number of runs, equivalent to something like `-w 10 -m 50`, I am able to get rather consistent indications of which script is faster and approximately by how much. The important thing is that both test runs get hit pretty evenly by all of the performance issues (scheduling, CPU temp, disk IO from anything else on the machine, etc.).
I have also seen similar cases when measuring certain hot-loop code that has rather minor performance optimizations. If I simply run the scripts non-interleaved, and don't do it for long enough, the script that runs first will almost always be the faster one. To my knowledge, this is simply due to the CPU heating up and thermal throttling. For something I was running earlier today, I started testing with `-w 1000 -m 5000` in order to get consistent results. I think that some types of testing simply perform very poorly due to even minor amounts of system noise. As such, I found that interleaved testing can be paramount to getting quick results.
I guess as an extra note, the other option to enable shorter runs while getting decent results is to fully reboot the machine and make sure not to open apps that could interfere with benchmarking. This works, but is a lot of extra effort to deal with something that is mostly fixed by interleaving.
Hopefully these comments help to present a clear picture of why this feature could be extremely useful.
PS: I run ZFS even on a laptop, and it can really lead to less consistent results with benchmarks that do any sort of file IO, even with `--prepare` dropping caches.
Oh, one extra note: I think standard error is much more important than standard deviation in these cases. For example, if you have a script that takes 20 milliseconds to run with a standard deviation of 5 ms, and another script that only takes 18 ms but also has a standard deviation of 5 ms, looking at mean ± standard deviation will never tell you which is faster. That being said, after enough runs the standard error will trend towards zero, making it clear that the 18 ms script is actually faster. We may want to also expose standard error, because it will help to properly distinguish whether the difference was just due to noise and regular deviation, or if the results were meaningfully different.
> Oh, one extra note: I think standard error is much more important than standard deviation in these cases. […]
Thanks for bringing this up. I have thought about this often, and I was never sure which would be better to show. Both values are very useful. The problem is that most users probably don't know the difference very well.
But you are absolutely right, of course. A command with 18 ± 5 ms can be faster than 20 ± 5 ms in a statistically significant way, but there is no way to tell from the current output. We do, however, ship a `scripts/welch_ttest.py` Python script that performs a statistical test on exported benchmark results to answer this particular question: is command A actually faster than command B?
We also have a `scripts/advanced_statistics.py` script where we could very easily include the standard error.
But I'm also open to discussions whether or not this should be in the default output of hyperfine.
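As a rough illustration (not the actual `scripts/advanced_statistics.py`), the standard error could be computed from an exported benchmark like this, assuming the usual `--export-json` layout with a `results` list containing the per-run `times`:

```python
import json
import sys

import numpy as np

# usage: python standard_error.py benchmark.json
with open(sys.argv[1]) as f:
    results = json.load(f)["results"]

for r in results:
    times = np.asarray(r["times"])      # individual run times, in seconds
    std = times.std(ddof=1)             # standard deviation of a single run
    sem = std / np.sqrt(len(times))     # standard error of the mean
    print(f"{r['command']}: {times.mean() * 1000:.2f} ms "
          f"± {std * 1000:.2f} ms (σ), ± {sem * 1000:.2f} ms (SE)")
```

Unlike the standard deviation, the standard error shrinks as the number of runs grows, which is exactly the property the comment above relies on.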
> I have run into issues with benchmarks that are very sensitive to minor condition changes, but have a clear speed difference when run correctly. […]
It would be great if you could perform such a benchmark in a naive kind of way and share the results. Something like:

```bash
hyperfine --runs 1000 --export-json benchmark.json "cmd1" "cmd2"
hyperfine --runs 1000 --export-json benchmark_reversed.json "cmd2" "cmd1"
```
I would love to take a closer look at the results and the progression of the runtimes.
Ok, so here is an example of a command whose timing can be a nightmare. Over a very long-term average, this command will run in about 18 ms. If you look at the sample timing that I posted, the timing average until line 646 is approximately 18 ms, but after line 646 the average looks to be closer to 12 ms.
I am almost 100% certain the timing issues are related to the amount of mmapping and file io done, but I am specifically trying to time the command with a warm cache. This command is essentially never used with a cold cache and those timings are not helpful. When timing this command against other variants of itself, or similar commands, they will generally run into the same sorts of crazy timing jumps. Generally speaking, when interleaved, both versions get timing peaks and valleys at the same time. Hopefully this gives enough context even if it isn't exactly what you asked for.
I just accidentally recorded an even better example.
I wanted to test 3 versions of an executable against a different command. I expected hyperfine would run: `version1, version2, version3, different command`. Instead, it ran: `version1, different command, version2, different command, version3, different command`, because that is how parameters work (not a bug, just my mistake). The different command didn't use the parameters at all.
Anyway, the different command took:

- `9.2 ms ± 0.2 ms` the first time running
- `9.7 ms ± 1.9 ms` the second time running
- `15.2 ms ± 3.7 ms` the third time running

Depending on which run I look at, it could be faster or about the same speed as most of the other commands I was testing. Hopefully my wording made sense here.
Very interesting. Thank you very much. Here is what a histogram of the first benchmark looks like (`python scripts/plot_histogram.py timings.json --bins=300`). It's not just two distinct peaks, but at least four(!):
Could you please tell us a bit more about the command you are running and maybe the system that you are running it on (if possible)? What kind of background processes are running (browser, spotify, dropbox)?
Also related: https://github.com/sharkdp/hyperfine/issues/481
Thanks for pointing out the 4 peaks. I am used to this having 2 peaks but not 4. Turns out that a daemon I had running in the background was turning turbo on and off. That is what was leading to the much larger distribution swing. That also explains why interleaving would stabilize the issue: turbo mode would be on for plenty long enough that all the commands would execute multiple times before turning back off, so all commands would see the same highs and lows from turbo mode.
Anyway, this still leaves me with 2 peaks that cause problems when comparing multiple versions of this command. I just collected a much longer trace with turbo mode always on. This recording was done in performance mode, with turbo always on, after a restart (and a minor delay), with the internet off, and only a terminal + htop open. I scanned through all of the running processes, and nothing looks more interesting than what would be found in a normal Ubuntu install. Generally speaking, the terminal windows had the highest CPU usage at about 2%, because this app is so short-running and does a lot of file IO. Extra note about my setup: I run ZFS as my file system, which I am sure creates a more complex caching and file IO scenario, but I am not sure if it is the root cause of the deviation.
For more context on the command: this is a compiler and linker running. So the base process is about what would be expected: load source files, process them until you have assembly, write that to the file system, do linking to get an executable. The language is a bit weird because it embeds into a binary compiled by another language. Due to this property, we have our own custom linker that surgically appends our output object onto an existing binary, updating what is needed. As such, we do a lot of file IO and memory mapping: copy the preprocessed binary to the output location, mmap it, load a metadata file, mmap our output bytes, copy the bytes over, do a bunch of random updates across the file to make sure it is ready, and finally dump it to disk.
We can also run with a traditional linker. Here are the timings from running with a regular linker. Looking at these timings, they are much, much smoother (though maybe still bimodal?). The regular linker actually does much cleaner file IO because it doesn't mmap and instead writes to disk linearly after building in memory.
Given how much more bimodal the surgical linker is, I would posit that all of the memory mapping and potentially random file io is causing the clear peaks, but that is really just a guess.
Very interesting. This is how all 10,000 benchmark runs (except for a few slow outliers outside of the y-axis range) look in sequence:
We can clearly see the bimodal distribution. We can also see that the pattern stays constant over time, except for a few interferences(?) around n=1550 and n=8700.
For comparison, this is what the regular linker benchmark looks like:
And for completeness, here are the two histograms:
Given that the bimodal pattern stays constant over time, I actually don't really see why an interleaved benchmark execution would help here. Maybe that's just the "true" runtime distribution (for whatever reason)?
So I know that it looks like a consistent bimodal distribution, but there is a correlation between points that are next to each other. Look at the timings and imagine 2 different scenarios. The first scenario (non-interleaved) is that the first 5000 points belong to one command and the last 5000 points belong to the other command. The second scenario (interleaved) is that the even points belong to one command and the odd points belong to the other command.
Here is the graph for the non-interleaved scenario:
Here is the graph of the interleaved scenario:
Looking at the 2 charts. For the non-interleaved scenario, the orange bars slightly trail the blue bars, while in the interleaved scenario, they are essentially even.
To be more specific on the numbers: in the non-interleaved scenario, the second half is slower than the first half by 0.5 ms with a standard error of 0.13 ms (outside of the standard error and theoretically significant). In the interleaved scenario, the results are essentially even, with a difference of 0.02 ms and a standard error of 0.12 ms (well within the standard error).
Note: this holds true even when removing the first 2000 points as a warmup.
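For reference, a small sketch of how this first-half/second-half versus even/odd comparison can be reproduced from a single exported run sequence (the file name and JSON layout are assumed to match hyperfine's `--export-json` output):

```python
import json
import sys

import numpy as np

# load all run times (seconds) of one command from a hyperfine JSON export
with open(sys.argv[1]) as f:
    times = np.asarray(json.load(f)["results"][0]["times"])

def summarize(x):
    # mean and standard error of the mean, in milliseconds
    return 1000 * x.mean(), 1000 * x.std(ddof=1) / np.sqrt(len(x))

half = len(times) // 2
scenarios = {
    "non-interleaved": (times[:half], times[half:]),  # first half vs second half
    "interleaved": (times[0::2], times[1::2]),        # even runs vs odd runs
}

for label, (a, b) in scenarios.items():
    (ma, sa), (mb, sb) = summarize(a), summarize(b)
    print(f"{label}: {ma:.2f} ± {sa:.2f} ms vs {mb:.2f} ± {sb:.2f} ms "
          f"(difference: {abs(ma - mb):.2f} ms)")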
I see, so there's a lot going on here :smile:. I'm not sure if we can trust non-robust estimators like mean and standard deviation here. The outliers have too large of an effect: here is a moving average view of all benchmark runs with a window size of num_runs = 1000 (note that I'm excluding the boundary regions here):
The plateaus in the data come from outliers way outside the visible range:
So these three bursts of outliers could also lead to the imbalance between the first and the second half in terms of mean run time.
If we look at robust estimators (median and median absolute deviation / MAD), the picture looks a bit different:
```python
>>> import numpy as np
>>> # times: array of all measured run times (in seconds) from the exported JSON
>>> def mad(X): return np.median(np.abs(X - np.median(X)))
...
>>> np.median(times[:5000]), mad(times[:5000])
(0.01861407567, 0.0008725065000000001)
>>> np.median(times[5000:]), mad(times[5000:])
(0.01871475567, 0.0009133250000000013)
```
Using MAD/sqrt(N) as a robust estimator for the "standard error", we then find:
first half: 18.61 ± 0.01 ms
second half: 18.71 ± 0.01 ms
which still indicates a (statistically significant) difference of 0.1 ms between first and second half.
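Spelled out, the numbers above come from something like the following continuation of the session (reusing the `times` array from the snippet above; note this is the plain MAD/sqrt(N) estimate, without a normal-consistency factor):

```python
import numpy as np

def mad(X):
    # median absolute deviation
    return np.median(np.abs(X - np.median(X)))

for label, half in [("first half", times[:5000]), ("second half", times[5000:])]:
    med_ms = 1000 * np.median(half)
    se_ms = 1000 * mad(half) / np.sqrt(len(half))   # MAD-based "standard error"
    print(f"{label}: {med_ms:.2f} ± {se_ms:.2f} ms")
```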
Wow! Thanks for the detailed analysis. It is really helpful to see.
I guess this leaves me with the question: what is the correct plan of action given all of this information?
Interleave and use the minimum measurement. :)
I don't think that would work well. I think my example falls into the same category as "networking" in the talk. The effects of file io matter to the performance. Just taking the minimum would remove a lot of the characteristics of the program that are extremely important. But interesting idea for certain classes of benchmarks. Also, I don't think interleaving matters if you are just taking the minimum.
> I think my example falls into the same category as "networking" in the talk. The effects of file io matter to the performance.
If you are doing I/O on a filesystem backed by a physical storage device, you are pulling in a number of very complex systems into the benchmark. I would suggest testing on an isolated tmpfs.
> Just taking the minimum would remove a lot of the characteristics of the program that are extremely important.
I agree that taking the minimum doesn't make a lot of sense if the tested code is nondeterministic. One such cause of nondeterminism is state which outlasts the benchmarked program's execution, which would include the state of the filesystem (in RAM and on disk), OS mechanisms like caching layers, and any backing physical devices. I think such a benchmark would overall just not be very meaningful, unless each invocation started from a clean state (i.e. the filesystem is created and mounted from scratch every time).
> Also, I don't think interleaving matters if you are just taking the minimum.
It can matter in case of a lag spike which would cover one benchmarked command's non-interleaved execution.
> I would suggest testing on an isolated tmpfs.
I definitely agree that can help, but I don't think that is a solution I want to take. The issue is that the complexities of the file system are a huge part of the performance here. If those complexities are removed, the results are not accurate. Linking is essentially all file IO. Yes, it does some other stuff and may even generate the final binary in memory before dumping, but there is no substantial CPU work. As such, taking advantage of a tmpfs can totally distort the performance picture.
If you would like to include measuring the performance of the filesystem implementation as part of your benchmark, I would suggest using a clean filesystem backed by RAM, fresh for every test.
If you would like to also include the physical storage device as part of your benchmark, I would suggest performing an ATA reset, and TRIM / ATA secure erase (or the NVMe equivalents) for flash devices, for every test.
Otherwise, each iteration depends on the previous iteration's state, plus any undefined initial state, plus interference with all other processes which interact with these systems. This just doesn't make for a very meaningful measurement - the results will be effectively unreproducible.
So I decided to dig into this for my application just to better understand what really causes the performance variations. Turns out that there are 3 core distributions to the execution speed of my applications:
To my surprise, running on a tmpfs backed by memory vs on zfs backed by ssd made very minimal differences to any of these distributions.
These results make me really wish syncing and dropping caches wasn't so slow. I think the "lightly" cached scenario would be a good target for general benchmarking of this app, but that takes astronomically longer to run than not clearing caches at all, only to record a slightly cleaner result. I might just have to do some sort of processing of the data to remove one of the peaks.
Thank you for the feedback on this.
> These results make me really wish syncing and dropping caches wasn't so slow.
How many runs do you perform? I would hope that 10-100 would be enough for reasonable statistics. Especially if you "control" the caching behavior. Not >1000 like you did for the full histograms above.
Do you know what the reason for the "lightly" and "heavily" cached scenarios is?
Why do we see the "lightly"-cached peak in the histograms at all? Shouldn't we mostly see the "heavily" cached scenario?
I apparently didn't save my timings. Probably put them in the git directory and later cleared them, but I still have some clarifications.
First, this is what 100 samples, heavily cached, looks like. You can see the rough form of the relative ratio of the 2 peaks. It takes about 8 and a half minutes to generate this graph with a warmup of 10 samples. In that time, I can generate approximately 50,000 samples with 5,000 warmup samples if I am not clearing the cache. So that is what I mean by it being really slow. Though I guess the main peak is in the right spot, so maybe it is valid to run fewer samples.
As an extra note, this is approximately what my graph with 2000 samples looked like. Hopefully the graph explains the rough idea of what I meant by the longer tail at the "lightly" cached peak.
As for the why: no clear idea, sadly. My only guess is that maybe the files that are loaded directly get cached differently than the files that get mmapped. So "lightly" cached may only mean the files being cached, and "heavily" cached may be when both files and mmaps are cached.
So, I did some more digging and found the cause of the bimodal distribution. It actually isn't caused by read caching at all. It is caused by synchronously flushing an mmap to disk. Apparently, that is bimodal.
If I change from a synchronous flush at the end of linking to an asynchronous flush, the bimodal distribution goes away. This sadly doesn't quite seem to be a full fix in my case, because it adds a weird behavior where there is about a 1 in 1000 chance that the execution will freeze for about 4 seconds. I assume this is due to the app trying to close before the mmap has finished flushing. No idea why that would freeze for so long, but it's probably a kernel problem of some sort.
Even when the mmap is of a RAM-backed file? That's interesting; maybe you could dump `/proc/<pid>/stack` to see why it's taking so long when it's taking over a millisecond. Same with the 4-second delay, perhaps it might be due to accumulated writes that are being flushed all at once.
So, I just had the realization that on my current machine `/tmp` isn't a tmpfs, so all of my tests that I thought were RAM-backed weren't. With a properly RAM-backed file, this is also fixed. Thanks so much for all of the help, even if it ended up not really being hyperfine related.
> Same with the 4-second delay, perhaps it might be due to accumulated writes that are being flushed all at once.
This looks to also be correct. I added `-p 'sync output_file'` and the giant 4-second-long flushing no longer happens.
Thanks a ton for helping me dig through this. Really goes to show how many things can mess up benchmarking.
I'm also in the laptop thermal throttle camp.
I have an ugly little workaround I use to avoid this issue. It works by using named pipes in the setup/prepare/cleanup functions to coordinate two `hyperfine` instances to take alternating turns, each waiting for the other to complete its next run:
# process "a"
hyperfine -r10 -s 'mkfifo a b || true; cat a' -p '>b; cat a' 'sleep 0.5'
# process "b" (from same working dir as above)
hyperfine -r10 -p '>a; cat b' -c '>a' 'sleep 0.2'
It only works if the number of iterations is fixed and equal on both sides.
There are probably nicer/fancier ways to do this, but the above solution works for me.