Open sporksmith opened 2 years ago
Using https://stackoverflow.com/a/64863392 as an example, I tested using perf_event_open to count instructions executed.
It looks like this could work, with a couple caveats:
/proc/sys/kernel/perf_event_paranoid
My test program measured the number of instructions taken to execute a function call, and then measured it again in a loop until it got a different count from the original count. Unfortunately it does seem to occasionally get different counts:
$ sudo ./perf_event_open.out
First execution used 18 instructions
Iteration 861 got 19 instead of 18
$ sudo ./perf_event_open.out
First execution used 18 instructions
Iteration 230 got 19 instead of 18
$ sudo ./perf_event_open.out
First execution used 18 instructions
Iteration 3661 got 19 instead of 18
$ sudo ./perf_event_open.out
First execution used 18 instructions
$ sudo ./perf_event_open.out
First execution used 18 instructions
Iteration 775 got 19 instead of 18
Here's the code:
#define _GNU_SOURCE
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <errno.h>
#include <inttypes.h>
#include <sys/types.h>
static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
int cpu, int group_fd, unsigned long flags)
{
int ret;
ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
group_fd, flags);
return ret;
}
long long f(int fd, int n) {
ioctl(fd, PERF_EVENT_IOC_RESET, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
/* Loop n times, should be good enough for -O0. */
__asm__ (
"1:;\n"
"sub $1, %[n];\n"
"jne 1b;\n"
: [n] "+r" (n)
:
:
);
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
long long count;
read(fd, &count, sizeof(long long));
return count;
}
int
main(int argc, char **argv)
{
struct perf_event_attr pe;
int fd;
uint64_t n;
if (argc > 1) {
n = strtoll(argv[1], NULL, 0);
} else {
n = 10000;
}
memset(&pe, 0, sizeof(struct perf_event_attr));
pe.type = PERF_TYPE_HARDWARE;
pe.size = sizeof(struct perf_event_attr);
pe.config = PERF_COUNT_HW_INSTRUCTIONS;
pe.disabled = 1;
pe.exclude_kernel = 1;
// Don't count hypervisor events.
pe.exclude_hv = 1;
fd = perf_event_open(&pe, 0, -1, -1, 0);
if (fd == -1) {
fprintf(stderr, "Error opening leader %llx: %s\n", pe.config, strerror(errno));
exit(EXIT_FAILURE);
}
const long long first_count = f(fd, 1);
printf("First execution used %lld instructions\n", first_count);
for (uint64_t i =0; i <n; ++i) {
const long long count = f(fd, 1);
if (count != first_count) {
printf("Iteration %ld got %lld instead of %lld\n", i, count, first_count);
exit(1);
}
}
close(fd);
}
There's another counter for context switches. It might be interesting to check whether the discrepancies happen during measurements where there was one. If so, that might be a solvable bug in the kernel. The API is supposed to only count instructions executed by the current thread in user-space, but I imagine this could be tricky to get right.
From perf_event_open(2)
Be careful, these can be affected by various issues, most notably hardware interrupt counts.
So, seems unlikely we can get a deterministic count.
As in the current implementation we could maybe make it "closer" to deterministic by rounding at some granularity, but we'd still have nondeterminism when the "true" measurement is near the edge of the rounding granularity.
This could still be a useful feature, but I think enabling it would necessarily give up determinism.
For completeness I guess I should mention it should be possible to get a deterministic instruction count by emulating the CPU, e.g. by using valgrind or pin, but this would introduce quite a bit of complexity and performance cost.
I suspect that in most cases, a cheaper but nondeterministic measurement would be more useful.
I used to work near a VM record/replay implementation that relied on using perf counters deterministically, and while I was digging for references related to that I found something even better: mozilla implemented a very similar approach in userspace, using retired branch counts. This is the best documentation I've seen on the approach so far.
https://robert.ocallahan.org/2014/03/introducing-rr.html https://arxiv.org/pdf/1705.05937.pdf https://rr-project.org/ https://github.com/rr-debugger/rr/blob/master/src/counters-test/counters.c
@scanlime thanks! Yeah I agree it's worth taking a closer look at rr's approach. It's definitely nice to have an option more deterministic than the other performance counters.
Executing instructions on the emulated CPU currently consumes 0 time in a Shadow simulation. This can lead to inaccuracy in applications that are CPU-intensive.
It can also lead to deadlock in applications with busy loops / spin loops (#1792), though modeling time taken by syscalls and VDSO calls is probably sufficient for most instances of that issue.
Ideally we'd like to model CPU time in a way that preserves determinism, but optional features that model CPU time at the cost of determinism would also be helpful.