psf / pyperf

Toolkit to run Python benchmarks
http://pyperf.readthedocs.io/
MIT License

Increase default running time per process? #136

Open kmod opened 2 years ago

kmod commented 2 years ago

Using the default settings, pyperf aims to run 20 worker processes for ~600ms each, or, for implementations flagged as having a JIT, 6 worker processes for ~1600ms each.

Is there a strong reason for running so many subprocesses for such a short amount of time? It looks like the results are aggregated and process-to-process comparisons are dropped. 600ms/1600ms is a short amount of time when it comes to JIT warmup and in my view doesn't quite reflect the typical experience that users have.

I'd like to propose a new set of numbers, such as 3 worker processes for 4s each. (I'd even be in support of 1 worker process for 12s.) I'd also like to propose using this configuration regardless of whether the implementation has a JIT, since I put a higher weight on consistency than on using more processes when possible.
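For reference, here is a minimal sketch of how such a shape can already be expressed through pyperf's Runner (processes, values and warmups are existing Runner parameters, also exposed as the --processes/--values/--warmups command-line options); the exact numbers are only an illustration of the proposal, not a recommendation:

    import pyperf

    # pyperf's documented defaults are processes=20, values=3, warmups=1.
    # The proposed shape concentrates the same work in fewer processes.
    runner = pyperf.Runner(processes=3, values=20, warmups=1)
    runner.timeit("sort 1k floats",
                  stmt="sorted(data)",
                  setup="import random; data = [random.random() for _ in range(1000)]")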

What do you all think? I'm also curious what the cinder folks think, I saw @Orvid comment about this in https://github.com/facebookincubator/cinder/issues/74#issuecomment-1128033247

vstinner commented 2 years ago

On micro-benchmarks (values of less than 100 ns), each process has different performance depending on many things: environment variables, current working directory, address space layout (which is randomized on Linux: ASLR), the Python random hash seed (which indirectly changes the number of collisions in hash tables), etc.
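A minimal sketch of the effect, not pyperf's own code: time the same dict-lookup micro-benchmark in several fresh interpreters, each with a different PYTHONHASHSEED, and the per-process results typically differ by a few percent:

    import os
    import subprocess
    import sys
    import textwrap

    # Each child is a fresh interpreter with its own hash seed, so the dict
    # layout (and the number of hash collisions) differs from process to process.
    SNIPPET = textwrap.dedent("""
        import timeit
        keys = ['key%d' % i for i in range(1000)]
        d = dict.fromkeys(keys)
        best = min(timeit.repeat('for k in keys: d[k]', globals=globals(),
                                 number=1000, repeat=5))
        print('%.1f us for 1000 lookups' % (best / 1000 * 1e6))
    """)

    for seed in ("1", "2", "3", "4"):
        env = dict(os.environ, PYTHONHASHSEED=seed)
        result = subprocess.run([sys.executable, "-c", SNIPPET],
                                env=env, capture_output=True, text=True)
        print("PYTHONHASHSEED=%s ->" % seed, result.stdout.strip())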

pyperf is the result of my research on benchmarking: https://vstinner.github.io/category/benchmark.html

pyperf is not tuned for JIT compilers. I tried but failed to implement R's changepoint analysis in pyperf to decide when a benchmark looks "steady". I stopped my research at: https://vstinner.readthedocs.io/pypy_warmups.html

Sadly, it seems like nobody has tried to tune pyperformance for PyPy so far. PyPy still uses its own benchmark suite and its own benchmark runner.

If you want to change the default parameters, can you please show that it has limited or no impact on the reproducibility of results? My main concern is getting reproducible results, not really running benchmarks fast. But I'm also annoyed that a whole run of pyperformance is so slow. Reproducible means, for example, that if you run a benchmark 5 times on the same machine, rebooting the machine between runs, you get almost the same values (mean +- std dev).

For that, I like to use "pyperf dump" and "pyperf stats", to look at all values, not just the mean and std dev.
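For example, the per-run values that "pyperf dump" prints can also be inspected from Python (BenchmarkSuite, get_runs() and Run.values are part of pyperf's documented API; the file name below is just a placeholder):

    import pyperf

    suite = pyperf.BenchmarkSuite.load("bench.json")
    for bench in suite.get_benchmarks():
        values = bench.get_values()
        print("%s: %s values, mean %.1f ns +- %.1f ns"
              % (bench.get_name(), len(values),
                 bench.mean() * 1e9, bench.stdev() * 1e9))
        # One line per worker process, similar to "pyperf dump":
        for run in bench.get_runs():
            print("   ", [round(value * 1e9, 1) for value in run.values])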

On the other hand, I'm perfectly fine with having different parameters for JIT compilers. pyperf already has heuristics that are only enabled if a JIT compiler is detected. Currently, it's mostly about computing the number of warmups in the first (and maybe second) worker process.

vstinner commented 2 years ago

Ah, also, I don't want to be the gatekeeper of pyperf, I want it to be useful to most people :-) That's why I added co-maintainers to the project: @corona10 and @pablogsal who also care about Python performance.

vstinner commented 2 years ago

I'd like to propose a new set of numbers, such as 3 worker processes for 4s each.

On CPython with CPU isolation, in my experience, the 3 values per process (ignoring the first warmup) are almost the same. Computing more values per process wouldn't bring much benefit.

If you don't use CPU isolation, it can be different. With a JIT compiler, it's likely very different. Also, Python 3.10 optimizes LOAD_ATTR if you run a code object often enough, and Python 3.11 optimizes many more opcodes with the new "adaptive" bytecode design. So in recent years, CPython performance has also started to change depending on how many times you run a benchmark. It may also need more warmups ;-)
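If the adaptive interpreter does end up needing more warmup, that is already tunable per benchmark without touching the defaults; a minimal sketch (warmups is an existing Runner parameter, --warmups the matching command-line option, and 10 is an arbitrary number for illustration):

    import pyperf

    # Record extra warmup values so specialization/JIT warmup happens before
    # the measured values; warmups are reported but excluded from the statistics.
    runner = pyperf.Runner(warmups=10)
    runner.timeit("float attribute access", stmt="x.real", setup="x = 1.5")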

corona10 commented 2 years ago

@vstinner @kmod I have a neutral stance on this proposal. But as @vstinner commented, pyperf should not be tuned for a specific implementation. I fully understand that the Pyston project wants to show its best performance, but it looks like the pyperf project does not take the Pyston project's situation into account.

IMO, users should know that a JIT implementation needs warmup time, and that warmup should also be measurable and visible to end users through a benchmark, so I would like to suggest the following things.

WDYT?

vstinner commented 2 years ago

Very important paper in this field: https://arxiv.org/abs/1602.00602 "Virtual Machine Warmup Blows Hot and Cold" (2017).

markshannon commented 2 years ago

The number of times we run a benchmark and the duration of each run should be independent. The reason for running a benchmark multiple times is to get stable results. How long the individual runs are shouldn't matter, as long as the results are stable enough.

We do want some form of inter-process warmup (compiling pyc files, warming O/S file caches, etc) as that reduces noise, but allowing some VMs a free "warmup" time is nonsense.

We can have benchmarks of varying lengths. For example, three different web-server benchmarks: one that serves 10k requests, one that serves 100k requests, and one that serves 1M requests (or something like that). Stretching out the runtime of a benchmark by looping, or discounting "warmup", is deliberately misleading, IMO.

I agree with @kmod that many (all?) of the pyperformance benchmarks do not reflect user experience. The solution is to have better benchmarks, not to fudge the results.

If a JIT compiler has a long warmup, but is fast in the long run, we should show that, not just say it is fast.

kmod commented 2 years ago

@markshannon So to be clear, pyperf already treats JIT implementations differently from non-JIT ones, and I am advocating for getting rid of this distinction. I think a single set of numbers should be chosen; personally I think the JIT numbers (or higher) should be chosen, but I think choosing the non-JIT numbers for everyone would also be an improvement.

Also, I could have been clearer -- my proposal doesn't change the number of samples collected or the length of each sample, just the number of processes that those samples are spread across. Also, for what it's worth, pyperf already gives each process a short warmup period.
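Concretely, as rough arithmetic under pyperf's documented defaults (20 processes x 3 values = 60 values total, very roughly 100-200 ms per value plus one warmup per process): spreading the same 60 values over 3 processes gives 20 values per process, i.e. roughly 2-4 s per process, which lines up with the "3 worker processes for 4s each" figure above.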

@vstinner I disagree that reproducibility is the primary concern of benchmarking, because if that were true then "return 0" would be an ideal benchmarking methodology. The current interest in benchmarking comes from wanting to explain to users how their experience might change by switching to a newer Python implementation; I don't think users really care whether the number is "15% +- 1%" or "15% +- 0.1%", but they would care if the real number is actually "25% +- 1%" because the benchmarking methodology was not representative of their workload. I.e., I think accuracy is generally more important than precision, and that's the tradeoff I'm advocating for here. I could see the argument "Python processes run on average for 600ms, so that's why we should keep that number", but personally I believe that premise is false.

To put it another way: I think everything that's been said in this thread would also be an argument against increasing the runtime to 600ms if it were currently 300ms. So this thread seems to imply that we should actually decrease the amount of time per subprocess? For what it's worth, I believe pyperf's per-process execution time is a few orders of magnitude smaller than what everyone else uses, which suggests increasing it.