psf / pyperf

Toolkit to run Python benchmarks
http://pyperf.readthedocs.io/
MIT License

Unreliable results for identical code #106

Closed by jaraco 3 years ago

jaraco commented 3 years ago

In python/importlib_metadata#294, I'm employing pyperf to measure the performance difference between the locally checked-out code and the main branch. To my dismay, pyperf reports significant variance between the runs. By design, the commands run are identical and the Python executable is identical. The only difference is which copy of the code is installed into each environment, but the code is identical, since the PR is based on what was main at the time and there are no code changes.

One variable is that since both commands are run in .tox, the first run will see ./perf in the current directory and the second run will see ./perf and ./perf-ref in the current directory, but that seems unlikely to affect the performance of metadata handling.

Historically, I've had good reliability with simply python -m timeit. I expected to get similar reliability from pyperf.

You can replicate the results by checking out the project and running tox -e 'perf{,-ref}'.
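Roughly, the comparison boils down to something like the following (the timed statement and the JSON file names here are only placeholders to illustrate the shape of the workflow, not the actual tox configuration):

```
# benchmark the reference (main) environment, writing JSON results
python -m pyperf timeit -o perf-ref.json \
    -s "from importlib.metadata import distribution" \
    "distribution('pip')"

# benchmark the local checkout the same way
python -m pyperf timeit -o perf.json \
    -s "from importlib.metadata import distribution" \
    "distribution('pip')"

# compare the two result files
python -m pyperf compare_to perf-ref.json perf.json
```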

I'm really excited about the prospect of relying on pyperf to provide the comparison reports of main against local, but right now my optimism is dashed.

I skimmed through the tuning docs, but they seem highly platform-specific and don't explain why pyperf timeit is getting jittery results where timeit does not.

It's quite possible, even likely, that I'm missing something obvious, so I humbly ask for advice. Is there any easy way to make the performance results more stable in a cross-platform way?
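For instance, is the expectation that I run the system tuning and/or just throw more processes and values at the problem? Something like the following, if I'm reading the docs right (flags shown are illustrative, not something I've verified on this benchmark):

```
# platform-specific system tuning (needs privileges)
python -m pyperf system tune

# or trade runtime for stability: more processes, more values per process
python -m pyperf timeit -p 40 -n 10 "pass"
```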

vstinner commented 3 years ago

> I skimmed through the tuning docs, but they seem highly platform-specific and don't explain why pyperf timeit is getting jittery results where timeit does not.

Well, measuring performance is a hard problem. timeit hides problems since it only measures 5 values and always returns the minimum. pyperf shows you the hard truth: real numbers :-) I suggest digging into https://pyperf.readthedocs.io/en/latest/analyze.html and trying to see what's going on with your benchmark.
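For example, something like this is a reasonable starting point (perf.json is a placeholder for whichever JSON result file your run produces):

```
# summary statistics: mean, std dev, min/max, percentiles for one result file
python -m pyperf stats perf.json

# warn about anything suspicious, e.g. unstable results or an untuned system
python -m pyperf check perf.json
```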

For example, if you get a bi-modal distribution, is it because some processes ("runs") are always slower and some processes are always faster? pyperf doesn't explain anything, it only provides you the tooling so you can investigate :-)
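To check for that, something along these lines (again, the filename is a placeholder):

```
# histogram of all values, which makes a bi-modal distribution visible
python -m pyperf hist perf.json

# raw values grouped per run/process, to spot consistently slow processes
python -m pyperf dump perf.json
```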

If you don't like the plain and cold truth, use timeit ;-)

jaraco commented 3 years ago

I see. Well, I do get a lot of benefits from pyperf that aren't in timeit. I enumerated some in python/importlib_metadata#305. But it sounds like the pyperf project is happy with what it provides, inadequate as it is for my use case, so I'll close this issue. Thanks for the feedback.