Statement-level performance

Among high-level programming languages, Python is very flexible and easy to write. There are often several ways to do the same thing, and their performance varies. How can we measure that performance and choose the right way?

I will approach this in two steps. First, I run some benchmarks. Second, I analyze the benchmark results.

Benchmark

Benchmark tool

In Python, we can use timeit for statement-level performance analysis. It is a module in the Python standard library, so nothing extra needs to be installed.
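As a minimal sketch of how such a measurement can be scripted, the snippet below times a function with timeit.repeat and converts the result to nanoseconds per call. The helper name measure_ns and the repeat/number defaults are my own assumptions for illustration, not the exact setup used for the numbers in this post.

```python
import timeit


def empty():
    pass


def measure_ns(func, repeat=5, number=100_000):
    """Return the best per-call time of func() in nanoseconds.

    measure_ns is a hypothetical helper; repeat and number are arbitrary defaults.
    """
    # timeit.repeat accepts a callable and returns one total time (in seconds) per run.
    totals = timeit.repeat(func, repeat=repeat, number=number)
    # The minimum is the run that was least disturbed by other processes.
    return min(totals) / number * 1e9


if __name__ == "__main__":
    print(f"empty: {measure_ns(empty):.1f} ns per call")
```

Passing a callable makes timeit add one extra call layer of its own, so absolute numbers from this sketch will not match other measurement setups exactly; the relative ordering between functions is what matters.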

Benchmark function

empty

def empty():
    pass

load global attribute

def load_str():
    str

load local attribute

def load_str_local(str=str):
    str

builtin format string

def f_str(v):
    f'{v}'

call builtin str function

def convert_str(v):
    str(v)

call builtin str function from local

def convert_str_local(v, str=str):
    str(v)

call python str function from local

def py_str(v):
    return str(v)

def convert_py_str_local(v, str=py_str):
    str(v)

Benchmark result

empty function

88 ns ± 1.36 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

load global attribute

113 ns ± 0.373 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

load local attribute

99.3 ns ± 4.2 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

builtin format string

110 ns ± 7.85 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

call builtin str function

251 ns ± 7.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

call builtin str function from local

213 ns ± 6.29 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

call python str function from local

346 ns ± 11.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

I ran these benchmarks with Python 3.7.1 on my local macOS machine. The timings depend on the Python version and the speed of the machine, so it is normal to see different results. From these numbers I can draw some conclusions:

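Calling an empty function still has overhead (about 88 ns here).
Retrieving a local name is faster than retrieving a global one.
Sometimes an instruction-level operation, such as the f-string, is faster than the equivalent function call.
Calling a Python function, which needs a new stack frame, adds considerable overhead.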
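To make the comparison easier to read, the empty-function time can be treated as a baseline and subtracted from the other measurements. The sketch below does that arithmetic with the mean values reported above; the dictionary keys are the benchmark function names, and net_cost is a hypothetical helper. The standard deviations are ignored, so small differences should be read as rough estimates only.

```python
# Mean timings in nanoseconds, copied from the benchmark results above.
MEAN_NS = {
    "empty": 88.0,
    "load_str": 113.0,
    "load_str_local": 99.3,
    "f_str": 110.0,
    "convert_str": 251.0,
    "convert_str_local": 213.0,
    "convert_py_str_local": 346.0,
}


def net_cost(mean_ns, baseline="empty"):
    """Subtract the baseline measurement from every other measurement."""
    base = mean_ns[baseline]
    return {
        name: round(ns - base, 1)
        for name, ns in mean_ns.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, ns in net_cost(MEAN_NS).items():
        print(f"{name:22s} {ns:6.1f} ns above the empty call")
```

Read this way, the global lookup costs roughly 25 ns on top of the call itself and the local lookup roughly 11 ns, while the call to the Python-level py_str costs roughly 258 ns.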

Next, let's step into some details.

Inspection tool

The benchmark results above give us some timings, but they do not explain what actually happens in each case. Fortunately, there is another useful tool, dis, a disassembler module that shows what the Python code does at the virtual machine level. With the dis module we can go one step further.
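As a small sketch of how a listing can be produced, the snippet below disassembles one of the benchmark functions and captures the output as text. The helper name disassemble is my own; dis.dis and its file parameter are part of the standard library. Opcode names and offsets differ between Python versions, so the output on a newer interpreter will not match a Python 3.7 listing exactly.

```python
import dis
import io


def load_str():
    str


def disassemble(func):
    """Return the dis listing of func as a string (hypothetical helper)."""
    buf = io.StringIO()
    # dis.dis prints to stdout by default; file= redirects the listing.
    dis.dis(func, file=buf)
    return buf.getvalue()


if __name__ == "__main__":
    print(disassemble(load_str))
```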

Disassembly result

empty function

  2          0 LOAD_CONST               0 (None)
              2 RETURN_VALUE

load global attribute

  2          0 LOAD_GLOBAL              0 (str)
              2 POP_TOP
              4 LOAD_CONST               0 (None)
              6 RETURN_VALUE

load local attribute

  2          0 LOAD_FAST                0 (str)
              2 POP_TOP
              4 LOAD_CONST               0 (None)
              6 RETURN_VALUE

builtin format string

  2          0 LOAD_FAST                0 (v)
              2 FORMAT_VALUE             0
              4 POP_TOP
              6 LOAD_CONST               0 (None)
              8 RETURN_VALUE

call builtin str function

  2          0 LOAD_GLOBAL              0 (str)
              2 LOAD_FAST                0 (v)
              4 CALL_FUNCTION            1
              6 POP_TOP
              8 LOAD_CONST               0 (None)
             10 RETURN_VALUE

call builtin str function from local

  5          0 LOAD_FAST                1 (str)
              2 LOAD_FAST                0 (v)
              4 CALL_FUNCTION            1
              6 POP_TOP
              8 LOAD_CONST               0 (None)
             10 RETURN_VALUE

call python str function from local

  5          0 LOAD_FAST                1 (str)
              2 LOAD_FAST                0 (v)
              4 CALL_FUNCTION            1
              6 POP_TOP
              8 LOAD_CONST               0 (None)
             10 RETURN_VALUE

I won't go through the disassembled output in detail here. Instead, I compare the corresponding bytecode to build some intuition.

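LOAD_FAST is faster than LOAD_GLOBAL.
FORMAT_VALUE does a great job: it converts the value without the name lookup and the CALL_FUNCTION instruction.
CALL_FUNCTION carries call overhead, and calling a Python function such as py_str additionally pays for a new stack frame.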
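The comparison between two listings can also be done programmatically. The following sketch collects the opcode names of two functions with dis.get_instructions and reports the ones that appear in only one of them. The helper names opnames and only_in_each are hypothetical, and the exact opcode names depend on the Python version in use.

```python
import dis


def load_str():
    str


def load_str_local(str=str):
    str


def opnames(func):
    """Return the set of opcode names used by func."""
    return {ins.opname for ins in dis.get_instructions(func)}


def only_in_each(f, g):
    """Return (opcodes only in f, opcodes only in g)."""
    a, b = opnames(f), opnames(g)
    return a - b, b - a


if __name__ == "__main__":
    only_global, only_local = only_in_each(load_str, load_str_local)
    print("only in load_str:      ", sorted(only_global))
    print("only in load_str_local:", sorted(only_local))
```

For this pair the only difference is how the name str is loaded, which matches the small gap between the two timings.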

Conclusion

As shown above, timeit and dis make it straightforward to measure and explain statement-level performance. In summary:

overhead

repeated access to global names
Python function call overhead
stack frame setup overhead
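One common way to act on the first item is to bind a global name to a local variable before a hot loop. The sketch below shows the pattern; the function names and the alias to_str are hypothetical. Newer CPython releases have optimized global lookups, so the gain may be small, and it is worth measuring with timeit before adopting this style.

```python
def stringify_global(values):
    result = []
    for v in values:
        # `str` is looked up as a global/builtin name on every iteration.
        result.append(str(v))
    return result


def stringify_local(values):
    # Hypothetical local alias: one global lookup here, local lookups in the loop.
    to_str = str
    result = []
    for v in values:
        result.append(to_str(v))
    return result


if __name__ == "__main__":
    data = [1, "a", 2.5]
    print(stringify_global(data))
    print(stringify_local(data))
```

Both functions return the same list; only the way the name is resolved inside the loop differs.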

tool

timeit
dis

Reference

timeit
dis

universe-proton / universe-topology

The python statement level performance analysis. #18