Direct variable reference as function argument runs faster?

justdoit0823 commented 7 years ago

Python programs call functions constantly, and a small question comes up again and again: should we use an object's attribute value directly in an expression or as a function argument, or first bind it to a local variable and use that? This is a question of coding style as well as performance, so I decided to test it.

The following is a simple test program.


import click
import time

@click.group()
def main():
    pass

class A:

    x = 1231

def direct_call(a_obj):
    return a_obj.x * 2

def indirect_call(a_obj):
    x = a_obj.x
    return x * 2

@main.command('run')
@click.argument('direct', type=int, default=0)
@click.argument('loop', type=int, default=100)
@click.argument('count', type=int, default=10000)
def run(**kwargs):
    direct = kwargs['direct']
    loop = kwargs['loop']
    count = kwargs['count']

    if direct:
        run_direct_reference(loop, count)
    else:
        run_indirect_reference(loop, count)

def benchmark(loop, count, func, time_func):
    t_s_time = time_func()
    durations = []
    for idx in range(loop):
        s_time = time_func()

        for c_idx in range(count):
            func(A)

        e_time = time_func()
        durations.append(e_time - s_time)

    t_e_time = time_func()
    print(
        'test {0} loops and iterate count {1}, max time {2}s, avg time {3}s, '
        'min time {4}s, total run time {5}s'.format(
            loop, count, max(durations), sum(durations) / loop, min(durations),
            (t_e_time - t_s_time)))

def run_direct_reference(loop, count):
    benchmark(loop, count, direct_call, time.time)

def run_indirect_reference(loop, count):
    benchmark(loop, count, indirect_call, time.time)

if __name__ == '__main__':
    main()

And here are the results.

Direct reference

test 100 loops and iterate count 1000000, max time 0.2771177291870117s, avg time 0.20914509773254394s, min time 0.1944730281829834s, total run time 20.91534113883972s
test 100 loops and iterate count 1000000, max time 0.25256776809692383s, avg time 0.2039675760269165s, min time 0.18776655197143555s, total run time 20.39756941795349s
test 100 loops and iterate count 1000000, max time 0.26304054260253906s, avg time 0.20296960353851318s, min time 0.1906893253326416s, total run time 20.297788381576538s
test 100 loops and iterate count 1000000, max time 0.2738807201385498s, avg time 0.21081802129745483s, min time 0.18758440017700195s, total run time 21.08248805999756s
test 100 loops and iterate count 1000000, max time 0.2603030204772949s, avg time 0.1936710500717163s, min time 0.1795177459716797s, total run time 19.367736101150513s
test 100 loops and iterate count 1000000, max time 0.24562525749206543s, avg time 0.20933830738067627s, min time 0.19553494453430176s, total run time 20.93464183807373s
test 100 loops and iterate count 1000000, max time 0.24544525146484375s, avg time 0.20130863904953003s, min time 0.18932890892028809s, total run time 20.131604433059692s
test 100 loops and iterate count 1000000, max time 0.2830989360809326s, avg time 0.20246000289916993s, min time 0.18879389762878418s, total run time 20.24679183959961s
test 100 loops and iterate count 1000000, max time 0.25501537322998047s, avg time 0.19920070409774782s, min time 0.18275785446166992s, total run time 19.92082667350769s
test 100 loops and iterate count 1000000, max time 0.23061442375183105s, avg time 0.2016540765762329s, min time 0.18910765647888184s, total run time 20.16617774963379s

Indirect reference

test 100 loops and iterate count 1000000, max time 0.26860737800598145s, avg time 0.21192786931991578s, min time 0.19990801811218262s, total run time 21.193568468093872s
test 100 loops and iterate count 1000000, max time 0.2477259635925293s, avg time 0.19287943124771117s, min time 0.1872403621673584s, total run time 19.288273096084595s
test 100 loops and iterate count 1000000, max time 0.25309062004089355s, avg time 0.21047576427459716s, min time 0.1984086036682129s, total run time 21.0483660697937s
test 100 loops and iterate count 1000000, max time 0.24587202072143555s, avg time 0.2007625937461853s, min time 0.18789291381835938s, total run time 20.076935052871704s
test 100 loops and iterate count 1000000, max time 0.26269960403442383s, avg time 0.20238703727722168s, min time 0.1886892318725586s, total run time 20.23936438560486s
test 100 loops and iterate count 1000000, max time 0.26854991912841797s, avg time 0.2010058069229126s, min time 0.19346213340759277s, total run time 20.10107970237732s
test 100 loops and iterate count 1000000, max time 0.24597835540771484s, avg time 0.2116751766204834s, min time 0.19957304000854492s, total run time 21.168314218521118s
test 100 loops and iterate count 1000000, max time 0.35762667655944824s, avg time 0.20729455947875977s, min time 0.19134306907653809s, total run time 20.73020911216736s
test 100 loops and iterate count 1000000, max time 0.26679420471191406s, avg time 0.2129973030090332s, min time 0.18769025802612305s, total run time 21.300573110580444s
test 100 loops and iterate count 1000000, max time 0.25857067108154297s, avg time 0.1991182065010071s, min time 0.19130969047546387s, total run time 19.91228675842285s

I ran this test on an idle Linode machine (Linux localhost 4.9.15-x86_64-linode81 #1 SMP Fri Mar 17 09:47:36 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux). The running times of the two versions are almost the same.
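As a rough check of how close the two sets of runs are, the per-run average times listed above can be summarized with the standard library. This is only a sketch: the values are copied from the results above and rounded to six decimals, and the names direct_avgs, indirect_avgs and summarize are made up for this illustration.

```python
import statistics

# Per-run "avg time" values (seconds per 1,000,000 calls), copied from the
# results above and rounded to six decimals.
direct_avgs = [
    0.209145, 0.203968, 0.202970, 0.210818, 0.193671,
    0.209338, 0.201309, 0.202460, 0.199201, 0.201654,
]
indirect_avgs = [
    0.211928, 0.192879, 0.210476, 0.200763, 0.202387,
    0.201006, 0.211675, 0.207295, 0.212997, 0.199118,
]

CALLS_PER_LOOP = 1_000_000  # the "iterate count" used in the runs above


def summarize(values):
    """Return (mean, sample standard deviation) of the per-run averages."""
    return statistics.mean(values), statistics.stdev(values)


direct_mean, direct_sd = summarize(direct_avgs)
indirect_mean, indirect_sd = summarize(indirect_avgs)

# Convert the difference between the two means into nanoseconds per call.
diff_ns_per_call = (indirect_mean - direct_mean) / CALLS_PER_LOOP * 1e9

print(f"direct:   mean {direct_mean:.6f}s, stdev {direct_sd:.6f}s")
print(f"indirect: mean {indirect_mean:.6f}s, stdev {indirect_sd:.6f}s")
print(f"difference: {diff_ns_per_call:.2f} ns per call")
```

If the difference between the two means is smaller than the run-to-run standard deviation, this benchmark cannot separate the two versions, which matches the impression that they take almost the same time.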

Looking at what the two versions actually execute, the direct reference should be slightly faster in theory, but the timings above do not show it clearly.

Here is the bytecode of direct reference call.

In [6]: dis.dis(direct_call)
  2           0 LOAD_FAST                0 (a_obj)
              2 LOAD_ATTR                0 (x)
              4 LOAD_CONST               1 (2)
              6 BINARY_MULTIPLY
              8 RETURN_VALUE

And here is the bytecode of indirect reference call.

In [7]: dis.dis(indirect_call)
  6           0 LOAD_FAST                0 (a_obj)
              2 LOAD_ATTR                0 (x)
              4 STORE_FAST               1 (x)

  7           6 LOAD_FAST                1 (x)
              8 LOAD_CONST               1 (2)
             10 BINARY_MULTIPLY
             12 RETURN_VALUE

The indirect version has two more instructions than the direct version: STORE_FAST 1 (x) at offset 4 and LOAD_FAST 1 (x) at offset 6. They store the attribute value in a temporary local variable and then load it back as the operand of the multiplication.
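The same comparison can be made programmatically instead of reading the two listings by eye. The following is a minimal sketch that uses dis.get_instructions; the helper names opnames and local_stores are hypothetical. Exact opcode names vary between CPython versions (newer releases may fuse or rename some instructions), so the sketch only looks for names containing STORE_FAST rather than assuming the listing above.

```python
import dis


class A:
    x = 1231


def direct_call(a_obj):
    return a_obj.x * 2


def indirect_call(a_obj):
    x = a_obj.x
    return x * 2


def opnames(func):
    """Return the opcode names of func's bytecode, in order."""
    return [ins.opname for ins in dis.get_instructions(func)]


def local_stores(func):
    """Return the instructions that store into a local variable.

    Substring matching is used so that fused variants of STORE_FAST,
    which some CPython versions emit, are counted as well.
    """
    return [name for name in opnames(func) if "STORE_FAST" in name]


print("direct:  ", opnames(direct_call))
print("indirect:", opnames(indirect_call))
print("extra local stores in indirect:", local_stores(indirect_call))
```

On any CPython version the indirect function should contain a store to the local x that the direct function lacks, which is the extra work discussed above.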

From these results, performance gives no reason to prefer either way of writing the code. But I think the version with an additional local variable, the indirect reference, is more readable.

Oh, a coding style war is coming.

harveyqing commented 7 years ago

At least for me, it just depends on the complexity of the intermediate expressions.

For complex expressions, using intermediate variables keeps the logic simple and clear.

justdoit0823 commented 7 years ago

Yes, and there is another situation: the same attribute may be referenced again later in the function. With direct references, each use generates another LOAD_ATTR instruction, which is somewhat more expensive than loading a local variable. In that situation I would use the indirect reference. Can anyone give a detailed test?
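To make the repeated-reference case concrete, here is a small sketch that counts LOAD_ATTR instructions when an attribute is used three times. The functions repeated_access and cached_access and the helper count_load_attr are made up for this illustration; it counts instructions only and is not a timing test.

```python
import dis


class A:
    x = 1231


def repeated_access(a_obj):
    # Every use of a_obj.x performs its own attribute lookup.
    return a_obj.x + a_obj.x + a_obj.x


def cached_access(a_obj):
    # The attribute is looked up once and then reused as a local variable.
    x = a_obj.x
    return x + x + x


def count_load_attr(func):
    """Count the LOAD_ATTR instructions in func's bytecode."""
    return sum(
        1 for ins in dis.get_instructions(func) if ins.opname == "LOAD_ATTR"
    )


print("repeated_access:", count_load_attr(repeated_access), "LOAD_ATTR")
print("cached_access:  ", count_load_attr(cached_access), "LOAD_ATTR")
```

The more often the attribute is reused, the more lookups the local variable saves, so any advantage of the indirect style should grow with the number of uses. Whether that shows up in wall-clock time still needs to be measured.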

justdoit0823 commented 7 years ago

According to masklinn, the test program above suffers from some jitter, and I should write the benchmark with perf instead. The details follow.

direct_bench.py


import perf

class A:

    x = 1231

def direct_call(a_obj):
    return a_obj.x * 2

def main():
    runner = perf.Runner()
    runner.bench_func('run direct benchmark', direct_call, A)

if __name__ == '__main__':
    main()

indirect_bench.py


import perf

class A:

    x = 1231

def indirect_call(a_obj):
    x = a_obj.x
    return x * 2

def main():
    runner = perf.Runner()
    runner.bench_func('run indirect benchmark', indirect_call, A)

if __name__ == '__main__':
    main()

Here are the perf stats on the same machine.

direct_bench stat detail

Total duration: 12.4 sec
Start date: 2017-08-06 07:44:31
End date: 2017-08-06 07:44:45
Raw value minimum: 124 ms
Raw value maximum: 182 ms

Number of calibration run: 1
Number of run with values: 20
Total number of run: 21

Number of warmup per run: 1
Number of value per run: 3
Loop iterations per value: 2^19
Total number of values: 60

Minimum:         236 ns
Median +- MAD:   266 ns +- 26 ns
Mean +- std dev: 264 ns +- 26 ns
Maximum:         347 ns

  0th percentile: 236 ns (-11% of the mean) -- minimum
  5th percentile: 236 ns (-11% of the mean)
 25th percentile: 238 ns (-10% of the mean) -- Q1
 50th percentile: 266 ns (+0% of the mean) -- median
 75th percentile: 279 ns (+5% of the mean) -- Q3
 95th percentile: 310 ns (+17% of the mean)
100th percentile: 347 ns (+31% of the mean) -- maximum

Number of outlier (out of 177 ns..339 ns): 1

indirect_bench stat detail

Total duration: 13.0 sec
Start date: 2017-08-06 07:45:30
End date: 2017-08-06 07:45:45
Raw value minimum: 128 ms
Raw value maximum: 182 ms

Number of calibration run: 1
Number of run with values: 20
Total number of run: 21

Number of warmup per run: 1
Number of value per run: 3
Loop iterations per value: 2^19
Total number of values: 60

Minimum:         244 ns
Median +- MAD:   281 ns +- 18 ns
Mean +- std dev: 278 ns +- 28 ns
Maximum:         347 ns

  0th percentile: 244 ns (-12% of the mean) -- minimum
  5th percentile: 245 ns (-12% of the mean)
 25th percentile: 246 ns (-11% of the mean) -- Q1
 50th percentile: 281 ns (+1% of the mean) -- median
 75th percentile: 293 ns (+5% of the mean) -- Q3
 95th percentile: 328 ns (+18% of the mean)
100th percentile: 347 ns (+25% of the mean) -- maximum

Number of outlier (out of 176 ns..363 ns): 0

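These stats give a more precise picture of the speed difference between the two bytecode sequences. The direct reference is slightly faster at the nanosecond level: a median of 266 ns versus 281 ns, a difference of about 15 ns (roughly 5%) per call, although that gap is smaller than the spread reported for either benchmark.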
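For anyone who wants to repeat the comparison without installing perf, a rough version can be written with the standard library's timeit module. This is a sketch under assumptions, not a replacement for the perf runs above: it times both functions in one process, keeps the best of a few repeats to reduce jitter, and the helper best_ns_per_call and its default parameters are hypothetical.

```python
import timeit


class A:
    x = 1231


def direct_call(a_obj):
    return a_obj.x * 2


def indirect_call(a_obj):
    x = a_obj.x
    return x * 2


def best_ns_per_call(func, number=200_000, repeat=5):
    """Return the best observed time per call of func(A), in nanoseconds.

    The minimum over several repeats is used because slower repeats are
    usually caused by interference from the rest of the system.
    """
    timings = timeit.repeat(
        "func(A)",
        globals={"func": func, "A": A},
        number=number,
        repeat=repeat,
    )
    return min(timings) / number * 1e9


if __name__ == "__main__":
    print(f"direct:   {best_ns_per_call(direct_call):.1f} ns per call")
    print(f"indirect: {best_ns_per_call(indirect_call):.1f} ns per call")
```

The absolute numbers include the overhead of the function call itself and will differ between machines and Python versions, so only the relative difference between the two lines is meaningful.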
