vpelletier / pprofile

Line-granularity, thread-aware deterministic and statistic pure-python profiler
GNU General Public License v2.0
444 stars 28 forks

wildly incorrect relative performance of generator expression vs map, off by 2 orders of magnitude #40

Open inkerman opened 3 years ago

inkerman commented 3 years ago

Wall-time testing puts the relative time spent in a generator expression, list comprehension, map, and loop doing the same job at roughly 1.3 : 1.6 : 1 : 1.8, while pprofile reports 102 (!) : 32 : 1 : 60:

Command line: genexpr-vs-map
Total duration: 1098.69s
File: genexpr-vs-map
File duration: 1098.68s (100.00%)
Line #|      Hits|         Time| Time per hit|      %|Source code
------+----------+-------------+-------------+-------+-----------
     1|         0|            0|            0|  0.00%|#!/usr/bin/python3
     2|         0|            0|            0|  0.00%|
     3|         2|  0.000146389|  7.31945e-05|  0.00%|import sys
     4|         0|            0|            0|  0.00%|
     5|         2|  4.60148e-05|  2.30074e-05|  0.00%|def profile_generator_expression():
     6|  20000003|       571.97|  2.85985e-05| 52.06%|    return tuple(str(n) for n in r)
(call)|  10000001|      293.337|  2.93337e-05| 26.70%|# genexpr-vs-map:6 <genexpr>
     7|         0|            0|            0|  0.00%|
     8|         2|  6.38962e-05|  3.19481e-05|  0.00%|def profile_list_comprehension():
     9|  10000003|      181.039|  1.81039e-05| 16.48%|    return tuple([str(n) for n in r])
(call)|         1|      180.307|      180.307| 16.41%|# genexpr-vs-map:9 <listcomp>
    10|         0|            0|            0|  0.00%|
    11|         2|  6.77109e-05|  3.38554e-05|  0.00%|def profile_map():
    12|         1|      5.55134|      5.55134|  0.51%|    return tuple(map(str, r))
    13|         0|            0|            0|  0.00%|
    14|         2|  6.55651e-05|  3.27826e-05|  0.00%|def profile_loop():
    15|         1|  2.71797e-05|  2.71797e-05|  0.00%|    result = []
    16|  10000001|      159.363|  1.59363e-05| 14.50%|    for n in r:
    17|  10000000|      177.635|  1.77635e-05| 16.17%|        result.append(str(n))
    18|         1|     0.415125|     0.415125|  0.04%|    return tuple(result)
    19|         0|            0|            0|  0.00%|
    20|         1|  3.19481e-05|  3.19481e-05|  0.00%|r = range(10 ** 7)
    21|         0|            0|            0|  0.00%|
    22|         1|  2.64645e-05|  2.64645e-05|  0.00%|profile = (
    23|         1|  2.74181e-05|  2.74181e-05|  0.00%|    profile_generator_expression,
    24|         1|  2.64645e-05|  2.64645e-05|  0.00%|    profile_list_comprehension,
    25|         1|   2.6226e-05|   2.6226e-05|  0.00%|    profile_map,
    26|         1|  2.64645e-05|  2.64645e-05|  0.00%|    profile_loop,
    27|         0|            0|            0|  0.00%|)
    28|         0|            0|            0|  0.00%|
    29|         1|  2.95639e-05|  2.95639e-05|  0.00%|if len(sys.argv) == 1:
    30|         1|  2.67029e-05|  2.67029e-05|  0.00%|    arg = '0123'
    31|         0|            0|            0|  0.00%|else:
    32|         0|            0|            0|  0.00%|    arg, = sys.argv[1:]
    33|         0|            0|            0|  0.00%|
    34|         5|   0.00044775|    8.955e-05|  0.00%|for ch in arg:
    35|         4|      2.70997|     0.677493|  0.25%|    profile[int(ch)]()
(call)|         1|       571.97|       571.97| 52.06%|# genexpr-vs-map:5 profile_generator_expression
(call)|         1|       181.04|       181.04| 16.48%|# genexpr-vs-map:8 profile_list_comprehension
(call)|         1|      5.55138|      5.55138|  0.51%|# genexpr-vs-map:11 profile_map
(call)|         1|      337.413|      337.413| 30.71%|# genexpr-vs-map:14 profile_loop

Tested on Debian with python3-pprofile 2.0.5-1 and python3.9 3.9.2-1.

vpelletier commented 3 years ago

pprofile works by registering a callback with the Python interpreter, so its overhead on such tight loops depends heavily on how many times the callback is triggered. map does not trigger this callback at all, as it is not implemented in Python, so it has the lowest profiling overhead. Apparently the generator expression triggers the callback three times per iteration (one call/return event pair plus one line-execution event), while the list comprehension triggers it once per iteration and the loop twice per iteration.
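The effect can be reproduced with a minimal sketch of the same mechanism, counting trace events via `sys.settrace` directly. This illustrates the callback traffic a deterministic pure-python profiler pays for; it is not pprofile's actual implementation:

```python
import sys

def count_events(func):
    # Count every call/line/return event the interpreter reports while
    # func runs -- the callback traffic a settrace-based profiler sees.
    counts = {"call": 0, "line": 0, "return": 0}
    def tracer(frame, event, arg):
        if event in counts:
            counts[event] += 1
        return tracer  # keep receiving 'line' events inside each new frame
    sys.settrace(tracer)
    try:
        func()
    finally:
        sys.settrace(None)
    return counts

r = range(1000)
genexpr = count_events(lambda: tuple(str(n) for n in r))
mapped = count_events(lambda: tuple(map(str, r)))
```

The generator expression fires thousands of events (roughly three per item, matching the call/return pair plus line event described above), while the map version fires only the handful caused by the wrapping lambda: its iteration happens entirely in C, invisible to the trace callback.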

I chose this design because I value same-codebase portability between interpreters more than the absolute lowest profiling overhead, which of course breaks down catastrophically on micro-benchmarks.

Auto-calibration of the profiling overhead is a tempting solution on the surface, but it comes with its own downsides, as it introduces noise into the results: calibration may happen while the CPU is still at a lower frequency, causing an overestimate if the profiled code then runs at a higher frequency, or vice versa. As a consequence, the results could randomly contain nonsensical values, like negative execution times, if the conditions are just wrong enough. Whether implementing such auto-calibration is worth the effort is unclear to me.
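A toy calculation shows how an overestimated per-event cost can push a corrected time negative. All numbers below are made up purely for illustration:

```python
# Hypothetical numbers only, illustrating the failure mode: the per-event
# cost was calibrated while the CPU ran slower than during the profiled run.
events = 100_000          # trace events fired by the profiled loop
true_cost = 4e-6          # actual per-event overhead during the run (s)
calibrated_cost = 6e-6    # overestimated per-event overhead from calibration (s)
real_work = 0.1           # time the code itself needs (s)

measured = real_work + events * true_cost          # what the profiler observes
corrected = measured - events * calibrated_cost    # "calibrated" result: negative!
```

Subtracting 0.6 s of estimated overhead from 0.5 s of measured time yields a nonsensical -0.1 s, exactly the kind of value described above.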

To me, a better alternative is to also try the statistical profiling mode. If the two results diverge, it means one of them is hitting an edge case (here, the extreme profiling overhead of deterministic mode).

inkerman commented 3 years ago

With statistical profiling at a period of .01 s, map is again invisible, but the other results are much more plausible. Also, lines 11, 15, and 21 (but not 7) are always hit exactly once per iteration of the main loop, even if the profiling period is decreased. Similarly, line 39 is always hit exactly 6 times per main-loop iteration.

Do you have any advice for profiling real-world code? Should I be aware of other language constructs invisible to the profiler, like map, or of other pitfalls?

Command line: genexpr-vs-map
Total duration: 341.422s
File: genexpr-vs-map
File duration: 0s (0.00%)
Line #|      Hits|         Time| Time per hit|      %|Source code
------+----------+-------------+-------------+-------+-----------
     1|         0|            0|            0|  0.00%|#!/usr/bin/python3
     2|         0|            0|            0|  0.00%|
     3|         0|            0|            0|  0.00%|import sys
     4|         0|            0|            0|  0.00%|
     5|         0|            0|            0|  0.00%|def profile_generator_expression():
     6|      4803|            0|            0|  0.00%|    resultgen = (str(n) for n in r)
     7|         0|            0|            0|  0.00%|    return tuple(resultgen)
(call)|      4803|            0|            0|  0.00%|# gvmm:6 <genexpr>
     8|         0|            0|            0|  0.00%|
     9|         0|            0|            0|  0.00%|def profile_list_comprehension():
    10|      5230|            0|            0|  0.00%|    result = [str(n) for n in r]
(call)|      5230|            0|            0|  0.00%|# gvmm:10 <listcomp>
    11|        10|            0|            0|  0.00%|    return tuple(result)
    12|         0|            0|            0|  0.00%|
    13|         0|            0|            0|  0.00%|def profile_map():
    14|         0|            0|            0|  0.00%|    resultgen = map(str, r)
    15|        10|            0|            0|  0.00%|    return tuple(resultgen)
    16|         0|            0|            0|  0.00%|
    17|         0|            0|            0|  0.00%|def profile_loop():
    18|         0|            0|            0|  0.00%|    result = []
    19|       296|            0|            0|  0.00%|    for n in r:
    20|      5581|            0|            0|  0.00%|        result.append(str(n))
    21|        10|            0|            0|  0.00%|    return tuple(result)
    22|         0|            0|            0|  0.00%|
    23|         0|            0|            0|  0.00%|r = range(10 ** 7)
    24|         0|            0|            0|  0.00%|
    25|         0|            0|            0|  0.00%|profile = (
    26|         0|            0|            0|  0.00%|    profile_generator_expression,
    27|         0|            0|            0|  0.00%|    profile_list_comprehension,
    28|         0|            0|            0|  0.00%|    profile_map,
    29|         0|            0|            0|  0.00%|    profile_loop,
    30|         0|            0|            0|  0.00%|)
    31|         0|            0|            0|  0.00%|
    32|         0|            0|            0|  0.00%|if len(sys.argv) == 1:
    33|         0|            0|            0|  0.00%|    arg = '0123'
    34|         0|            0|            0|  0.00%|else:
    35|         0|            0|            0|  0.00%|    arg, = sys.argv[1:]
    36|         0|            0|            0|  0.00%|
    37|         0|            0|            0|  0.00%|for _ in range(10):
    38|         0|            0|            0|  0.00%|    for ch in arg:
    39|        60|            0|            0|  0.00%|        profile[int(ch)]()
(call)|      4803|            0|            0|  0.00%|# gvmm:5 profile_generator_expression
(call)|      5240|            0|            0|  0.00%|# gvmm:9 profile_list_comprehension
(call)|        10|            0|            0|  0.00%|# gvmm:13 profile_map
(call)|      5887|            0|            0|  0.00%|# gvmm:17 profile_loop
vpelletier commented 3 years ago

map being invisible in statistical mode is likely because it does not release the GIL: statistical mode spawns a Python thread running essentially a loop, a sleep, and a call-stack sampling call. For that thread to run, Python code must be allowed to execute, which means the GIL must be released. The intent of this mode is to be used on sampling sessions long enough that the thread runs many times; while it will not trigger with perfect regularity, it should average out and catch the hot spots anyway. I would consider your code to be running for long enough, and it is annoying that map is able to fly under the radar.
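For illustration, a toy sampler built on the same idea might look like this. It is stdlib-only, `statistical_sample` and `busy` are invented names, and it sketches the mechanism rather than pprofile's code:

```python
import collections
import sys
import threading
import time

def statistical_sample(workload, period=0.005):
    """Toy statistical profiler: a thread that sleeps, then snapshots every
    other thread's current stack frame. It can only take a sample when the
    GIL is released, so a workload that stays inside one long C call
    (like tuple(map(str, huge_range))) goes unsampled."""
    hits = collections.Counter()
    stop = threading.Event()
    def sampler():
        me = threading.get_ident()
        while not stop.is_set():
            time.sleep(period)
            for ident, frame in sys._current_frames().items():
                if ident != me:  # do not sample the sampler itself
                    hits[(frame.f_code.co_name, frame.f_lineno)] += 1
    thread = threading.Thread(target=sampler)
    thread.start()
    workload()
    stop.set()
    thread.join()
    return hits

def busy():
    # Pure-python loop: executes bytecode, so the GIL switches regularly
    # and the sampler thread gets a chance to run.
    deadline = time.monotonic() + 0.3
    while time.monotonic() < deadline:
        sum(i * i for i in range(100))

samples = statistical_sample(busy)
```

Because `busy` is pure Python, the sampler collects hits; swap in a single long C-level call and the hit counts for it would collapse, which is the map-flying-under-the-radar effect described above.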

There are certainly more gotchas that I have gotten used to, so I cannot give you a clean, exhaustive list.

For real-world code profiling, my first recommendation would be to use the callgrind output format and browse it with kcachegrind (or its sibling qcachegrind, depending on your OS). The annotated output is fine for bug reports and small codebases, but quite impractical to browse.
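For reference, producing a callgrind file from the command line looks something like the following (flag names as I recall them from pprofile's documentation, so double-check against `pprofile --help`; `myscript.py` is a placeholder):

```shell
# Deterministic mode, callgrind output; open the result with kcachegrind.
# The cachegrind.out.* name is the convention kcachegrind auto-detects.
pprofile --format callgrind --out cachegrind.out.myscript myscript.py

# Statistical mode with a 10 ms sampling period, same output format.
pprofile --statistic .01 --format callgrind --out cachegrind.out.myscript myscript.py
```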

My typical approach is to identify hot spots using statistical profiling (as it has a low overhead), then write code to exercise those hot spots outside of the whole program so that it runs for less than a minute, and then get more detail using deterministic profiling. Depending on how well I know the code, call counts can be a more useful indicator than execution duration (e.g., if my exercising code processes 1k items and I see some expensive calls being made N*1k times, that is where I would start looking).
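The "exercise the hot spot, then sanity-check call counts" step can be sketched dependency-free with the stdlib's cProfile standing in for pprofile (`convert` and `exercise` are hypothetical stand-ins for the expensive call and its harness):

```python
import cProfile
import pstats

def convert(n):
    # Hypothetical hot spot found by statistical profiling.
    return str(n)

def exercise(items):
    # Harness feeding the hot spot a known number of items, so call
    # counts are easy to check against expectations.
    return [convert(n) for n in items]

profiler = cProfile.Profile()
profiler.enable()
exercise(range(1000))
profiler.disable()

# pstats stores per-function tuples (cc, ncalls, tottime, cumtime, callers);
# extract the total call count per function name.
calls = {
    funcname: ncalls
    for (filename, lineno, funcname), (cc, ncalls, tt, ct, callers)
    in pstats.Stats(profiler).stats.items()
}
```

Feeding 1k items and seeing `convert` called exactly 1k times is the expected baseline; a count of N*1k would be the red flag described above.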

Another consideration is what the code's lifecycle looks like. For example, in a long-running process with occasional periods of interest, you will want some way to trigger profiling from outside the process. How to do this depends a lot on the implementation details of the process: maybe a signal handler that writes the profiling result to some pre-determined file, or some more elaborate IPC, depending on what is available.
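A minimal sketch of the signal-handler approach, assuming a Unix platform and using stdlib cProfile as a stand-in profiler (`ProfileToggle` and the dump path are invented for illustration):

```python
import cProfile
import os
import signal
import tempfile

# Pre-determined output file, as suggested above (path chosen arbitrarily).
DUMP_PATH = os.path.join(tempfile.gettempdir(), "profile.dump")

class ProfileToggle:
    """First SIGUSR1 starts profiling; the second stops it and dumps
    the stats to a pre-determined file for offline inspection."""
    def __init__(self, dump_path):
        self.dump_path = dump_path
        self.profiler = None
    def __call__(self, signum, frame):
        if self.profiler is None:
            self.profiler = cProfile.Profile()
            self.profiler.enable()
        else:
            self.profiler.disable()
            self.profiler.dump_stats(self.dump_path)
            self.profiler = None

toggle = ProfileToggle(DUMP_PATH)
signal.signal(signal.SIGUSR1, toggle)
```

An operator can then bracket the interesting period with `kill -USR1 <pid>` twice and load the dump with pstats (or browse it after conversion). A real deployment would also need to consider which thread handles the signal and whether profiling should cover all threads.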

inkerman commented 3 years ago

Thank you for your help, I will try to make good use of what is available.