Different approach with a few benefits

notEvil commented 10 months ago

Hi,

I've been tinkering with line_profiler and implemented a different approach to line profiling:

https://github.com/notEvil/line_profiler/tree/sub

Essentially, it keeps track of "calls" (PyTrace_CALL and PyTrace_RETURN are used for generators and async as well) and by doing so records "total time": time spent within a function call, generator or async step, excluding time spent in line profiled functions that were called. From there it calculates "cumulative time".

This approach has multiple benefits:

"total time"
times in recursive function calls are attributed correctly (see comparisons below)
times add up (see https://github.com/pyutils/line_profiler#frequently-asked-questions)
calls can be traced
- the prototype creates simple statistics

Potential downside is increased overhead, but in my testing it wasn't much slower.

I've also created a Neovim plugin (not published yet) to see the differences:

```python import time @profile def f(): time.sleep(0.1) (g(1), g(2)) g(3) @profile def g(n): if n == 0: return time.sleep(0.1) g(n - 1) f() ``` Original ![Screenshot from 2023-08-12 13-35-01 crop](https://github.com/pyutils/line_profiler/assets/4567628/d554a546-4d45-4050-9066-113a783be015) New cumulative ![Screenshot from 2023-08-12 13-35-21 crop](https://github.com/pyutils/line_profiler/assets/4567628/ea3c8b60-7283-4b86-9e36-43d22478d1e4) New total ![Screenshot from 2023-08-12 13-35-37 crop](https://github.com/pyutils/line_profiler/assets/4567628/86c6c160-88a5-4e77-aecd-9c6a877bbc3c)

Tested on a real application (40s runtime, no async)

Erotemic commented 10 months ago

It would help to see the python -m line_profiler <script>.lprof results in addition to the neovim (heatmap?) visualization to get a better sense of what is happening and what the difference is.

notEvil commented 10 months ago

There you go

Original ``` Timer unit: 1e-06 s Total time: 0.70161 s File: /home/arappold/git/line_profiler/sub.py Function: f at line 4 Line # Hits Time Per Hit % Time Line Contents ============================================================== 4 @profile 5 def f(): 6 1 100060.9 100060.9 14.3 time.sleep(0.1) 7 1 300288.5 300288.5 42.8 (g(1), g(2)) 8 1 301260.9 301260.9 42.9 g(3) Total time: 0.601441 s File: /home/arappold/git/line_profiler/sub.py Function: g at line 11 Line # Hits Time Per Hit % Time Line Contents ============================================================== 11 @profile 12 def g(n): 13 9 6.2 0.7 0.0 if n == 0: 14 3 0.9 0.3 0.0 return 15 16 6 601357.4 100226.2 100.0 time.sleep(0.1) 17 6 76.5 12.8 0.0 g(n - 1) ``` New cumulative ``` Timer unit: 1e-06 s Total time: 0.701555 s File: /home/arappold/git/line_profiler/sub.py Function: f at line 4 Line # Hits Time Per Hit % Time Line Contents ============================================================== 4 @profile 5 def f(): 6 1 100061.0 100061.0 14.3 time.sleep(0.1) 7 1 300265.7 300265.7 42.8 (g(1), g(2)) 8 1 301228.6 301228.6 42.9 g(3) Total time: 1.00325 s File: /home/arappold/git/line_profiler/sub.py Function: g at line 11 Line # Hits Time Per Hit % Time Line Contents ============================================================== 11 @profile 12 def g(n): 13 1 7.2 7.2 0.0 if n == 0: 14 1 3.6 3.6 0.0 return 15 16 1 601360.9 601360.9 59.9 time.sleep(0.1) 17 1 401881.4 401881.4 40.1 g(n - 1) ``` The prototype doesn't yet count the number of hits (not difficult to add)

Theelx commented 10 months ago

That looks promising, but the tracing seems to have a 4x overhead based on the return function in the base case (assuming it was hit 3 times in your cumulative version). While that's not too significant in this case, it can be very significant in tighter loops, and it would be beneficial to create a tight loop to test this in.

notEvil commented 10 months ago

How do you infer a 4x overhead from that output? I just tested

@profile
def f():
    for _ in range(int(1e7)):
        g()

@profile
def g():
    pass

f()

and it takes 22.37s with v4.0.3 and 18.53s with the new approach. The prototype runs the new approach in addition to the original one, so I temporarily disabled the original part for this crude benchmark.

Theelx commented 10 months ago

Ah, I looked at the return statement in the base case and it looked like it took 1.2us per hit in your version whereas the original took 0.3us per hit. Usually, very simple lines like a plain return statement are virtually instant in Python, and almost all the time taken is in the profiler.

Theelx commented 10 months ago

Here's a benchmark that shows around a 11x slowdown with the current line_profiler vs unprofiled code (total time 500ms unprofiled vs 5.4s profiled):

Time Elapsed tight: 494.96ms
Time Elapsed tight: 5390.15ms
Slowdown from profiling: 4895.19ms, 10.89x

Total time: 5.39021 s
File: tight.py
Function: main at line 3

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           def main():
     4         1          3.1      3.1      0.0      start = time.perf_counter()
     5  20000001    2628245.8      0.1     48.8      for x in range(20000000):
     6  20000000    2761886.3      0.1     51.2          y = x
     7         1         31.5     31.5      0.0      elapsed_ms = round((time.perf_counter()-start)*1000, 2)
     8         1         42.2     42.2      0.0      print(f"Time Elapsed tight: {elapsed_ms}ms")
     9         1          1.3      1.3      0.0      return elapsed_ms

Your code gives these results:

Time Elapsed tight: 498.77ms
Time Elapsed tight: 18367.76ms
Slowdown from profiling: 17868.989999999998ms, 36.83x
Wrote profile results to tight.py.lprof
Timer unit: 1e-06 s

Total time: 5.91982 s
File: /home/theel/Coding/line_profiler/line_profiler_cumulative/tight.py
Function: main at line 3

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           def main():
     4         1          1.7      1.7      0.0      start = time.perf_counter()
     5  20000001    2961819.0      0.1     50.0      for x in range(20000000):
     6  20000000    2957938.7      0.1     50.0          y = x
     7         1         18.6     18.6      0.0      elapsed_ms = round((time.perf_counter()-start)*1000, 2)
     8         1         39.2     39.2      0.0      print(f"Time Elapsed tight: {elapsed_ms}ms")
     9         1          0.6      0.6      0.0      return elapsed_ms

I believe the total time elapsed shown by line_profiler is incorrect, as I definitely waited at least 14 seconds for yours to finish (based off watching the clock). This indicates that your code has a ~37x slowdown, which is ~3x more overhead than the current line_profiler.

Theelx commented 10 months ago

The thing to remember (which I forgot in my first comment here!) is that line_profiler's measurements of time taken in the output aren't actually the overhead, as line_profiler calls hpTimer() to try to measure the overhead, and eliminate it from the output.

notEvil commented 10 months ago

I fail to reproduce your results. Could you try to run them again with the new commit. I changed a few things so you can get a fair comparison with your benchmark.

Theelx commented 10 months ago

This version is significantly better, but still about 1.5x slower:

Time Elapsed tight: 513.24ms
Time Elapsed tight: 9631.74ms
Slowdown from profiling: 9118.5ms, 18.77x
Wrote profile results to tight.py.lprof
Timer unit: 1e-06 s

Total time: 4.17773 s
File: /home/theel/Coding/line_profiler/line_profiler_cumulative/tight.py
Function: main at line 3

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           def main():
     4         1          1.7      1.7      0.0      start = time.perf_counter()
     5         1    2005221.9 2005221.9     48.0      for x in range(20000000):
     6         1    2172444.6 2172444.6     52.0          y = x
     7         1         21.0     21.0      0.0      elapsed_ms = round((time.perf_counter()-start)*1000, 2)
     8         1         38.0     38.0      0.0      print(f"Time Elapsed tight: {elapsed_ms}ms")
     9         1          0.7      0.7      0.0      return elapsed_ms

notEvil commented 10 months ago

The weird thing is, my results are the complete opposite. v4.0.3 gives me "Time Elapsed tight: 12030.22ms" while the new commit "Time Elapsed tight: 6169.69ms" (tripple checked). My process

checkout tag/branch
pipenv --rm
pipenv run pip install /path/to/repo (Python 3.11.3)
pipenv run kernprof -l ...

Theelx commented 10 months ago

Ohhhhh, the issue is that I was using my fork from #203 instead of the current one. I can confirm that yours is faster than the version of line_profiler in this repo, but it would be slower if #203 were to be merged.

notEvil commented 10 months ago

Great to hear :) I think most improvements can be carried over. I'll give it a try sometime next week

Theelx commented 10 months ago

See https://github.com/notEvil/line_profiler/pull/1

notEvil commented 10 months ago

Nice, thanks! So if I understand correctly, performance is fine.

Regarding the differences to the current approach, the implementation at the bottom of _line_profiler.pyx is supposed to be readable. Just ignore the parts about call. If its not clear, I can add comments pointing out its details.

notEvil commented 10 months ago

The latest commit now counts the number of line hits and provides the times in a backwards compatible way.

Timer unit: 1e-06 s

Total time: 0.700474 s
File: /home/arappold/git/line_profiler/sub.py
Function: f at line 4

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     4                                           @profile
     5                                           def f():
     6         1     100057.6 100057.6     14.3      time.sleep(0.1)
     7         1     300195.1 300195.1     42.9      (g(1), g(2))
     8         1     300221.3 300221.3     42.9      g(3)

Total time: 1.00071 s
File: /home/arappold/git/line_profiler/sub.py
Function: g at line 11

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    11                                           @profile
    12                                           def g(n):
    13         9          2.4      0.3      0.0      if n == 0:
    14         3          0.5      0.2      0.0          return
    15                                           
    16         6     600365.7 100060.9     60.0      time.sleep(0.1)
    17         6     400342.9  66723.8     40.0      g(n - 1)

"Total time" reported for g is wrong because its summing up cumulative times which don't add up for recursive functions. Total times would, but the default output should remain cumulative times. So maybe remove the "% Time" column if its a recursive function (can be inferred from call stats)?

notEvil commented 10 months ago

An example of what call stats would allow: graph.dot.pdf created with https://github.com/notEvil/line_profiler/blob/sub/graph.py edit: keep in mind that generators create "calls" on every next/yield and therefore directly into sub functions

notEvil commented 10 months ago

I'm happy to announce the next iteration :) Getting useful numbers in case of recursive code was a lot more difficult than anticipated, but here they are nonetheless. The last commit

adds block_times: how much time is spent in each function,
adds call_times: how much time is spent in the function called
and fixes "cumulative time" of lines

It is impossible to derive block_times and call_times from the line timings nor from call stats since in the presence of recursion, cumulative times don't add up.

Now the output for the initial example is

Timer unit: 1e-06 s

Total time: 0.700534 s
File: /home/arappold/git/line_profiler/example1.py
Function: f at line 4

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     4                                           @profile
     5                                           def f():
     6         1     100061.6 100061.6     14.3      time.sleep(0.1)
     7         1     300238.6 300238.6     42.9      (g(1), g(2))
     8         1     300233.5 300233.5     42.9      g(3)

Total time: 0.600456 s
File: /home/arappold/git/line_profiler/example1.py
Function: g at line 11

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    11                                           @profile
    12                                           def g(n):
    13         9          3.6      0.4      0.0      if n == 0:
    14         3          0.8      0.3      0.0          return
    15                                           
    16         6     600380.3 100063.4    100.0      time.sleep(0.1)
    17         6     300267.2  50044.5     50.0      g(n - 1)

Testet with

Code

```python import time def wait(n): time.sleep(n * 0.01) @profile def f(n): f1(n) f2(n) f3(n) f4(n) f5(n) f6(n) @profile def f1(n): if n == 0: return wait(1) f1(n - 1) @profile def f2(n): if n == 0: return wait(1) (f2(n - 1), f2(n // 2)) @profile def f3(n): if n == 0: return wait(1) f3_2(n - 1) @profile def f3_2(n): if n == 0: return wait(1) f3(n - 1) @profile def f4(n): if n == 0: return wait(1) (f4_2(n - 1), f4_2(n // 2)) @profile def f4_2(n): if n == 0: return wait(1) (f4(n - 1), f4(n // 2)) @profile def f5(n): if n == 0: return wait(1) f5_2(n - 1) @profile def f5_2(n): if n == 0: return wait(1) f5_3(n - 1) @profile def f5_3(n): if n == 0: return wait(1) f5(n - 1) @profile def f6(n): if n == 0: return wait(1) (f6_2(n - 1), f6_2(n // 2)) @profile def f6_2(n): if n == 0: return wait(1) (f6_3(n - 1), f6_3(n // 2)) @profile def f6_3(n): if n == 0: return wait(1) (f6(n - 1), f6(n // 2)) f(6) ```

[graph.dot.pdf](https://github.com/pyutils/line_profiler/files/12361167/graph.dot.pdf)

Output

``` Timer unit: 1e-06 s Total time: 0.754708 s File: /home/arappold/git/line_profiler/sub.py Function: f at line 8 Line # Hits Time Per Hit % Time Line Contents ============================================================== 8 @profile 9 def f(n): 10 1 60377.5 60377.5 8.0 f1(n) 11 1 191180.4 191180.4 25.3 f2(n) 12 1 60361.0 60361.0 8.0 f3(n) 13 1 191203.2 191203.2 25.3 f4(n) 14 1 60369.8 60369.8 8.0 f5(n) 15 1 191216.3 191216.3 25.3 f6(n) Total time: 0.0603742 s File: /home/arappold/git/line_profiler/sub.py Function: f1 at line 18 Line # Hits Time Per Hit % Time Line Contents ============================================================== 18 @profile 19 def f1(n): 20 7 1.3 0.2 0.0 if n == 0: 21 1 0.1 0.1 0.0 return 22 23 6 60350.0 10058.3 100.0 wait(1) 24 6 50312.4 8385.4 83.3 f1(n - 1) Total time: 0.191177 s File: /home/arappold/git/line_profiler/sub.py Function: f2 at line 27 Line # Hits Time Per Hit % Time Line Contents ============================================================== 27 @profile 28 def f2(n): 29 39 4.8 0.1 0.0 if n == 0: 30 20 1.1 0.1 0.0 return 31 32 19 191067.2 10056.2 99.9 wait(1) 33 19 181120.1 9532.6 94.7 (f2(n - 1), f2(n // 2)) Total time: 0.0603579 s File: /home/arappold/git/line_profiler/sub.py Function: f3 at line 36 Line # Hits Time Per Hit % Time Line Contents ============================================================== 36 @profile 37 def f3(n): 38 4 0.6 0.1 0.0 if n == 0: 39 1 0.1 0.1 0.0 return 40 41 3 30169.0 10056.3 50.0 wait(1) 42 3 50301.2 16767.1 83.3 f3_2(n - 1) Total time: 0.0502975 s File: /home/arappold/git/line_profiler/sub.py Function: f3_2 at line 45 Line # Hits Time Per Hit % Time Line Contents ============================================================== 45 @profile 46 def f3_2(n): 47 3 0.4 0.1 0.0 if n == 0: 48 return 49 50 3 30168.7 10056.2 60.0 wait(1) 51 3 40241.4 13413.8 80.0 f3(n - 1) Total time: 0.191199 s File: /home/arappold/git/line_profiler/sub.py Function: f4 at line 54 Line # Hits Time Per Hit % Time Line Contents ============================================================== 54 @profile 55 def f4(n): 56 21 2.8 0.1 0.0 if n == 0: 57 12 0.7 0.1 0.0 return 58 59 9 90512.6 10057.0 47.3 wait(1) 60 9 181143.0 20127.0 94.7 (f4_2(n - 1), f4_2(n // 2)) Total time: 0.181137 s File: /home/arappold/git/line_profiler/sub.py Function: f4_2 at line 63 Line # Hits Time Per Hit % Time Line Contents ============================================================== 63 @profile 64 def f4_2(n): 65 18 2.2 0.1 0.0 if n == 0: 66 8 0.4 0.0 0.0 return 67 68 10 100576.7 10057.7 55.5 wait(1) 69 10 161017.3 16101.7 88.9 (f4(n - 1), f4(n // 2)) Total time: 0.060366 s File: /home/arappold/git/line_profiler/sub.py Function: f5 at line 72 Line # Hits Time Per Hit % Time Line Contents ============================================================== 72 @profile 73 def f5(n): 74 3 0.4 0.1 0.0 if n == 0: 75 1 0.1 0.1 0.0 return 76 77 2 20113.6 10056.8 33.3 wait(1) 78 2 50308.6 25154.3 83.3 f5_2(n - 1) Total time: 0.0503043 s File: /home/arappold/git/line_profiler/sub.py Function: f5_2 at line 81 Line # Hits Time Per Hit % Time Line Contents ============================================================== 81 @profile 82 def f5_2(n): 83 2 0.3 0.2 0.0 if n == 0: 84 return 85 86 2 20115.2 10057.6 40.0 wait(1) 87 2 40247.2 20123.6 80.0 f5_3(n - 1) Total time: 0.0402436 s File: /home/arappold/git/line_profiler/sub.py Function: f5_3 at line 90 Line # Hits Time Per Hit % Time Line Contents ============================================================== 90 @profile 91 def f5_3(n): 92 2 0.3 0.1 0.0 if n == 0: 93 return 94 95 2 20113.8 10056.9 50.0 wait(1) 96 2 30186.7 15093.3 75.0 f5(n - 1) Total time: 0.191212 s File: /home/arappold/git/line_profiler/sub.py Function: f6 at line 99 Line # Hits Time Per Hit % Time Line Contents ============================================================== 99 @profile 100 def f6(n): 101 13 1.8 0.1 0.0 if n == 0: 102 6 0.3 0.0 0.0 return 103 104 7 70401.5 10057.4 36.8 wait(1) 105 7 181155.2 25879.3 94.7 (f6_2(n - 1), f6_2(n // 2)) Total time: 0.181149 s File: /home/arappold/git/line_profiler/sub.py Function: f6_2 at line 108 Line # Hits Time Per Hit % Time Line Contents ============================================================== 108 @profile 109 def f6_2(n): 110 14 1.7 0.1 0.0 if n == 0: 111 8 0.5 0.1 0.0 return 112 113 6 60347.1 10057.8 33.3 wait(1) 114 6 161035.2 26839.2 88.9 (f6_3(n - 1), f6_3(n // 2)) Total time: 0.161023 s File: /home/arappold/git/line_profiler/sub.py Function: f6_3 at line 117 Line # Hits Time Per Hit % Time Line Contents ============================================================== 117 @profile 118 def f6_3(n): 119 12 1.6 0.1 0.0 if n == 0: 120 6 0.3 0.1 0.0 return 121 122 6 60343.6 10057.3 37.5 wait(1) 123 6 120793.6 20132.3 75.0 (f6(n - 1), f6(n // 2)) ```

Theelx commented 10 months ago

Thanks for this great work! I'm going to rebase my PR to your fork on this latest work, could you take a look at it once I do that? After this work of yours is merged, I can start working on adding CPython 3.12 support to line_profiler.

notEvil commented 10 months ago

Thanks :) I would still consider it work-in-progress. The call stats feature is somewhat out of scope of line profiling, naming of stuff breaks with former conventions (which is different from the more common convention of "total" vs "cumulative"), some parts are pretty ugly, the default output from CLI could make better use of the available information, ... And there is performance, which has to be worse than before (when both are equally optimized). It would be great if someone more experienced with the existing code base could review the fork and propose changes.

Theelx commented 10 months ago

I just tested the performance, and it is indeed significantly worse than before. Previously, https://github.com/notEvil/line_profiler/pull/1 was able to pull it down to 13x overhead over no profiling, in the tight loop case from above, whereas now with the same optimizations applied, it'd 16x overhead over no profiling. That being said, I believe I can optimize it further, as some of your data structure choices may be suboptimal performance-wise.

tmm1 commented 10 months ago

@notEvil thanks for sharing this! many years ago i wrote a line profiler for another dynamic language runtime and I think I used a similar call/return stacking approach.

if you get a chance, would you mind taking a look at the merge conflicts pulling in the latest main after the recent autoprofiler changes. would love to try the two together.

notEvil commented 8 months ago

@tmm1 sry for not responding! I will check the conflicts soon, I promise ^^

I've pushed a new commit* which should improve confidence in the solution. The previous version was manually tested on a few examples with manual inspection of the output. graph.py was very helpful, but obviously we'd want something automated with a lot more variation. So I started the journey of implementing strategies for generating code with Hypothesis**, implementing a reference profiler and using it to validate the numbers. And it took a while to work as I wanted it to (statements spanning multiple lines, recursions, generators, async, ...), to look reasonably good and eventually for the tests to pass. So my apologies for keeping this private and not providing any updates.

The testing is not perfect:

It might generate lots of useless examples
It uses busy loops to simulate work
- context switches introduce inaccuracies
- busy loops are more accurate than time.sleep
- inaccurate timings can and will eventually produce false positives
It doesn't validate call profiles (hit counts and cumulative time) and call stats
- a "call" (CALL RETURN pair) might be a function call or a generator/coroutine continuation
- it is not easy to add generator/coroutine "call"s to the reference profiler
- I don't think it makes sense to further complicate the Cython code to distinguish them

What I like:

The generated examples look very diverse and complex
The reference profiler is very simple

So now that it "works" and won't get any more complex, I'd say its a good time to review and benchmark.

https://github.com/notEvil/line_profiler/tree/sub

* changes

added counting of hits for blocks and calls
added counting of primitive hits for lines, blocks and calls
added randomized tests
improved graph.py to show the additional information
renamed structs LineTime and Timing to CLineProfile and LineProfile respectively
added structs CBlockProfile and CCallProfile, and classes BlockProfile and CallProfile
fixed minor regressions

** If you don't know Hypothesis yet, I'd definitely recommend checking it out!

notEvil commented 8 months ago

@tmm1 just merged main and except for trivial conflicts it went smooth. The randomized tests still pass :) ~~Beware though that line_profiler.py isn't adjusted yet and might show unexpected things, especially aggregates.~~

@Theelx you created a PR back then for better performance and comparisons. I didn't merge because I didn't want to mix the two proposals and it looked like it was based off of https://github.com/pyutils/line_profiler/pull/203 but without git reference (merge). Would you consider the PR mature enough to merge with sub on a test branch?

notEvil commented 8 months ago

Just a quick update: I tried to measure and improve performance and got some interesting results. The benchmark is https://gitlab.com/notEvil/code_collapse applied to itself which spends most of the time in Black and the remaining time in nested generators (tree traversal) and LibCST.

name	Python 3.11	vs baseline	Python 3.12	vs baseline
baseline	16.1		34.3
v4.1.2	27.2	168.9%	178.6	520.7%
sub	28.2	175.2%	167.1	487.2%
key opt	28	173.9%	161.6	471.1%
tss opt*	27.6	171.4%
preshed opt*	27.2	168.9%
map opt*	26.3	163.4%
time opt*	25.8	160.3%
all opt	25.2	156.5%	150.9	439.9%

key opt: use address of bytecode << 24 + line number as key (with lower bits of address as offset for some randomness)
- not worth it imo, but the first one I tried
tss opt: store thread id in thread local storage to save some threading.get_ident()
preshed opt: use Preshed for entry check
map opt: use parallel_flat_hash_map and flat_hash_map from parallel-hashmap instead of unordered_map
time opt: check time just after entry check instead of asap
all opt: all combined, see https://github.com/notEvil/line_profiler/tree/subopt

As you can see, I wasn't able to significantly improve performance of sub. Maybe time opt is worth the tradeoff (creates some bias), what do you think?

I will try to figure out why Python 3.12 is performing so bad compared to 3.11 on my project. Regarding the bad performance when line profiling on 3.12, can you reproduce this? e.g. jsonpickle

Update 2023-11-04

Python 3.12 is a bit slower than 3.11 in general, but the large difference is due to Black not compiling for 3.12 (due to a bug in Mypyc). So the benchmark is flawed in so far as it presents a measure of slowdown that is not representative (most time is spent in compiled code which doesn't add any overhead). Maybe we could run pyperformance somehow.

Also regarding key opt, turns out it is not uncommon for short functions to have the exact same bytecode (get_metadata in pyperf and at least two in re) and therefore the same hash. So sub is currently broken (https://github.com/notEvil/line_profiler/blob/f0c17516e1012dd99e9e161a6a5057a97b60a02b/line_profiler/_line_profiler.pyx#L593) and current line_profiler is affected if the line ranges of such functions intersect. key opt would solve it.

* either directly from https://github.com/pyutils/line_profiler/pull/203 or inspired by

notEvil commented 8 months ago

Good news I guess:

| name | baseline | v4.1.2 | sub | preshed opt | key opt | tss opt | time opt | map opt | subopt | |:---------------------------|-----------:|---------:|:---------|--------------:|----------:|:----------|:-----------|:----------|:----------| | generators | 57.18 | 52.70 | 58.95 | 56.27 | 56.43 | 53.36 | 53.65 | 52.26 | **48.58** | | deltablue | 4.46 | 36.44 | 32.43 | 30.90 | 30.04 | 28.51 | 28.67 | 28.87 | **25.98** | | richards | 54.57 | 34.22 | 28.39 | 26.44 | 25.54 | **23.47** | 24.36 | 23.68 | 23.72 | | richards_super | 66.02 | 31.38 | 26.71 | 24.68 | 23.29 | 21.88 | 22.82 | 22.03 | **21.61** | | raytrace | 331.44 | 26.21 | 25.36 | 24.02 | 22.78 | 21.72 | 21.95 | 21.25 | **20.72** | | scimark_sor | 135.53 | 26.55 | 24.00 | 23.27 | 22.11 | 20.61 | 21.08 | 20.71 | **20.41** | | scimark_lu | 128.57 | 24.43 | 21.93 | 21.37 | 20.67 | 19.62 | 19.61 | **19.04** | 19.37 | | hexiom | 7.08 | 24.97 | 22.51 | 21.27 | 20.02 | 18.92 | 19.62 | 18.90 | **18.42** | | go | 154.55 | 31.02 | 23.95 | 21.68 | 21.09 | 19.64 | 20.74 | 19.70 | **18.35** | | spectral_norm | 116.12 | 19.70 | 17.70 | 17.10 | 16.44 | 15.51 | 15.74 | 15.54 | **15.18** | | logging_silent | 0.00 | 24.76 | 19.56 | 16.20 | 16.40 | 14.26 | 16.67 | 14.63 | **13.42** | | scimark_monte_carlo | 75.35 | 23.64 | 16.81 | 15.02 | 14.47 | 13.15 | 14.32 | 13.40 | **12.67** | | pyflate | 484.89 | 18.75 | 14.83 | 13.62 | 13.20 | **11.84** | 12.91 | 12.24 | 11.91 | | float | 90.14 | 16.91 | 13.15 | 12.60 | 12.27 | 11.48 | 11.83 | 11.47 | **11.33** | | deepcopy_memo | 0.04 | 25.66 | 17.43 | 13.73 | 14.40 | 12.50 | 14.69 | 13.26 | **11.31** | | chaos | 75.99 | 17.22 | 13.79 | 12.68 | 12.63 | 11.41 | 12.09 | 11.56 | **11.24** | | unpack_sequence | 0.00 | 19.93 | 12.92 | 11.42 | 11.29 | 10.12 | 11.33 | 10.69 | **10.10** | | deepcopy_reduce | 0.00 | 18.36 | 13.14 | 10.96 | 11.23 | 9.99 | 11.39 | 10.40 | **9.33** | | deepcopy | 0.40 | 18.00 | 12.54 | 10.02 | 10.59 | 9.40 | 11.00 | 9.37 | **8.31** | | pprint_pformat | 1645.84 | 15.44 | 10.61 | 9.34 | 9.27 | 8.11 | 9.24 | 8.54 | **7.90** | | logging_simple | 0.01 | 13.87 | 10.63 | 8.89 | 8.88 | 7.92 | 8.84 | 8.01 | **7.59** | | pprint_safe_repr | 804.19 | 15.28 | 10.22 | 8.56 | 8.73 | 7.59 | 8.81 | 8.08 | **7.54** | | logging_format | 0.01 | 12.97 | 10.10 | 8.41 | 8.42 | 7.58 | 8.42 | 7.61 | **7.25** | | nbody | 112.98 | 14.13 | 8.85 | 7.67 | 7.67 | 6.83 | 7.76 | 7.32 | **6.74** | | comprehensions | 0.03 | 7.25 | 7.35 | 7.12 | 7.03 | 6.73 | 6.31 | 6.52 | **6.30** | | fannkuch | 426.18 | 14.32 | 8.48 | 7.33 | 7.16 | **6.16** | 7.23 | 7.00 | 6.18 | | tomli_loads | 2337.04 | 6.11 | 6.12 | 6.78 | 6.14 | 6.03 | **4.79** | 5.57 | 5.60 | | coroutines | 28.96 | 5.29 | 5.58 | 5.71 | 5.71 | 5.61 | **4.92** | 5.37 | 5.05 | | genshi_text | 25.17 | 5.47 | 6.08 | 5.88 | 6.00 | 5.99 | **4.88** | 5.44 | 4.99 | | scimark_sparse_mat_mult | 4.77 | 11.09 | 6.65 | 5.73 | 5.69 | 5.06 | 5.73 | 5.26 | **4.95** | | scimark_fft | 358.69 | 10.09 | 6.39 | 5.61 | 5.46 | 4.83 | 5.54 | 5.17 | **4.79** | | genshi_xml | 57.81 | 5.76 | 5.91 | 5.51 | 5.64 | 5.42 | 4.77 | 5.12 | **4.72** | | chameleon | 7.54 | 5.41 | 5.66 | 5.63 | 5.57 | 5.37 | **4.40** | 5.11 | 4.67 | | nqueens | 93.59 | 7.13 | 6.04 | 5.60 | 5.55 | 5.28 | 5.12 | 5.22 | **4.67** | | meteor_contest | 119.01 | 9.93 | 5.96 | 5.21 | 5.19 | 4.56 | 5.23 | 4.88 | **4.39** | | async_generators | 409.54 | 5.05 | 4.87 | 4.81 | 4.77 | 4.66 | 4.34 | 4.39 | **4.29** | | crypto_pyaes | 82.58 | 8.73 | 5.38 | 4.89 | 4.69 | 4.27 | 4.68 | 4.50 | **4.21** | | sqlglot_parse | 1.58 | 4.52 | 4.67 | 4.66 | 4.66 | 4.68 | **3.99** | 4.30 | 4.19 | | coverage | 82.92 | 4.06 | 4.06 | 4.08 | 4.08 | 4.05 | 4.10 | **4.03** | 4.05 | | async_tree_none_tg | 536.83 | 7.00 | 5.52 | 4.77 | 4.86 | 4.14 | 4.53 | 4.18 | **3.96** | | sqlglot_transpile | 1.92 | 4.31 | 4.43 | 4.43 | 4.41 | 4.44 | **3.77** | 4.05 | 3.94 | | mako | 11.94 | 4.36 | 4.44 | 4.38 | 4.45 | 4.41 | **3.82** | 4.11 | 3.91 | | xml_etree_process | 62.94 | 4.72 | 4.57 | 4.76 | 4.64 | 4.55 | **3.70** | 4.29 | 3.88 | | async_tree_memoization_tg | 718.20 | 6.29 | 4.99 | 4.41 | 4.43 | 3.98 | 4.48 | 3.99 | **3.75** | | sqlglot_normalize | 123.56 | 3.98 | 4.17 | 4.31 | 4.11 | 4.08 | **3.55** | 3.77 | 3.69 | | async_tree_none | 603.23 | 6.02 | 4.83 | 4.20 | 4.20 | 3.84 | 4.17 | 3.85 | **3.69** | | async_tree_memoization | 765.38 | 5.85 | 4.93 | 4.11 | 4.16 | 3.80 | 4.27 | 3.79 | **3.64** | | async_tree_io_tg | 1354.10 | 5.92 | 4.79 | 4.20 | 4.24 | 3.82 | 4.18 | 3.85 | **3.57** | | async_tree_io | 1407.59 | 5.66 | 4.64 | 4.01 | 4.07 | 3.70 | 4.02 | 3.70 | **3.51** | | sqlglot_optimize | 60.38 | 3.86 | 3.97 | 4.00 | 3.99 | 3.94 | **3.43** | 3.55 | 3.48 | | html5lib | 67.36 | 4.46 | 4.00 | 3.85 | 3.86 | 3.77 | 3.45 | 3.66 | **3.38** | | async_tree_cpu_io_mixed_tg | 823.45 | 5.33 | 4.33 | 3.72 | 3.75 | 3.41 | 3.71 | 3.40 | **3.24** | | async_tree_cpu_io_mixed | 850.96 | 5.24 | 4.14 | 3.59 | 3.62 | 3.33 | 3.60 | 3.32 | **3.17** | | xml_etree_generate | 91.35 | 3.78 | 3.61 | 3.89 | 3.65 | 3.63 | **3.00** | 3.41 | 3.12 | | mdp | 3100.88 | 4.11 | 3.64 | 3.41 | 3.39 | 3.24 | 3.14 | 3.15 | **3.01** | | json_dumps | 13.07 | 5.30 | 3.89 | 3.24 | 3.36 | 2.99 | 3.38 | 3.10 | **2.84** | | sqlite_synth | 0.00 | 3.98 | 3.18 | 3.01 | 3.01 | 2.81 | 2.95 | 2.88 | **2.78** | | telco | 7.70 | 5.50 | 3.72 | 3.07 | 3.03 | **2.72** | 3.08 | 2.95 | 2.77 | | sympy_expand | 541.44 | 2.84 | 2.96 | 3.05 | 2.96 | 2.92 | **2.51** | 2.76 | 2.76 | | sympy_str | 334.32 | 2.75 | 2.88 | 3.01 | 2.89 | 2.86 | **2.47** | 2.68 | 2.71 | | sympy_integrate | 23.89 | 2.81 | 2.95 | 3.10 | 2.98 | 2.95 | **2.44** | 2.74 | 2.70 | | typing_runtime_protocols | 0.54 | 2.83 | 2.82 | 2.94 | 2.76 | 2.76 | **2.51** | 2.66 | 2.63 | | pathlib | 20.81 | 3.95 | 3.37 | 2.94 | 3.03 | 2.74 | 2.99 | 2.79 | **2.62** | | dulwich_log | 82.35 | 3.18 | 2.96 | 2.87 | 2.87 | 2.76 | 2.64 | 2.63 | **2.59** | | sqlalchemy_imperative | 22.06 | 2.80 | 2.75 | 2.67 | 2.70 | 2.70 | **2.45** | 2.49 | 2.45 | | sympy_sum | 196.00 | 2.39 | 2.48 | 2.54 | 2.47 | 2.45 | **2.17** | 2.31 | 2.30 | | create_gc_cycles | 1.29 | 2.66 | 1.95 | 2.87 | 2.48 | 2.33 | **1.91** | 2.38 | 2.27 | | tornado_http | 146.44 | 2.48 | 2.43 | 2.60 | 2.36 | 2.33 | **2.18** | 2.25 | 2.26 | | sqlalchemy_declarative | 154.12 | 2.38 | 2.39 | 2.47 | 2.37 | 2.35 | **2.12** | 2.22 | 2.24 | | docutils | 3014.27 | 2.41 | 2.45 | 2.46 | 2.43 | 2.41 | **2.14** | 2.32 | 2.20 | | xml_etree_iterparse | 126.73 | 2.22 | 2.29 | 2.39 | 2.32 | 2.25 | **1.94** | 2.18 | 2.05 | | regex_v8 | 23.23 | 2.22 | 1.79 | 1.70 | 1.72 | 1.62 | 1.70 | 1.63 | **1.58** | | json_loads | 0.03 | 1.88 | 1.65 | 1.52 | 1.53 | 1.47 | 1.55 | 1.46 | **1.43** | | pidigits | 204.68 | 1.55 | 1.46 | 1.43 | 1.41 | **1.38** | 1.39 | 1.39 | 1.38 | | gc_traversal | 4.39 | 1.52 | 1.35 | 1.45 | 1.50 | 1.31 | **1.28** | 1.38 | 1.35 | | asyncio_tcp | 969.36 | nan | nan | nan | nan | 1.40 | 1.36 | 1.33 | **1.30** | | xml_etree_parse | 182.39 | 1.24 | 1.25 | 1.31 | 1.25 | 1.26 | **1.17** | 1.22 | 1.21 | | regex_effbot | 3.18 | 1.26 | 1.18 | 1.14 | 1.14 | 1.13 | 1.15 | 1.13 | **1.12** | | asyncio_tcp_ssl | 2924.68 | nan | nan | nan | nan | 1.12 | 1.12 | 1.11 | **1.10** | | 2to3 | 318.27 | 0.97 | 0.95 | 0.98 | 0.98 | 0.98 | 0.96 | **0.93** | 1.06 | | pickle | 0.01 | 1.19 | 1.11 | 1.09 | 1.08 | **1.06** | 1.09 | 1.08 | 1.06 | | unpickle | 0.01 | 1.13 | 1.07 | 1.05 | 1.05 | **1.04** | 1.05 | 1.06 | 1.05 | | pickle_list | 0.00 | 1.12 | 1.07 | 1.05 | 1.05 | 1.04 | 1.05 | 1.05 | **1.04** | | unpickle_list | 0.01 | 1.09 | 1.05 | 1.04 | 1.04 | 1.05 | 1.04 | **1.02** | 1.03 | | asyncio_websockets | 643.37 | 1.06 | 1.05 | 1.05 | nan | **1.00** | 1.01 | 1.01 | 1.01 | | regex_dna | 187.26 | 0.99 | 1.00 | 0.99 | 0.99 | 0.99 | 0.99 | **0.99** | 1.00 | | pickle_dict | 0.03 | 1.02 | 1.01 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | **1.00** | | python_startup | 11.26 | 0.98 | **0.95** | 1.00 | 0.97 | 0.96 | 0.95 | 0.95 | 0.98 | | python_startup_no_site | 8.62 | 0.98 | **0.95** | 0.95 | 0.99 | 0.95 | 0.95 | 0.96 | 0.97 | | bench_mp_pool | 14.86 | nan | nan | nan | nan | nan | nan | nan | nan | | bench_thread_pool | 1.22 | nan | nan | nan | nan | nan | nan | nan | nan | | dask | 500.97 | nan | nan | nan | nan | nan | nan | nan | nan | | django_template | 37.73 | nan | nan | nan | nan | nan | nan | nan | nan | | pickle_pure_python | 0.34 | nan | nan | nan | nan | nan | nan | nan | nan | | regex_compile | 150.42 | nan | nan | nan | nan | nan | nan | nan | nan | | unpickle_pure_python | 0.26 | nan | nan | nan | nan | nan | nan | nan | nan | Average factor (geometric mean, complete case only) ``` subopt 4.251947 map opt 4.499286 time opt 4.535526 tss opt 4.547399 key opt 4.908084 preshed opt 4.987763 sub 5.329236 v4.1.2 6.309396 dtype: float64 ```

I managed to adjust pyperformance so it uses line_profiler (with autoprofile) and ran it*. See https://github.com/notEvil/line_profiler/tree/sub/pyperformance and https://github.com/notEvil/pyperformance/tree/line_profiler if you want to do the same. Set up the Pipenv environment and run prepare.py to prepare an environment. Then run pyperformance in that environment.

The column baseline is the average time in ms, the others are scaling factors (e.g. 1.5 means 1.5 * baseline). Some benchmarks like 2to3 spawn subprocesses which explains some ~1 factors. ~~Didn't investigate the < 1 yet.~~ nevermind, should've used pyperf to load the json from the start.

Update 2023-11-06

Added the opt variants and sorted by average factor. Interestingly, the success of time opt indicates benchmarks where the entry check fails a lot. However, map opt is on par and doesn't add bias. So overall my take is: preshed is not worth it**, time opt might be and map opt, tss opt and key opt are. ~~What is still strange is that sub and key opt are so different although they use a very similar approach.~~ (sub used id() while key opt a type cast. With type cast, sub is a lot closer to key opt now).

* on Intel 4570S with 3 of them in parallel (1 idle core). Most nan are due to conflicts (e.g. bind to port). ** not better than key opt

pyutils / line_profiler

Different approach with a few benefits #230