notEvil opened 10 months ago
It would help to see the `python -m line_profiler <script>.lprof` results in addition to the neovim (heatmap?) visualization, to get a better sense of what is happening and what the difference is.
There you go
That looks promising, but the tracing seems to have a 4x overhead, judging by the `return` statement in the base case (assuming it was hit 3 times in your cumulative version). While that's not too significant in this case, it can be very significant in tighter loops, so it would be worth testing this in a tight loop.
How do you infer a 4x overhead from that output? I just tested
```python
@profile
def f():
    for _ in range(int(1e7)):
        g()

@profile
def g():
    pass

f()
```
and it takes 22.37s with v4.0.3 and 18.53s with the new approach. The prototype runs the new approach in addition to the original one, so I temporarily disabled the original part for this crude benchmark.
Ah, I looked at the `return` statement in the base case: it looked like it took 1.2us per hit in your version, whereas the original took 0.3us per hit. Very simple lines like a plain `return` are virtually instant in Python, so almost all of the time shown there is spent in the profiler.
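Spelling out the arithmetic behind that inference (per-hit numbers taken from the listings in this thread):

```python
# Per-hit times on the bare `return` line, read off the profiler output.
per_hit_new = 1.2e-6  # seconds per hit, new tracing approach
per_hit_old = 0.3e-6  # seconds per hit, original line_profiler

# A plain `return` is essentially free in Python, so nearly all of the
# per-hit time is profiler overhead; the ratio approximates the relative overhead.
overhead_ratio = per_hit_new / per_hit_old
print(round(overhead_ratio, 1))  # 4.0
```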
Here's a benchmark that shows around an 11x slowdown with the current line_profiler vs unprofiled code (total time 500ms unprofiled vs 5.4s profiled):
```
Time Elapsed tight: 494.96ms
Time Elapsed tight: 5390.15ms
Slowdown from profiling: 4895.19ms, 10.89x

Total time: 5.39021 s
File: tight.py
Function: main at line 3

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           def main():
     4         1          3.1      3.1      0.0      start = time.perf_counter()
     5  20000001    2628245.8      0.1     48.8      for x in range(20000000):
     6  20000000    2761886.3      0.1     51.2          y = x
     7         1         31.5     31.5      0.0      elapsed_ms = round((time.perf_counter()-start)*1000, 2)
     8         1         42.2     42.2      0.0      print(f"Time Elapsed tight: {elapsed_ms}ms")
     9         1          1.3      1.3      0.0      return elapsed_ms
```
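For reference, the `tight.py` script can be reconstructed from the listing above. This sketch adds a no-op `profile` fallback so it also runs without kernprof (the fallback is my addition, not part of the original script):

```python
import time

try:
    profile  # injected by kernprof when run as `kernprof -l tight.py`
except NameError:
    def profile(func):  # no-op fallback for a plain `python tight.py` run
        return func

@profile
def main():
    start = time.perf_counter()
    for x in range(20000000):
        y = x
    elapsed_ms = round((time.perf_counter() - start) * 1000, 2)
    print(f"Time Elapsed tight: {elapsed_ms}ms")
    return elapsed_ms

if __name__ == "__main__":
    main()
```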
Your code gives these results:
```
Time Elapsed tight: 498.77ms
Time Elapsed tight: 18367.76ms
Slowdown from profiling: 17868.989999999998ms, 36.83x
Wrote profile results to tight.py.lprof
Timer unit: 1e-06 s

Total time: 5.91982 s
File: /home/theel/Coding/line_profiler/line_profiler_cumulative/tight.py
Function: main at line 3

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           def main():
     4         1          1.7      1.7      0.0      start = time.perf_counter()
     5  20000001    2961819.0      0.1     50.0      for x in range(20000000):
     6  20000000    2957938.7      0.1     50.0          y = x
     7         1         18.6     18.6      0.0      elapsed_ms = round((time.perf_counter()-start)*1000, 2)
     8         1         39.2     39.2      0.0      print(f"Time Elapsed tight: {elapsed_ms}ms")
     9         1          0.6      0.6      0.0      return elapsed_ms
```
I believe the total elapsed time shown by line_profiler is incorrect, as I definitely waited at least 14 seconds for yours to finish (going by the clock). This indicates that your code has a ~37x slowdown, which is ~3x more overhead than the current line_profiler.
The thing to remember (which I forgot in my first comment here!) is that the times in line_profiler's output aren't actually the overhead: line_profiler calls `hpTimer()` to try to measure its own overhead and eliminate it from the output.
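A pure-Python sketch of that compensation idea (line_profiler does this in C; the helper names here are made up for illustration):

```python
import time

def estimate_timer_overhead(samples=100000):
    # Cost of one timer call, i.e. roughly the minimum overhead the
    # profiler adds per traced event even if the line itself does nothing.
    start = time.perf_counter_ns()
    for _ in range(samples):
        time.perf_counter_ns()
    return (time.perf_counter_ns() - start) / samples

def corrected_line_time(raw_ns, hits, per_event_overhead_ns):
    # Subtract the estimated tracing overhead from a line's raw total,
    # clamping at zero so noise can't produce negative times.
    return max(0.0, raw_ns - hits * per_event_overhead_ns)

overhead = estimate_timer_overhead()
print(corrected_line_time(raw_ns=5_000_000, hits=10_000,
                          per_event_overhead_ns=overhead))
```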
I fail to reproduce your results. Could you run them again with the new commit? I changed a few things so you can get a fair comparison with your benchmark.
This version is significantly better, but still about 1.5x slower:
```
Time Elapsed tight: 513.24ms
Time Elapsed tight: 9631.74ms
Slowdown from profiling: 9118.5ms, 18.77x
Wrote profile results to tight.py.lprof
Timer unit: 1e-06 s

Total time: 4.17773 s
File: /home/theel/Coding/line_profiler/line_profiler_cumulative/tight.py
Function: main at line 3

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     3                                           def main():
     4         1          1.7      1.7      0.0      start = time.perf_counter()
     5         1    2005221.9 2005221.9     48.0      for x in range(20000000):
     6         1    2172444.6 2172444.6     52.0          y = x
     7         1         21.0     21.0      0.0      elapsed_ms = round((time.perf_counter()-start)*1000, 2)
     8         1         38.0     38.0      0.0      print(f"Time Elapsed tight: {elapsed_ms}ms")
     9         1          0.7      0.7      0.0      return elapsed_ms
```
The weird thing is, my results are the complete opposite. v4.0.3 gives me "Time Elapsed tight: 12030.22ms" while the new commit gives "Time Elapsed tight: 6169.69ms" (triple checked). My process (Python 3.11.3):

```
pipenv --rm
pipenv run pip install /path/to/repo
pipenv run kernprof -l ...
```
Ohhhhh, the issue is that I was using my fork from #203 instead of the current one. I can confirm that yours is faster than the version of line_profiler in this repo, but it would be slower if #203 were to be merged.
Great to hear :) I think most improvements can be carried over. I'll give it a try sometime next week
Nice, thanks! So if I understand correctly, performance is fine.

Regarding the differences to the current approach: the implementation at the bottom of `_line_profiler.pyx` is supposed to be readable. Just ignore the parts about `call`. If it's not clear, I can add comments pointing out the details.
The latest commit now counts the number of line hits and provides the times in a backwards compatible way.
```
Timer unit: 1e-06 s

Total time: 0.700474 s
File: /home/arappold/git/line_profiler/sub.py
Function: f at line 4

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     4                                           @profile
     5                                           def f():
     6         1     100057.6 100057.6     14.3      time.sleep(0.1)
     7         1     300195.1 300195.1     42.9      (g(1), g(2))
     8         1     300221.3 300221.3     42.9      g(3)

Total time: 1.00071 s
File: /home/arappold/git/line_profiler/sub.py
Function: g at line 11

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    11                                           @profile
    12                                           def g(n):
    13         9          2.4      0.3      0.0      if n == 0:
    14         3          0.5      0.2      0.0          return
    15
    16         6     600365.7 100060.9     60.0      time.sleep(0.1)
    17         6     400342.9  66723.8     40.0      g(n - 1)
```
"Total time" reported for g
is wrong, because it sums up cumulative times, which don't add up for recursive functions. Total times would, but the default output should remain cumulative times. So maybe remove the "% Time" column if it's a recursive function (which can be inferred from call stats)?
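A quick way to see why cumulative times can't simply be summed under recursion, using hypothetical numbers matching the `g(3)` call in the listing above:

```python
# Each recursion level sleeps 0.1s, so g(3) takes 0.3s of wall clock.
sleep_per_level = 0.1
depth = 3
wall_clock = depth * sleep_per_level  # 0.3s

# The recursive-call line `g(n - 1)` is charged the *remaining* recursion
# on each of its hits: 0.2s, 0.1s and 0.0s.
call_line_cumulative = sum(i * sleep_per_level for i in range(depth))  # 0.3s

# Summing per-line cumulative times therefore double-counts the recursion:
line_sum = wall_clock + call_line_cumulative  # 0.6s vs 0.3s of real time
assert line_sum > wall_clock
```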
An example of what call stats would allow: graph.dot.pdf, created with https://github.com/notEvil/line_profiler/blob/sub/graph.py. Edit: keep in mind that generators create "calls" on every next/yield, and therefore directly into sub functions.
I'm happy to announce the next iteration :) Getting useful numbers in case of recursive code was a lot more difficult than anticipated, but here they are nonetheless. The last commit adds:

- `block_times`: how much time is spent in each function
- `call_times`: how much time is spent in the functions called

It is impossible to derive `block_times` and `call_times` from the line timings or from call stats, since in the presence of recursion, cumulative times don't add up.
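To illustrate what `block_times` means, here is a sketch (my own, not the fork's implementation) that derives per-function "block" (own) time from a stream of call/return events:

```python
# Sketch: derive per-function "block" (own) time from call/return events.
# The event format (kind, func, timestamp) is assumed for illustration only.
events = [
    ("call", "f", 0.0),    # f starts
    ("call", "g", 1.0),    # f calls g after 1s of own work
    ("return", "g", 3.0),  # g did 2s of own work
    ("return", "f", 4.0),  # f does 1 more second of own work
]

block = {}   # own time per function
stack = []   # [func, timestamp_of_last_event] entries

for kind, func, t in events:
    if stack:
        # Charge the time since the last event to whoever is on top.
        top = stack[-1]
        block[top[0]] = block.get(top[0], 0.0) + (t - top[1])
        top[1] = t
    if kind == "call":
        stack.append([func, t])
    else:
        stack.pop()
        if stack:
            stack[-1][1] = t  # caller resumes now

print(block)  # {'f': 2.0, 'g': 2.0}
```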
Now the output for the initial example is
```
Timer unit: 1e-06 s

Total time: 0.700534 s
File: /home/arappold/git/line_profiler/example1.py
Function: f at line 4

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     4                                           @profile
     5                                           def f():
     6         1     100061.6 100061.6     14.3      time.sleep(0.1)
     7         1     300238.6 300238.6     42.9      (g(1), g(2))
     8         1     300233.5 300233.5     42.9      g(3)

Total time: 0.600456 s
File: /home/arappold/git/line_profiler/example1.py
Function: g at line 11

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    11                                           @profile
    12                                           def g(n):
    13         9          3.6      0.4      0.0      if n == 0:
    14         3          0.8      0.3      0.0          return
    15
    16         6     600380.3 100063.4    100.0      time.sleep(0.1)
    17         6     300267.2  50044.5     50.0      g(n - 1)
```
Thanks for this great work! I'm going to rebase my PR to your fork on this latest work, could you take a look at it once I do that? After this work of yours is merged, I can start working on adding CPython 3.12 support to line_profiler.
Thanks :) I would still consider it work-in-progress. The call stats feature is somewhat out of scope of line profiling, the naming breaks with former conventions (which differ from the more common "total" vs "cumulative" convention), some parts are pretty ugly, the default CLI output could make better use of the available information, and so on. And there is performance, which has to be worse than before (when both are equally optimized). It would be great if someone more experienced with the existing code base could review the fork and propose changes.
I just tested the performance, and it is indeed significantly worse than before. Previously, https://github.com/notEvil/line_profiler/pull/1 was able to pull it down to a 13x overhead over no profiling in the tight loop case from above, whereas now, with the same optimizations applied, it's a 16x overhead over no profiling. That being said, I believe I can optimize it further, as some of your data structure choices may be suboptimal performance-wise.
@notEvil thanks for sharing this! Many years ago I wrote a line profiler for another dynamic language runtime, and I think I used a similar call/return stacking approach.

If you get a chance, would you mind taking a look at the merge conflicts pulling in the latest `main` after the recent autoprofiler changes? Would love to try the two together.
@tmm1 sorry for not responding! I will check the conflicts soon, I promise ^^
I've pushed a new commit* which should improve confidence in the solution. The previous version was manually tested on a few examples, with manual inspection of the output. `graph.py` was very helpful, but obviously we'd want something automated with a lot more variation. So I started the journey of implementing strategies for generating code with Hypothesis**, implementing a reference profiler, and using it to validate the numbers. It took a while to work as I wanted it to (statements spanning multiple lines, recursion, generators, async, ...), to look reasonably good, and eventually for the tests to pass. So my apologies for keeping this private and not providing any updates.
The testing is not perfect:

- `time.sleep`
What I like:
So now that it "works" and won't get any more complex, I'd say it's a good time to review and benchmark.
https://github.com/notEvil/line_profiler/tree/sub
\* changes:

- `graph.py` to show the additional information
- `LineTime` and `Timing` to `CLineProfile` and `LineProfile` respectively
- `CBlockProfile` and `CCallProfile`, and classes `BlockProfile` and `CallProfile`

\*\* If you don't know Hypothesis yet, I'd definitely recommend checking it out!
@tmm1 just merged main and, except for trivial conflicts, it went smoothly. The randomized tests still pass :) Beware though that `line_profiler.py` isn't adjusted yet and might show unexpected things, especially aggregates.
@Theelx you created a PR back then for better performance and comparisons. I didn't merge it because I didn't want to mix the two proposals, and it looked like it was based off https://github.com/pyutils/line_profiler/pull/203 but without a git reference (merge). Would you consider the PR mature enough to merge with `sub` on a test branch?
Just a quick update: I tried to measure and improve performance and got some interesting results. The benchmark is https://gitlab.com/notEvil/code_collapse applied to itself which spends most of the time in Black and the remaining time in nested generators (tree traversal) and LibCST.
| name | Python 3.11 | vs baseline | Python 3.12 | vs baseline |
|---|---|---|---|---|
| baseline | 16.1 | | 34.3 | |
| v4.1.2 | 27.2 | 168.9% | 178.6 | 520.7% |
| sub | 28.2 | 175.2% | 167.1 | 487.2% |
| key opt | 28 | 173.9% | 161.6 | 471.1% |
| tss opt* | 27.6 | 171.4% | | |
| preshed opt* | 27.2 | 168.9% | | |
| map opt* | 26.3 | 163.4% | | |
| all opt | 25.2 | 156.5% | 150.9 | 439.9% |
- `address of bytecode << 24 + line number` as key (with lower bits of the address as offset, for some randomness)
- `threading.get_ident()`
- `parallel_flat_hash_map` and `flat_hash_map` from parallel-hashmap instead of `unordered_map`
As you can see, I wasn't able to significantly improve the performance of sub. Maybe time opt is worth the tradeoff (it creates some bias); what do you think?
I will try to figure out why Python 3.12 performs so badly compared to 3.11 on my project. Regarding the bad performance when line profiling on 3.12: can you reproduce this? e.g. with jsonpickle
Update 2023-11-04
Python 3.12 is a bit slower than 3.11 in general, but the large difference is due to Black not compiling for 3.12 (due to a bug in Mypyc). So the benchmark is flawed insofar as it presents a measure of slowdown that is not representative (most time is spent in compiled code, which doesn't add any overhead). Maybe we could run pyperformance somehow.
Also, regarding key opt: it turns out it is not uncommon for short functions to have the exact same bytecode (`get_metadata` in `pyperf`, and at least two in `re`) and therefore the same hash. So sub is currently broken (https://github.com/notEvil/line_profiler/blob/f0c17516e1012dd99e9e161a6a5057a97b60a02b/line_profiler/_line_profiler.pyx#L593), and the current line_profiler is affected if the line ranges of such functions intersect. key opt would solve it.
* either directly from https://github.com/pyutils/line_profiler/pull/203 or inspired by
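The identical-bytecode situation described above is easy to reproduce: two methods defined in different classes, on different lines, can compile to byte-for-byte identical bytecode:

```python
class A:
    def get(self):
        return self._value

class B:
    def get(self):
        return self._value

# Distinct code objects, defined on different lines...
assert A.get.__code__ is not B.get.__code__
# ...yet the raw bytecode is identical, so any key derived only from the
# bytecode contents conflates the two functions.
assert A.get.__code__.co_code == B.get.__code__.co_code
```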
Good news I guess:

I managed to adjust pyperformance so it uses line_profiler (with autoprofile) and ran it*. See https://github.com/notEvil/line_profiler/tree/sub/pyperformance and https://github.com/notEvil/pyperformance/tree/line_profiler if you want to do the same. Set up the Pipenv environment and run `prepare.py` to prepare an environment. Then run pyperformance in that environment.
The column `baseline` is the average time in ms; the others are scaling factors (e.g. 1.5 means 1.5 * baseline). Some benchmarks like 2to3 spawn subprocesses, which explains some ~1 factors. Didn't investigate the factors < 1 yet. (Nevermind, I should've used pyperf to load the json from the start.)
Update 2023-11-06
Added the opt variants and sorted by average factor. Interestingly, the success of time opt indicates benchmarks where the entry check fails a lot. However, map opt is on par and doesn't add bias. So overall my take is: preshed is not worth it**, time opt might be, and map opt, tss opt and key opt are. What is still strange is that sub and key opt are so different, although they use a very similar approach. (sub used `id()` while key opt used a type cast. With the type cast, sub is a lot closer to key opt now.)
\* on an Intel 4570S with 3 of them in parallel (1 idle core). Most nan are due to conflicts (e.g. bind to port).
\*\* not better than key opt
Hi,
I've been tinkering with line_profiler and implemented a different approach to line profiling:
https://github.com/notEvil/line_profiler/tree/sub
Essentially, it keeps track of "calls" (`PyTrace_CALL` and `PyTrace_RETURN` are used for generators and async as well) and by doing so records "total time": time spent within a function call, generator or async step, excluding time spent in line-profiled functions that were called. From there it calculates "cumulative time".

This approach has multiple benefits:
Potential downside is increased overhead, but in my testing it wasn't much slower.
I've also created a Neovim plugin (not published yet) to see the differences:
Tested on a real application (40s runtime, no async)