I think this might be more of a problem with the way we decide if a function is hot. We need to take into account not just the number of calls to a function, but also how much time is spent in it. For example, if I write a fat for loop that spins for a long time, that's a good candidate for compilation. Obviously, that's trickier to implement, but we should think about going that way.
Original comment by reid.kle...@gmail.com
on 6 Jun 2009 at 12:53
Hi,

I once attended a lecture by one of the V8 developers, where they described a primitive method of identifying hot functions: basically, they set up a timer, and whenever it fired, they compiled the currently executing function. (I believe they said it was used in an early version of a JVM, but I can't remember if it was HotSpot or the embedded VM.) Have you considered following a similar strategy? Perhaps with the modification that you multiply your counter by some fixed value?
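(Editorial sketch, not from the original thread: a minimal Python-level illustration of that sampling idea, using a Unix profiling timer and a signal handler. The names HOT_THRESHOLD, sample_counts and hot_code_objects are hypothetical, and a real implementation would live inside the VM rather than in Python code.)

import collections
import signal

HOT_THRESHOLD = 10              # hypothetical: samples before a function is "hot"
sample_counts = collections.Counter()
hot_code_objects = set()

def _sample(signum, frame):
    # The code object that was executing when the timer fired gets one sample.
    code = frame.f_code
    sample_counts[code] += 1
    if sample_counts[code] >= HOT_THRESHOLD:
        hot_code_objects.add(code)  # candidate for compilation

# Deliver SIGPROF roughly every 10 ms of consumed CPU time and sample in the handler.
signal.signal(signal.SIGPROF, _sample)
signal.setitimer(signal.ITIMER_PROF, 0.01, 0.01)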
Oh, and on a similar note, have you considered using the @llvm.readcyclecounter() intrinsic? It seems like it might be a useful and portable way of doing timing. It's documented here: <http://llvm.org/docs/LangRef.html#int_readcyclecounter>

Please forgive me if I'm stating the obvious; apart from being an interested observer, I don't have an awful lot of experience in the field of VMs…
- Dan Villiom Podlaski Christiansen
Original comment by danchr
on 7 Jun 2009 at 9:16
I suppose the timer would work, and you could even make the modification that the timer updates a counter in the currently executing function, and when that reaches a threshold it's "hot".

Alternatively, a lower-overhead way to do this would be to increment the counter at the end of every basic block with the number of opcodes in the block. Then you have a rough estimate of time spent in the function. This is still pretty high overhead, though.
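(Editorial sketch, not from the thread: one way to see what "opcodes per basic block" would mean is to compute the block sizes statically from a code object's bytecode; the eval loop could then add a block's size to the function's counter whenever it finishes executing that block. This uses the modern dis module purely for illustration, opcode sets differ across CPython versions, and block_sizes is a made-up name.)

import dis

def block_sizes(code):
    """Rough opcode counts per basic block of a code object.

    A new block starts at every jump target; a block ends at any jump or
    return. The eval loop could add the finished block's size to the
    function's hotness counter as a cheap proxy for time spent in it.
    """
    block_enders = set(dis.hasjrel) | set(dis.hasjabs)
    block_enders.add(dis.opmap['RETURN_VALUE'])
    sizes, current = [], 0
    for instr in dis.get_instructions(code):
        if instr.is_jump_target and current:
            sizes.append(current)         # jump target: close the previous block
            current = 0
        current += 1
        if instr.opcode in block_enders:  # jump or return: close this block
            sizes.append(current)
            current = 0
    if current:
        sizes.append(current)
    return sizes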
The LLVM intrinsic isn't particularly useful because all of this bookkeeping about when to compile the function must be done in the bytecode interpreter eval loop. Cycle counters and time stamp counters are also notoriously fickle, and in the presence of multiple cores they can be non-monotonic. They're also not supported on all platforms.
Original comment by reid.kle...@gmail.com
on 7 Jun 2009 at 9:03
[Changing the issue title to be less prescriptive]
The timer is an interesting idea; I'm not wedded to the decorator approach. I care more that these kinds of functions are optimized.

That said, we need to keep in mind that when code like Spitfire is run in production, it will be running orders of magnitude more times than we're running it in our benchmarks. Production environments can afford to "prime" the functions, elevating them to "hot" status artificially so that they don't serve slow queries while Unladen Swallow is trying to figure out whether the function is hot or not.

Long-term, we can handle this with feedback-directed optimization at the Python level, allowing you to train the optimizers and preconfigure which functions are hot without manually tagging them. In the short term, though, the manual tags might serve as an easy-to-implement reference mark. The manual tags can also tell us what the upper bound on improvement is, without having to pay the human cost that -L2 imposes.

I'm open to suggestions.
Original comment by collinw
on 8 Jun 2009 at 6:38
Original comment by collinw
on 26 Jun 2009 at 5:00
The approach we'll probably go with is using elapsed ticks, rather than call count. The existing instrumentation in the Py_PROFILE_HOTNESS special build should be good enough to get started on this.
Original comment by collinw
on 22 Jul 2009 at 6:54
Original comment by collinw
on 29 Sep 2009 at 5:57
Original comment by collinw
on 9 Oct 2009 at 6:00
As of r862, the main loops for slowspitfire and ai now show up as hot:
cProfile data vs heuristic results
python performance/bm_spitfire.py -n 10 --disable_psyco --profile
================================================================================
36040740 function calls (36039831 primitive calls) in 26.656 CPU seconds

Ordered by: internal time

    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
        12   17.350    1.446   25.897    2.158  <spitfire_tmpl_o4>:11(main)
  36027420    5.639    0.000    5.639    0.000  {method 'append' of 'list' objects}
        32    2.908    0.091    2.908    0.091  {method 'join' of 'str' objects}

Old heuristic: 0 code objects deemed hot
New heuristic (v1): 1 code objects deemed hot
  <spitfire_tmpl_o4>:11 (main)  1001120
python performance/bm_ai.py -n 10 --profile
================================================================================
10415408 function calls in 12.973 CPU seconds

Ordered by: internal time

    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
    483852    6.601    0.000    8.446    0.000  bm_ai.py:25(permutations)
      1116    2.799    0.003   12.973    0.012  bm_ai.py:50(n_queens)
   4354560    1.562    0.000    1.562    0.000  bm_ai.py:64(<genexpr>)
   4354452    1.509    0.000    1.509    0.000  bm_ai.py:43(<genexpr>)
    483876    0.336    0.000    0.336    0.000  {range}

Old heuristic: 3 code objects deemed hot
  bm_ai.py:64 (<genexpr>)  483840
  bm_ai.py:43 (<genexpr>)  483828
  bm_ai.py:65 (<genexpr>)  25356
New heuristic (v1): 5 code objects deemed hot
  bm_ai.py:64 (<genexpr>)  4882848
  bm_ai.py:43 (<genexpr>)  4882728
  bm_ai.py:65 (<genexpr>)  298008
  bm_ai.py:50 (n_queens)  121080
  bm_ai.py:25 (permutations)  109720
There's still more to do, though, so I'm changing the issue title to be more generic. I'll follow up in a subsequent update with the problems we're seeing in the new heuristic.
Original comment by collinw
on 9 Oct 2009 at 9:28
The current (as of r862) hotness heuristic is: 100000 points == hot; call -> 10 points; for loop backedge -> 1 point.

This does not take into account while loop backedges, because those are implemented as JUMP_ABSOLUTE opcodes and it's not obvious that we want to count all JUMP_ABSOLUTE executions toward hotness.
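(Editorial aside: a toy Python model of that scoring, just to make the numbers concrete. The class and method names are made up; in Unladen Swallow the bookkeeping actually lives in the eval loop, not in Python code.)

HOTNESS_THRESHOLD = 100000   # points at which a code object is deemed hot
CALL_POINTS = 10             # added every time the function is called
BACKEDGE_POINTS = 1          # added on every for-loop backedge

class HotnessModel(object):
    """Toy model of the r862 heuristic (illustrative names only)."""

    def __init__(self):
        self.points = {}     # code object -> accumulated hotness points

    def note_call(self, code):
        return self._add(code, CALL_POINTS)

    def note_backedge(self, code):
        return self._add(code, BACKEDGE_POINTS)

    def _add(self, code, points):
        total = self.points.get(code, 0) + points
        self.points[code] = total
        return total >= HOTNESS_THRESHOLD   # True -> candidate for compilation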
Looking at the bm_ai.py results above, permutations() should be showing up as hotter than it is. Part of that is the while loop in permutations() that we're not taking into account. The other problem is that n_queens() and permutations() make heavy use of generator expressions, which don't count toward their parent's hotness. Those are the three separate genexpr lines you see the heuristics picking up. These should be dealt with by inlining them into their callers (issue 86).
We should have all JUMP_ABSOLUTE opcodes increment the hotness counter by 1. Alternatively, add a specialized WHILE_BACKEDGE opcode that increments the hotness count and then jumps to an absolute address (this option is more invasive).

I'm sure there's literature on this. Or maybe I should just look at the JVM and JS engines and use their ideas wholesale.
Original comment by collinw
on 9 Oct 2009 at 10:27
Can't you count only JUMP_ABSOLUTE opcodes that have a negative relative displacement?
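(Editorial sketch of what that check amounts to: an absolute jump whose target offset is lower than the jump's own offset is a backward jump, i.e. almost certainly a loop backedge. The dis-based helper below is illustrative only; backward_jump_offsets is a made-up name, and JUMP_ABSOLUTE is the opcode name from the CPython of that era, with newer interpreters using JUMP_BACKWARD instead.)

import dis

def backward_jump_offsets(code):
    """Offsets of absolute jumps in `code` that jump backwards.

    Only these backward (negative-displacement) jumps would count toward
    hotness, so forward JUMP_ABSOLUTEs don't inflate the score.
    """
    return [instr.offset
            for instr in dis.get_instructions(code)
            if instr.opname == 'JUMP_ABSOLUTE' and instr.argval < instr.offset]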
Original comment by abbeyj
on 9 Oct 2009 at 10:32
+1 for fixing while loops by looking for negative relative jumps. It may not matter for our benchmarks, but lots of benchmark code out there uses while loops.
Original comment by reid.kle...@gmail.com
on 18 Nov 2009 at 8:56
Original comment by collinw
on 6 Jan 2010 at 11:42
r974 added a tool useful for highlighting the impact of changes to the hotness function.
Original comment by collinw
on 8 Jan 2010 at 12:57
r978 added more tests for how the hotness model treats loops.
Original comment by collinw
on 8 Jan 2010 at 8:26
As of r980, we properly increment hotness based on loop backedges, rather than simulating it in the loop header. This adds while loops into the hotness model.
Important items:
html5lib:
### Newly-hot functions
html5lib/tokenizer.py:59 (__iter__) now has 1641024 points
2to3:
### Newly-hot functions
lib2to3/pgen2/tokenize.py:320 (generate_tokens) now has 159872 points
Changing bm_html5lib.py to do a priming run (to show off the maximum performance) yields this:
### html5lib ###
Min: 13.392465 -> 12.383962: 1.0814x faster
Avg: 13.690497 -> 12.732577: 1.0752x faster
Significant (t=3.928796, a=0.95)
Stddev: 0.32880 -> 0.43489: 1.3226x larger
Timeline: http://tinyurl.com/y8vskug
That one function is shaving a full second off the run time. Win!
Patch by James Abbatiello!
Original comment by collinw
on 13 Jan 2010 at 10:39
Original issue reported on code.google.com by collinw
on 5 Jun 2009 at 11:47