leepro / unladen-swallow

Automatically exported from code.google.com/p/unladen-swallow

Tune hotness function #48

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
We should add a sys.optimize decorator so that known-important functions
don't have to hit the hotness threshold before we optimize them. This would
be particularly useful for the slowspitfire benchmark: the function that
does all the work will never be "hot" by our current heuristics.

Currently, if we force compilation via -L2, slowspitfire shows a ~10% gain
over 2009Q1, but -L2 hurts start-up time. A decorator like this is similar
to the way Spitfire uses Psyco (i.e., explicitly flagging functions for
optimization).
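
For illustration, here is a minimal sketch of how such a decorator might be used from application code. sys.optimize does not exist yet, and the marker attribute below is purely hypothetical; this only shows the explicit-flagging idea:

def optimize(func):
    # Hypothetical sketch only: a real implementation would tell the JIT to
    # compile func's code object immediately, bypassing the hotness threshold.
    func.__force_compile__ = True  # invented marker attribute
    return func

@optimize
def render(template, rows):
    # Stand-in for the slowspitfire workhorse: one long-running call that a
    # pure call-count heuristic never flags as "hot".
    return "".join(template.format(**row) for row in rows)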

Original issue reported on code.google.com by collinw on 5 Jun 2009 at 11:47

GoogleCodeExporter commented 9 years ago
I think this might be more of a problem with the way we decide if a function is hot. We need to take into account not just the number of calls to a function, but also how much time is spent in it. For example, if I write a fat for loop that spins for a long time, that's a good candidate for compilation. Obviously, that's trickier to implement, but we should think about going that way.

Original comment by reid.kle...@gmail.com on 6 Jun 2009 at 12:53

GoogleCodeExporter commented 9 years ago
Hi,

I once attended a lecture by one of the V8 developers, where they described a primitive method of identifying hot functions: basically, they set up a timer, and whenever it fired, they compiled the executing function. (I believe they said it was used in an early version of a JVM, but I can't remember if it was HotSpot or the embedded VM.) Have you considered following a similar strategy? Perhaps with the modification that you multiply your counter by some fixed value?
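
To make the idea concrete, a minimal pure-Python sketch of that sampling scheme (the interval and threshold are made-up values, and a real VM would do this inside the runtime rather than with a signal handler):

import collections
import signal

HOT_THRESHOLD = 50  # illustrative value, not tuned

sample_counts = collections.Counter()

def _on_timer(signum, frame):
    # Whatever code object is executing when the timer fires earns a point.
    if frame is None:
        return
    code = frame.f_code
    sample_counts[code] += 1
    if sample_counts[code] == HOT_THRESHOLD:
        print("would compile:", code.co_name)

signal.signal(signal.SIGPROF, _on_timer)
signal.setitimer(signal.ITIMER_PROF, 0.01, 0.01)  # fire every 10ms of CPU time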

Oh, and on a similar note, have you considered using the @llvm.readcyclecounter() intrinsic? It seems like it might be a useful and portable way of doing timing. It's documented here:
<http://llvm.org/docs/LangRef.html#int_readcyclecounter>

Please forgive me if I'm stating the obvious; apart from being an interested observer, I don't have an awful lot of experience in the field of VMs…

- Dan Villiom Podlaski Christiansen

Original comment by danchr on 7 Jun 2009 at 9:16

GoogleCodeExporter commented 9 years ago
I suppose the timer would work, and you could even make the modification that the timer updates a counter in the currently executing function, and when that reaches a threshold it's "hot".

Alternatively, a lower overhead way to do this would be to increment the counter at the end of every basic block with the number of opcodes in the block. Then you have a rough estimate of time spent in the function. This is still pretty high overhead, though.
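
As a toy illustration of that accounting (BasicBlock, the threshold, and the compile hook are inventions of this sketch, not anything in the tree; the real bookkeeping would live in the eval loop):

import collections

HOT_THRESHOLD = 100000
hotness = collections.Counter()

class BasicBlock(object):
    def __init__(self, name, opcode_count):
        self.name = name
        self.opcode_count = opcode_count

def account_block(code_name, block):
    # Charge the enclosing function with the block's opcode count as the
    # block's last opcode retires; a rough proxy for time spent in it.
    hotness[code_name] += block.opcode_count
    return hotness[code_name] >= HOT_THRESHOLD

# A loop body of 12 opcodes executed 10,000 times crosses the threshold.
body = BasicBlock("loop_body", 12)
for _ in range(10000):
    if account_block("my_func", body):
        break  # here the JIT would be asked to compile my_func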

The LLVM intrinsic isn't particularly useful because all of this bookkeeping about when to compile the function must be done in the bytecode interpreter eval loop. Cycle counters and time stamp counters are also notoriously fickle, and in the presence of multiple cores they can be non-monotonic. The intrinsic also isn't supported on all platforms.

Original comment by reid.kle...@gmail.com on 7 Jun 2009 at 9:03

GoogleCodeExporter commented 9 years ago
[Changing the issue title to be less prescriptive]

The timer is an interesting idea; I'm not wedded to the decorator approach. I care more that these kinds of functions are optimized.

That said, we need to keep in mind that when code like Spitfire is run in production, it will be running orders of magnitude more times than we're running it in our benchmarks. Production environments can afford to "prime" the functions, elevating them to "hot" status artificially so that they don't serve slow queries while Unladen Swallow is trying to figure out if the function is hot or not.
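
A minimal sketch of what such priming might look like in server start-up code; the function names and iteration count here are illustrative, not part of any real API:

def prime(funcs, iterations=20000):
    # Run the known-important entry points enough times that the runtime's
    # hotness counters cross the threshold before real requests arrive.
    for func in funcs:
        for _ in range(iterations):
            func()

def render_empty_template():
    # Stand-in for the Spitfire template entry point being warmed up.
    return ""

prime([render_empty_template])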

Long-term, we can handle this with feedback-directed optimization at the Python level, allowing you to train the optimizers and preconfigure which functions are hot without manually tagging them. In the short term, though, the manual tags might serve as an easy-to-implement reference mark. The manual tags can also tell us what the upper bound on improvement is, without having to pay the start-up cost that -L2 imposes.

I'm open to suggestions.

Original comment by collinw on 8 Jun 2009 at 6:38

GoogleCodeExporter commented 9 years ago
The approach we'll probably go with is using elapsed ticks, rather than call count. The existing instrumentation in the Py_PROFILE_HOTNESS special build should be good enough to get started on this.

Original comment by collinw on 22 Jul 2009 at 6:54

GoogleCodeExporter commented 9 years ago
As of r862, the main loops for slowspitfire and ai now show up as hot:

cProfile data vs heuristic results

python performance/bm_spitfire.py -n 10 --disable_psyco --profile
============================================================================

         36040740 function calls (36039831 primitive calls) in 26.656 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       12   17.350    1.446   25.897    2.158 <spitfire_tmpl_o4>:11(main)
 36027420    5.639    0.000    5.639    0.000 {method 'append' of 'list' objects}
       32    2.908    0.091    2.908    0.091 {method 'join' of 'str' objects}

Old heuristic: 0 code objects deemed hot

New heuristic (v1): 1 code objects deemed hot
        <spitfire_tmpl_o4>:11 (main)    1001120

python performance/bm_ai.py -n 10 --profile
============================================================================

         10415408 function calls in 12.973 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   483852    6.601    0.000    8.446    0.000 bm_ai.py:25(permutations)
     1116    2.799    0.003   12.973    0.012 bm_ai.py:50(n_queens)
  4354560    1.562    0.000    1.562    0.000 bm_ai.py:64(<genexpr>)
  4354452    1.509    0.000    1.509    0.000 bm_ai.py:43(<genexpr>)
   483876    0.336    0.000    0.336    0.000 {range}

Old heuristic: 3 code objects deemed hot
        bm_ai.py:64 (<genexpr>)     483840
        bm_ai.py:43 (<genexpr>)     483828
        bm_ai.py:65 (<genexpr>)      25356

New heuristic (v1): 5 code objects deemed hot
        bm_ai.py:64 (<genexpr>)     4882848
        bm_ai.py:43 (<genexpr>)     4882728
        bm_ai.py:65 (<genexpr>)      298008
        bm_ai.py:50 (n_queens)       121080
        bm_ai.py:25 (permutations)   109720

There's still more to do, though, so I'm changing the issue title to be more generic. I'll follow up in a subsequent update with problems with the new heuristic.

Original comment by collinw on 9 Oct 2009 at 9:28

GoogleCodeExporter commented 9 years ago
The current (as of r862) hotness heuristic is

100000 points == hot; call -> 10 points; for loop backedge -> 1 point.

This does not take into account while loop backedges, because those are implemented as JUMP_ABSOLUTE opcodes and it's not obvious that we want to count all JUMP_ABSOLUTE executions toward hotness.
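
For reference, a small pure-Python model of this heuristic, handy for eyeballing which code objects a workload would promote. It approximates a backedge as a backwards 'line' trace event, so unlike the r862 version it also counts while-loop backedges:

import collections
import sys

HOT_THRESHOLD = 100000   # points needed before a code object is "hot"
CALL_POINTS = 10         # points per call
BACKEDGE_POINTS = 1      # points per loop backedge

points = collections.Counter()
_last_line = {}

def _tracer(frame, event, arg):
    code = frame.f_code
    if event == "call":
        points[code] += CALL_POINTS
        _last_line[frame] = frame.f_lineno
    elif event == "line":
        if frame.f_lineno < _last_line.get(frame, frame.f_lineno):
            points[code] += BACKEDGE_POINTS  # backwards jump ~= loop backedge
        _last_line[frame] = frame.f_lineno
    elif event == "return":
        _last_line.pop(frame, None)
    return _tracer

def hot_code_objects():
    return [(code, p) for code, p in points.most_common() if p >= HOT_THRESHOLD]

# Usage: sys.settrace(_tracer); run the workload; sys.settrace(None);
# then inspect hot_code_objects().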

Looking at the bm_ai.py results above, permutations() should be showing up as hotter than it is. Part of that is the while loop in permutations() that we're not taking into account. The other problem is that n_queens() and permutations() make heavy use of generator expressions, which don't count toward their parent's hotness. Those are the three separate genexpr lines you see the heuristics picking up. These should be dealt with by inlining them into their callers (issue 86).

We should have all JUMP_ABSOLUTE opcodes increment the hotness counter by 1. Alternatively, add a specialized WHILE_BACKEDGE opcode that increments the hotness count and then jumps to an absolute address (this option is more invasive).

I'm sure there's literature on this. Or maybe I should just look at the JVM and JS engines and use their ideas wholesale.

Original comment by collinw on 9 Oct 2009 at 10:27

GoogleCodeExporter commented 9 years ago
Can't you count only JUMP_ABSOLUTE opcodes that have a negative relative displacement?
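
A quick sketch of that test using the dis module. Opcode names vary by Python version (JUMP_ABSOLUTE in the 2.x bytecode discussed here, JUMP_BACKWARD in modern CPython), so this simply checks whether a jump's target lies behind the jump itself:

import dis

def backward_jumps(func):
    """Return (offset, target) pairs for backwards jumps in func's bytecode."""
    edges = []
    for instr in dis.get_instructions(func):
        if instr.opcode in dis.hasjabs + dis.hasjrel:
            target = instr.argval  # resolved absolute target offset
            if isinstance(target, int) and target < instr.offset:
                edges.append((instr.offset, target))
    return edges

def count_up(n):
    i = 0
    while i < n:  # this while loop's backedge should be detected
        i += 1
    return i

print(backward_jumps(count_up))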

Original comment by abbeyj on 9 Oct 2009 at 10:32

GoogleCodeExporter commented 9 years ago
+1 for fixing while loops via looking for negative relative jumps. It may not matter for our benchmarks, but lots of benchmark code out there uses while loops.

Original comment by reid.kle...@gmail.com on 18 Nov 2009 at 8:56

GoogleCodeExporter commented 9 years ago
r974 added a tool useful for highlighting the impact of changes to the hotness function.

Original comment by collinw on 8 Jan 2010 at 12:57

GoogleCodeExporter commented 9 years ago
r978 added more tests for how the hotness model treats loops.

Original comment by collinw on 8 Jan 2010 at 8:26

GoogleCodeExporter commented 9 years ago
As of r980, we properly increment hotness based on loop backedges, rather than simulating it in the loop header. This adds while loops into the hotness model.

Important items:

html5lib:
### Newly-hot functions
       html5lib/tokenizer.py:59 (__iter__) now has 1641024 points

2to3:
### Newly-hot functions
       lib2to3/pgen2/tokenize.py:320 (generate_tokens) now has 159872 points

Changing bm_html5lib.py to do a priming run (to show off the maximum performance) yields this:

### html5lib ###
Min: 13.392465 -> 12.383962: 1.0814x faster
Avg: 13.690497 -> 12.732577: 1.0752x faster
Significant (t=3.928796, a=0.95)
Stddev: 0.32880 -> 0.43489: 1.3226x larger
Timeline: http://tinyurl.com/y8vskug

That one function is shaving a full second off the run time. Win!

Patch by James Abbatiello!

Original comment by collinw on 13 Jan 2010 at 10:39