6% regression in Django benchmark

GoogleCodeExporter commented 9 years ago

Django is showing a 6% performance regression against our 2009Q1 release.
Jeffrey's theory is that Django may be throwing a lot of exceptions
internally, which is something we've probably slowed down.

2009Q1 vs trunk@r610 (-L2; gcc 4.3.1; 64-bit)
perf.py options: --args=,-L2 -r

django:
Min: 1.141 -> 1.215: 6.05% slower
Avg: 1.150 -> 1.231: 6.59% slower
Significant (t=-25.952955, a=0.95)

Original issue reported on code.google.com by collinw on 8 Jun 2009 at 6:08

GoogleCodeExporter commented 9 years ago

The exception idea is a good bet.  Django's variable resolution uses the same 
syntax
for attribute lookup, index lookup, and key lookup, this means that each nested
resolution probably averages 1-3 exceptions, plus any other exceptions in 
template
rendering.

Original comment by alex.gay...@gmail.com on 8 Jun 2009 at 6:14

GoogleCodeExporter commented 9 years ago

So I added code to gather performance measurements for exceptions in r620, and 
we
were surprised to find that Django doesn't actually throw any exceptions while
rendering templates.  It seems that they use getattr with a default and then
introspect on the result to decide how to handle it.

That was a good bet, but now we need a new one.

Original comment by reid.kle...@gmail.com on 15 Jun 2009 at 6:58

GoogleCodeExporter commented 9 years ago

Django will throw plenty of exceptions in nested variable resoltuion:
http://code.djangoproject.com/browser/django/trunk/django/template/__init__.py#L
701 .
 However as it turns out that doesn't occur anywhere in the Unladen Swallow
benchmark.  Which is helpful in that it eliminates that as a possible source of 
the
regression, however it may mean that real world templates (which will often 
include
nested lookups) have even worse regressions.  I'll try to look into this later 
today.

Original comment by alex.gay...@gmail.com on 15 Jun 2009 at 7:03

GoogleCodeExporter commented 9 years ago

Huh.  I had no idea we had that problem with our benchmarks.  However, we did 
find
that exceptions aren't actually any slower in LLVM than they are in the eval 
loop, so
I don't think that the nested lookup should be any slower.

Original comment by reid.kle...@gmail.com on 15 Jun 2009 at 7:44

GoogleCodeExporter commented 9 years ago

Latest theory: Django templates are using eval or exec during their execution, 
which 
would incur JIT compilation costs under "-j always -O2" (formerly "-L2"). This 
turns out 
not to be the case, as the benchmark doesn't use eval or exec during the timed 
loop.

Original comment by collinw on 18 Jun 2009 at 2:06

GoogleCodeExporter commented 9 years ago

Running PyBench to find an localized regressions:

perf.py args: --verbose -r -b pybench --args ",-j always -O2" branches/release-
2009Q1-maint/python trunk/python

Interesting localized regressions that may be impacting Django:
ListSlicing:  1107ms  1800ms  -38.5%  1120ms  2010ms  -44.3%
SmallTuples:   899ms  1068ms  -15.8%   932ms  1184ms  -21.3%

Original comment by collinw on 18 Jun 2009 at 6:10

GoogleCodeExporter commented 9 years ago

I've manually traced the method calls and neither list slicing nor tuple 
handling
appear to be particularly stressed by the most called code paths (the inner loop
obviously):

http://code.djangoproject.com/browser/django/trunk/django/template/__init__.py#L
824
http://code.djangoproject.com/browser/django/trunk/django/template/__init__.py#L
794
http://code.djangoproject.com/browser/django/trunk/django/template/__init__.py#L
543
http://code.djangoproject.com/browser/django/trunk/django/template/defaultfilter
s.py#L379

I take in there weren't any other micro benchmarks that showed similar large
performance gaps?

Original comment by alex.gay...@gmail.com on 18 Jun 2009 at 7:34

GoogleCodeExporter commented 9 years ago

The problem isn't list slicing or tuple handling per se: it's the 
UNPACK_SEQUENCE 
opcode, which is used for, e.g., "a, b, c, d, e = range(5)". It's lines like 
this that are 
hurting those microbenchmarks.

Looking at 
http://code.djangoproject.com/browser/django/trunk/django/template/__init__.py#
L543, I see two interesting for loops, one at L562 and an inner loop at L564. 
These 
both use UNPACK_SEQUENCE to unpack the arguments in the "for func, args in 
self.filters", etc, lines.

I'm going to experiment with compiling _PyEval_UnpackIterable() with Clang so 
that 
LLVM can analyze it.

Original comment by collinw on 18 Jun 2009 at 8:12

Changed state: Started

GoogleCodeExporter commented 9 years ago

That's probably a good bet.  The UNPACK_SEQUENCE bytecode will be evaluated 
22500
times for the loop at 562, and never for the loop at 564 (it's an empty list 
every
single time through).

Original comment by alex.gay...@gmail.com on 18 Jun 2009 at 8:35

GoogleCodeExporter commented 9 years ago

I think I figured out what the issue is (again ;) ).  Currently there is no LLVM
definition for the IMPORT_NAME opcode, however this function:

http://code.djangoproject.com/browser/django/trunk/django/template/defaultfilter
s.py#L379

Which is in the inner loop (get's called 22500 times) uses an IMPORT_NAME, 
which I'm
guessing means it never get's JIT'd.  This means that it's a) only ever 
interpretted
and b) causes 22500 switches bewteeen JIT and non-JIT code which probably 
causes the
performance issue (unless of course I totally misunderstodd how import is 
handled ;) ).

Original comment by alex.gay...@gmail.com on 23 Jun 2009 at 5:18

GoogleCodeExporter commented 9 years ago

In Unladen Swallow, IMPORT_NAME is a function, not an opcode, so there 
shouldn't be 
any barrier to LLVM-compiling the Django function.

The string_filter() decorator that's being wrapped around escape() might be 
worth taking 
a look at. I don't know that *args, **kwargs functions have been optimized. 
Jeffrey 
exposed the stack manipulation in CALL_FUNCTION to LLVM in r665, and it would 
be 
worth doing something similar for the other CALL_FUNCTION_* opcodes.

Original comment by collinw on 23 Jun 2009 at 5:26

GoogleCodeExporter commented 9 years ago

Whoops.  If no one get's to the CALL_FUNCTION_<VAR|KW|VAR_KW> thing by tonight 
I'll
try taking a whack at it.

Original comment by alex.gay...@gmail.com on 23 Jun 2009 at 5:40

GoogleCodeExporter commented 9 years ago

Interesting:

2009Q1 vs trunk@r669 (-j always -O2; 32-bit; gcc 4.0.1)
django:
Min: 1.088 -> 1.031: 5.52% faster
Avg: 1.090 -> 1.032: 5.56% faster
Significant (t=290.883794, a=0.95)
Stddev: 0.002 -> 0.001: 156.47% smaller

I need to see if this is a 32- vs 64-bit issue, or if the original observations 
were 
incorrect, or if we've fixed it, or what. I haven't been able to reproduce the 
original 
observations, even on the original configuration (gcc 4.3.1; 64-bit), so it may 
have 
been measurement error, or we may have fixed things in the interim.

There're definitely still improvements to be made with UNPACK_SEQUENCE and 
CALL_FUNCTION_*, though. I've sent a patch out for review to add an 
unpack_sequence benchmark, and something similar should be done for 
CALL_FUNCTION_*.

Original comment by collinw on 24 Jun 2009 at 11:44

Added labels: Priority-Medium
Removed labels: Priority-Critical

GoogleCodeExporter commented 9 years ago

Original comment by collinw on 13 Jul 2009 at 5:51

Changed state: Fixed

rheostat2718 / unladen-swallow

6% regression in Django benchmark #52