html5lib no quicker or slower than CPython

GoogleCodeExporter commented 8 years ago

Using html5lib tip (http://code.google.com/p/html5lib), u-s is no quicker 
(in 2009Q3) or slower (in 2009Q4) than CPython at parsing the HTML 5 spec 
(http://www.whatwg.org/specs/web-apps/current-work/).

Attached is both test script and profile data (though I've manually 
removed the (long) path to my copy of html5lib for readability's sake).

Original issue reported on code.google.com by geoffers on 6 Jan 2010 at 6:30

Blocked on: #86, #88, #113, #114

Attachments:

GoogleCodeExporter commented 8 years ago

You are running your benchmark under cProfile.  This will enable tracing and 
will
disable the use of JITted code.  Could you try again and just produce one time 
for
the entire run without using cProfile?  Try using time.time().

If you want to give Unladen more of an advantage, try running the benchmark 
once and
then timing a second run.  This will give the code a chance to "warm up" but 
probably
won't be directly relevant to your real-world use unless you want to write a 
daemon
that never shuts down.

Original comment by abbeyj on 6 Jan 2010 at 6:44

GoogleCodeExporter commented 8 years ago

Done.

Original comment by geoffers on 6 Jan 2010 at 6:55

Attachments:

profileb.txt

GoogleCodeExporter commented 8 years ago

Is http://svn.whatwg.org/webapps/index the input data you're using?

Original comment by collinw on 7 Jan 2010 at 12:39

Added labels: Performance

GoogleCodeExporter commented 8 years ago

I whipped up a benchmark based on html5lib that integrates with perf.py, 
parsing 
the full HTML 5 spec with html5lib. That benchmark indicates that Unladen is 
faster 
than CPython 2.6.1, but only after the first run.

http://tinyurl.com/ydomq7t is a timeline of iteration times for CPython 2.6 vs 
Unladen trunk@head, each parsing the HTML 5 spec 10 times. CPython wins on the 
first iteration, but after that, Unladen's performance appears to improve over 
time. 
Note that this benchmark intentionally isn't doing any priming; I want us to be 
penalized appropriately for that first slow run, since this isn't a daemon 
process 
we're talking about.

I still think it's a problem that we're only a second faster than CPython. I 
need to do 
some profiling to see why, but the time Unladen spends blocked on compilation 
is 
certainly a factor.

Original comment by collinw on 7 Jan 2010 at 7:59

Changed state: Accepted

GoogleCodeExporter commented 8 years ago

The provided profiling information indicates a problem with the hotness 
function:

We have a function with a total runtime of 14.6 sec which is not considered hot:
337750   14.619    0.000   62.482    0.000  html5lib/tokenizer.py:59(__iter__)

Original comment by joerg...@gmail.com on 7 Jan 2010 at 8:09

GoogleCodeExporter commented 8 years ago

Aha, this will be an excellent test case for the while-loop hotness patch!

Original comment by collinw on 7 Jan 2010 at 8:12

GoogleCodeExporter commented 8 years ago

Benchmark added in r970.

Original comment by collinw on 8 Jan 2010 at 12:35

GoogleCodeExporter commented 8 years ago

With my patch from http://codereview.appspot.com/167061

### html5lib ###
Min: 15.609000 -> 14.156000: 1.1026x faster
Avg: 16.285900 -> 15.096900: 1.0788x faster
Not significant
Stddev: 1.24404 -> 1.64767: 1.3245x larger

Original comment by abbeyj on 8 Jan 2010 at 12:53

GoogleCodeExporter commented 8 years ago

From the profile:
337750   14.619    0.000   62.482    0.000  html5lib/tokenizer.py:59(__iter__)

Looking at this more closely, I think this function (and functions like it) is 
a good 
candidate/motivation for pursuing on-stack replacement.

- This is a while-loop heavy generator.
- For a given run of html5lib, this function is only invoked once. The high 
call count 
comes from counting each .next() call on the generator object.
- Even if while loops are counted as hot (as in abbeyj's patch), this function 
won't be 
compiled because it's only called once. Despite this, the function still racks 
up a high 
hotness score of 1641014, more than enough to be compiled.

I know that the JVM's primary benefit from OSR was in benchmarks (not 
real-world 
code), but does Java have anything like generators that promote long-lived 
function 
"calls" like this?

We don't strictly need OSR; special-casing for generators might be enough. If 
we 
checked for hotness every time we reentered the generator, that would be enough.

Original comment by collinw on 8 Jan 2010 at 1:04

GoogleCodeExporter commented 8 years ago

Running abbeyj's patch through the new Misc/diff_hotness.py tool I added in 
r974 gives this:

### Newly-hot functions
    html5lib/tokenizer.py:59 (__iter__) now has 1641024 points

### Hotter functions
    html5lib/tokenizer.py:146 (consumeEntity) -> +14438 points
    html5lib/inputstream.py:405 (charsUntil) -> +55 points

### Colder functions
    html5lib/html5parser.py:1003 (startTagListItem) -> -9091 points
    html5lib/treebuilders/_base.py:131 (elementInScope) -> -7403 points
    html5lib/html5parser.py:1480 (endTagOther) -> -9091 points
    html5lib/treebuilders/_base.py:201 (elementInActiveFormattingElements) -> -5916 points

Example command: PYTHONPATH=lib/html5lib/ ../hg/fc-obj-inst/python.exe 
performance/bm_html5lib.py -n 2; the -n2 is necessary because of my 
above message.

Looks good. I'll look over the impact on other benchmarks.

Original comment by collinw on 8 Jan 2010 at 1:32

GoogleCodeExporter commented 8 years ago

Misc notes:

- html5parser.py:196(normalizedTokens) is another candidate for pseudo-OSR 
(same reasons as 
html5lib/tokenizer.py:59(__iter__)).
- str.startswith() should be converted to METH_ARG_RANGE (494557 calls per run; 
easy win).
- We should find a way to specialize isinstance() calls if a) the given object 
and type are monomorphic, b) type(given_object) 
is given_type.
- Inlining will help a lot; there are a number of single-line methods that show 
up as very hot that should be inlined into their 
callers.
- --with-instrumentation for bail sites needs to be extended to track frequency.

I'll file individual bugs for these in the morning.

Original comment by collinw on 8 Jan 2010 at 8:44

GoogleCodeExporter commented 8 years ago

r980 added while loops into the hotness model, meaning that 
"html5lib/tokenizer.py:59 (__iter__)" is now 
being compiled to machine code.

### html5lib ###
Min: 13.185609 -> 12.256459: 1.0758x faster
Avg: 13.876981 -> 12.997757: 1.0676x faster
Not significant
Stddev: 1.15007 -> 1.41818: 1.2331x larger
Timeline: http://tinyurl.com/y8aldmr

If you look at the timeline graph, r980 is about a second faster, just not on 
the first run (which increases 
the stddev, which lowers the significance score).

Original comment by collinw on 8 Jan 2010 at 10:43

GoogleCodeExporter commented 8 years ago

r982: --with-instrumentation now tracks bail site frequency.

Original comment by collinw on 12 Jan 2010 at 12:45

GoogleCodeExporter commented 8 years ago

Original comment by collinw on 12 Jan 2010 at 2:49

GoogleCodeExporter commented 8 years ago

Original comment by collinw on 12 Jan 2010 at 2:49

GoogleCodeExporter commented 8 years ago

r1029 added a separate no-warmup html5lib benchmark. The old html5lib benchmark 
is now called html5lib_warmup.

Original comment by collinw on 22 Jan 2010 at 10:11

GoogleCodeExporter commented 8 years ago

Reid implemented pseudo-OSR for generators in r1032.

### html5lib_warmup ###
Min: 12.435418 -> 12.430771: 1.0004x faster
Avg: 13.130518 -> 13.066833: 1.0049x faster
Not significant
Stddev: 1.45305 -> 1.15964: 1.2530x smaller
Timeline: http://tinyurl.com/ykyvzxn

Let's look at that first run in more detail:

### html5lib ###
Min: 17.347363 -> 16.364512: 1.0601x faster
Avg: 17.382957 -> 16.393608: 1.0603x faster
Significant (t=85.867780, a=0.95)
Stddev: 0.02557 -> 0.02596: 1.0154x larger
Timeline: http://tinyurl.com/yconbz5

My testing shows this shaving a full second off the html5lib runtime. Woot!

Current benchmarking vs CPython 2.6.4:

### html5lib ###
Min: 14.016869 -> 16.396508: 1.1698x slower
Avg: 14.059663 -> 16.447700: 1.1699x slower
Significant (t=-175.046714, a=0.95)
Stddev: 0.02967 -> 0.03132: 1.0555x larger
Timeline: http://tinyurl.com/ycy927u

This was using the new html5lib, not the old html5lib (now called 
html5lib_warmup), so it's not comparable to above 
results. We're still slower than CPython, but we're gaining ground.

Original comment by collinw on 22 Jan 2010 at 11:40

leepro / unladen-swallow

html5lib no quicker or slower than CPython #105