kovidgoyal / html5-parser

Fast C based HTML 5 parsing for python
Apache License 2.0

Benchmarks comparing to other parsers #1

Closed alanhamlett closed 7 years ago

alanhamlett commented 7 years ago

How does this compare to https://github.com/tbodt/htmlpyever in performance?

kovidgoyal commented 7 years ago

No idea, it would be interesting to benchmark. From a quick look at the source code, it seems to use the same basic concept -- parse the document and build the libxml2 tree in compiled code. However, it seems to be far less mature. It also uses Cython-generated C code rather than hand-rolled code for the lxml tree construction part, which is usually a bit slower. On the other hand, it uses a streaming API, while html5-parser first constructs the tree via gumbo and then duplicates it into lxml. Streaming APIs are usually a bit faster than duplicate construction.

alanhamlett commented 7 years ago

Would also be interested in benchmarks against:

kovidgoyal commented 7 years ago

pyquery is not (as far as I can tell) a parser. And beautifulsoup4 is pure Python, so it will be at least an order of magnitude slower than html5-parser.

htmlpyever and lxml.html.fromstring() are the only things that are likely to have comparable performance. And lxml.html does not use the HTML 5 parsing algorithm.

kovidgoyal commented 7 years ago

In fact, as far as I recall, bs4 recommends using html5lib for parsing.

jonathan-s commented 7 years ago

I would also like a comparison with lxml which has been around for years :)

kovidgoyal commented 7 years ago

lxml uses html5lib to do HTML 5 parsing. So comparing to it is exactly equivalent to comparing to html5lib with the lxml treebuilder, which is what the current benchmark does.

kovidgoyal commented 7 years ago

I have added some more comparisons to the benchmark script, output below:

Testing with HTML file of 5,956,815 bytes
Parsing 100 times with html5-parser
html5-parser took an average of: 0.389 seconds to parse it
Parsing 10 times with html5-parser-to-soup
html5-parser-to-soup took an average of: 3.248 seconds to parse it
Parsing 10 times with html5lib
html5lib took an average of: 13.499 seconds to parse it
Parsing 10 times with BeautifulSoup-with-html5lib
BeautifulSoup-with-html5lib took an average of: 12.661 seconds to parse it
Parsing 10 times with BeautifulSoup-with-lxml
BeautifulSoup-with-lxml took an average of: 3.643 seconds to parse it
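
For readers who want to reproduce numbers of this kind, here is a minimal sketch of a timing harness that prints output in the same shape as above. It is not the project's benchmark script; the helper names `average_parse_time` and `report` are hypothetical, and the stdlib-based `stdlib_parse` is only a stand-in so the sketch runs without third-party packages. In a real comparison you would pass each parser's own entry point as the `parse` callable.

```python
import time
from html.parser import HTMLParser


def average_parse_time(parse, data, repeat=10):
    """Return the mean wall-clock seconds of parse(data) over `repeat` runs."""
    start = time.perf_counter()
    for _ in range(repeat):
        parse(data)
    return (time.perf_counter() - start) / repeat


def report(name, parse, data, repeat=10):
    """Time one parser and print the result in the style of the output above."""
    print(f"Parsing {repeat} times with {name}")
    avg = average_parse_time(parse, data, repeat)
    print(f"{name} took an average of: {avg:.3f} seconds to parse it")
    return avg


def stdlib_parse(data):
    """Stand-in parser (assumption): feeds the markup through html.parser."""
    parser = HTMLParser()
    parser.feed(data)
    parser.close()


sample = "<html><body>" + "<p>hello</p>" * 1000 + "</body></html>"
report("stdlib-html.parser", stdlib_parse, sample, repeat=5)
```

Averaging over many runs, as the output above does (100 runs for the fast parser, 10 for the slow ones), reduces the noise in each figure.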

Results are below. They show how much faster html5-parser is than each specified parser.
There are two additional considerations: what the final tree is, and whether the parser supports
the HTML 5 parsing algorithm. The most apples-to-apples comparison is when the final tree is lxml and
the parser being compared to supports HTML 5 parsing. That case shows the largest speedup.
In all other cases, the speedup is smaller because of the overhead of building the final tree in Python
instead of C, because the compared parser does not use the HTML 5 parsing algorithm, or both.

Parser            |Tree              |Supports HTML 5   |Speedup (factor)  |
============================================================================
html5lib          |lxml              |yes               |35                |
soup+html5lib     |BeautifulSoup     |yes               |4                 |
soup+lxml.html    |BeautifulSoup     |no                |1                 |
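
The factors in the table can be recovered from the averages in the benchmark output. The sketch below assumes that the lxml row is measured against html5-parser and that the two BeautifulSoup rows are measured against html5-parser-to-soup. That pairing is an inference from the numbers, not something stated in the thread, and the variable names are illustrative.

```python
# Average seconds per parse, copied from the benchmark output above.
timings = {
    "html5-parser": 0.389,
    "html5-parser-to-soup": 3.248,
    "html5lib": 13.499,
    "BeautifulSoup-with-html5lib": 12.661,
    "BeautifulSoup-with-lxml": 3.643,
}

# (table row label, compared parser, assumed html5-parser baseline)
rows = [
    ("html5lib", "html5lib", "html5-parser"),
    ("soup+html5lib", "BeautifulSoup-with-html5lib", "html5-parser-to-soup"),
    ("soup+lxml.html", "BeautifulSoup-with-lxml", "html5-parser-to-soup"),
]

# Speedup = time of the compared parser / time of the baseline, rounded.
factors = {
    label: round(timings[other] / timings[base])
    for label, other, base in rows
}

for label, factor in factors.items():
    print(f"{label:<18}|{factor}")
```

Under that assumption the unrounded ratios are roughly 34.7, 3.9 and 1.1, which round to the 35, 4 and 1 shown in the table.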

Unfortunately, htmlpyever has no installation instructions, and given that I am not familiar with Rust, building it is a bridge too far for me. Contributions to the benchmarks are welcome.

alanhamlett commented 7 years ago

Thanks!

kovidgoyal commented 7 years ago

I just committed some code that doubles the speed when building BeautifulSoup trees.