urule99 / jsunpack-n

Automatically exported from code.google.com/p/jsunpack-n
GNU General Public License v2.0

Evaluate the use of an alternative html parser for better performance #16

GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
I have been looking into how to speed up the HTML parsing and found this 
article comparing Python HTML parsers: 

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

According to the article, lxml is the fastest Python parser because it is 
essentially a thin Python binding to the underlying libxml2 and libxslt 
libraries. 

Looking further, the latest beta of BeautifulSoup 4.x supports lxml as its 
underlying parsing engine. 
I therefore think that switching jsunpack to lxml as the HTML parser would 
need only a small patch, something like this:
From (in html.py):

    import BeautifulSoup
    ...
    soup = BeautifulSoup.BeautifulSoup(data)
    soup.findAll(tag, attrib)

To:

    import bs4
    soup = bs4.BeautifulSoup(data)
    soup.find_all(tag, attrib)

(And tests/test_lxml.py contains a sample of how to use lxml as a bs4.builder)
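
To make the idea above a bit more concrete, here is a minimal sketch of how the parser could be selected explicitly instead of relying on whatever bs4 picks by default. It is not taken from jsunpack-n: the helper names pick_parser and find_tags are hypothetical, and the fallback to the standard library's "html.parser" when lxml is missing is my own assumption about what would be convenient.

```python
# Sketch only: pick_parser() and find_tags() are hypothetical helpers,
# not part of jsunpack-n or bs4.
try:
    import bs4
except ImportError:  # bs4 is an optional third-party dependency
    bs4 = None


def pick_parser():
    """Return "lxml" if lxml is importable, else the stdlib "html.parser"."""
    try:
        import lxml  # noqa: F401  (only checking availability)
        return "lxml"
    except ImportError:
        return "html.parser"


def find_tags(data, tag, attrib=None):
    """Parse `data` and return every `tag` element matching `attrib`."""
    if bs4 is None:
        raise RuntimeError("bs4 (BeautifulSoup 4) is not installed")
    # The second argument names the tree builder bs4 should use.
    soup = bs4.BeautifulSoup(data, pick_parser())
    return soup.find_all(tag, attrib or {})
```

With this shape, a call such as find_tags(data, "script", {"src": "a.js"}) would take the place of the soup.findAll(tag, attrib) call, and the choice of lxml stays in one spot if it ever needs to be changed.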

What do you think?

Regards

Ali

Original issue reported on code.google.com by ali.iki...@gmail.com on 20 Jul 2011 at 6:11

GoogleCodeExporter commented 9 years ago
Ali, thanks for the suggestion! I'll be testing this to see whether I want to 
integrate it.

Original comment by urul...@gmail.com on 25 Jul 2011 at 2:33

GoogleCodeExporter commented 9 years ago
Added support for BeautifulSoup v4 with built-in lxml support. It makes a huge 
performance difference.

Original comment by ali.iki...@gmail.com on 29 Oct 2011 at 5:22