Bad HTML parsing

Given the same HTML input, here is the tree each parser produces:

=== HTML ===
<li>
 one
 <div>
</li>
<li>
 two
</li>
=== parsel (lxml) (marginal interpretation) ===
<html><body><li>
 one
 <div>

<li>
 two
</li></div></li></body></html>
=== html.parser ===
<li>
 one
 <div>
 </div>
</li>
<li>
 two
</li>
=== lxml (same problem as parsel of course) ===
<html>
 <body>
  <li>
   one
   <div>
    <li>
     two
    </li>
   </div>
  </li>
 </body>
</html>
=== html5lib (Parses pages the same way a web browser does) ===
<html>
 <head>
 </head>
 <body>
  <li>
   one
   <div>
   </div>
  </li>
  <li>
   two
  </li>
 </body>
</html>

It is very annoying to scrape a page when the parser builds a different tree than a web browser would. It would be a good addition to provide a way to use a parser other than lxml.
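To show where the disagreement comes from, here is a small sketch that uses only the standard library's `html.parser.HTMLParser` to list the raw tag events, without building any tree. The class name `TagLogger` is made up for this illustration. The `<div>` is opened but never closed, so each tree builder has to decide for itself where it ends; that is the point where the outputs above diverge.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper: record start/end tag events, build no tree."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


html = '''<li>
 one
 <div>
</li>
<li>
 two
</li>'''

logger = TagLogger()
logger.feed(html)
logger.close()

# Print the tag events in document order; note there is no ('end', 'div').
for kind, tag in logger.events:
    print(kind, tag)
```

The event list contains a start event for `div` but no matching end event, which confirms the input itself is malformed rather than any one parser misreading valid markup.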

#!/usr/bin/env python

from parsel import Selector
from bs4 import BeautifulSoup

print('=== HTML ===')
html = '''<li>
 one
 <div>
</li>
<li>
 two
</li>'''
print(html)

print('=== parsel (lxml) (marginal interpretation) ===')
sel = Selector(text=html)
print(sel.extract())

print('=== html.parser ===')
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())

print('=== lxml (same problem as parsel of course) ===')
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())

print('=== html5lib (Parses pages the same way a web browser does) ===')
soup = BeautifulSoup(html, 'html5lib')
print(soup.prettify())

scrapy / parsel

Bad HTML parsing #57