kovidgoyal / html5-parser

Fast C-based HTML 5 parsing for Python
Apache License 2.0

broken html if empty title #10

thestick613 closed this issue 6 years ago

thestick613 commented 6 years ago
In [1]: data = """<!DOCTYPE html><html>
   ...: <head>
   ...: <title></title>
   ...: </head>
   ...: <body>
   ...: <div align="center">
   ...: someting
   ...: </div>
   ...: </body>
   ...: </html>"""

In [2]: import html5_parser

In [3]: tree = html5_parser.parse(data, transport_encoding='utf8', treebuilder="lxml")

In [4]: import lxml

In [5]: lxml.etree.tostring(tree)
Out[5]: '<html><head>\n<title/>\n</head>\n<body>\n<div align="center">\nsometing\n</div>\n\n</body></html>'

In [6]: print lxml.etree.tostring(tree)
<html><head>
<title/>
</head>
<body>
<div align="center">
someting
</div>

</body></html>

This isn't right. Now my browser thinks that the rest of the HTML is part of the title.

kovidgoyal commented 6 years ago

That has nothing to do with html5-parser. html5-parser simply creates the tree; how it is serialized back to a string is not in its control. Note that lxml.etree.tostring by default produces XML output, not HTML.

thestick613 commented 6 years ago

My bad, I should have used lxml.html.tostring(tree). I'm migrating from BeautifulSoup, so things look a bit strange.
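
For anyone landing here with the same symptom, the difference between the two serializers can be reproduced with a minimal sketch like the one below. It assumes Python 3 and lxml only; the tree is built by hand (rather than with html5_parser.parse) purely so the snippet runs on its own, and the variable names are illustrative.

```python
from lxml import etree
import lxml.html

# Hand-built stand-in for the parsed tree (assumption: html5_parser is not
# needed to show the serialization difference, only an lxml element tree).
root = etree.Element("html")
head = etree.SubElement(root, "head")
etree.SubElement(head, "title")  # empty <title>, as in the report
body = etree.SubElement(root, "body")
div = etree.SubElement(body, "div", align="center")
div.text = "something"

# Default method is "xml": empty elements are written in self-closing form.
xml_out = etree.tostring(root, encoding="unicode")

# method="html" keeps an explicit closing tag for the empty <title>.
html_out = etree.tostring(root, encoding="unicode", method="html")

# lxml.html.tostring serializes with method="html" by default.
html_via_helper = lxml.html.tostring(root, encoding="unicode")

print(xml_out)
print(html_out)
print(html_via_helper)
```

The XML form is the one a browser misreads: in HTML, the trailing slash on a non-void element such as title is ignored, so the tag is treated as an opening tag and everything after it is swallowed into the title.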