kovidgoyal / html5-parser

Fast C based HTML 5 parsing for python
Apache License 2.0
678 stars 33 forks source link

Doctype gets mangled with treebuilder='soup' and return_root=False #18

Closed Mr0grog closed 5 years ago

Mr0grog commented 5 years ago

When using the Beautiful Soup tree builder and not returning the root (so the result is the actual Beautiful Soup object), the doctype gets mangled. For example:

html = '''<!doctype html>
  <html>
    <head><title>Title Text</title></head>
    <body>Hello!</body>
  </html>
'''
soup = html5_parser.parse(html, treebuilder='soup', return_root=False)
print(soup)

Results in:

<!DOCTYPE <!DOCTYPE html>>
<html><head><title>Title Text</title></head>
    <body>Hello!

</body></html>

The doctype wound up getting repeated inside itself: <!DOCTYPE <!DOCTYPE html>>.

It looks like this is happening because html5_parser.soup.parse is working a little too hard in its internal add_doctype function: https://github.com/kovidgoyal/html5-parser/blob/10008d1e28361426e25aa2ff5903582fdab03c07/src/html5_parser/soup.py#L132-L135

It calls bs.Doctype with a full <!DOCTYPE html> string, but that’s not actually what the Doctype constructor takes: https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/element.py#L796

In fact, there’s even a nice static method that takes the same arguments add_doctype does, so I think html5-parser could probably just use that.

I’m happy to submit a PR to add a test and fix this, just let me know if my diagnosis here makes sense :)

kovidgoyal commented 5 years ago

Sure, a PR is welcome. Your analysis looks correct to me. Just remember to keep the check for hasattr('Doctype') to exclude bs3.

Also I dont know why I didn't use the static method in the first place, perhaps because it did not exist at the time, in which case, it might be good to add a check for it existing.