Closed Mr0grog closed 5 years ago
Sure, a PR is welcome. Your analysis looks correct to me. Just remember to keep the check for hasattr('Doctype') to exclude bs3.
Also I dont know why I didn't use the static method in the first place, perhaps because it did not exist at the time, in which case, it might be good to add a check for it existing.
When using the Beautiful Soup tree builder and not returning the root (so the result is the actual Beautiful Soup object), the doctype gets mangled. For example:
Results in:
The doctype wound up getting repeated inside itself:
<!DOCTYPE <!DOCTYPE html>>
.It looks like this is happening because
html5_parser.soup.parse
is working a little too hard in its internaladd_doctype
function: https://github.com/kovidgoyal/html5-parser/blob/10008d1e28361426e25aa2ff5903582fdab03c07/src/html5_parser/soup.py#L132-L135It calls
bs.Doctype
with a full<!DOCTYPE html>
string, but that’s not actually what theDoctype
constructor takes: https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/element.py#L796In fact, there’s even a nice static method that takes the same arguments
add_doctype
does, so I think html5-parser could probably just use that.I’m happy to submit a PR to add a test and fix this, just let me know if my diagnosis here makes sense :)