kovidgoyal / html5-parser

Fast C based HTML 5 parsing for python
Apache License 2.0

'soup' treebuilder adds 'xmlns' prefix to 'xmlns' attribute on inline svg element #6

Closed jpark3000 closed 7 years ago

jpark3000 commented 7 years ago

I am parsing some xhtml like so:

from html5_parser import parse

xhtml = """<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html>

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>foo</title>
</head>

<body class="sgc-1">
  <svg xmlns="http://www.w3.org/2000/svg" height="100%" preserveAspectRatio="xMidYMid meet" version="1.1" viewBox="0 0 600 800" width="100%" xmlns:xlink="http://www.w3.org/1999/xlink">
    <image height="800" width="573" xlink:href="../Images/Cover.jpg"></image>
  </svg>
</body>
</html>"""

soup = parse(xhtml, maybe_xhtml=True, treebuilder='soup', return_root=False, keep_doctype=False)

But when I examine the returned soup, the xmlns="http://www.w3.org/2000/svg" attribute on the inline <svg> element has become xmlns:xmlns="http://www.w3.org/2000/svg".

i.e.

<html xmlns="http://www.w3.org/1999/xhtml"><head>
  <title></title>
</head>

<body class="sgc-1">
  <svg height="100%" preserveAspectRatio="xMidYMid meet" version="1.1" viewBox="0 0 600 800" width="100%" xmlns:xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
    <image height="800" width="573" xlink:href="../Images/Cover.jpg"></image>
  </svg>

</body></html>

kovidgoyal commented 7 years ago

BeautifulSoup does not support XHTML, only HTML5. If you want to work with XHTML, use the lxml tree builder. In HTML5, XML namespaces are ignored. (maybe_xhtml does not have any effect with the soup builder, as is noted in the documentation.)
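
To illustrate why the distinction matters, here is a minimal sketch of how a namespace-aware XML parser treats the same markup. It uses the standard library's xml.etree.ElementTree rather than html5-parser, so it only demonstrates the general XML behaviour and is not a statement about what html5-parser's lxml tree builder returns. The document is a trimmed copy of the one in the report, and the constants SVG_NS and XLINK_NS are names introduced here for the example.

```python
import xml.etree.ElementTree as ET

# Namespace URIs taken from the document in the report.
SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Trimmed version of the reported XHTML (XML declaration and doctype omitted).
xhtml = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>foo</title></head>
<body class="sgc-1">
  <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" viewBox="0 0 600 800">
    <image height="800" width="573" xlink:href="../Images/Cover.jpg"></image>
  </svg>
</body>
</html>"""

root = ET.fromstring(xhtml)

# In a namespace-aware tree, the namespace becomes part of the tag name
# ("{uri}localname") instead of being stored as an ordinary attribute.
svg = root.find(f".//{{{SVG_NS}}}svg")
image = svg.find(f"{{{SVG_NS}}}image")

print(svg.tag)                          # {http://www.w3.org/2000/svg}svg
print("xmlns" in svg.attrib)            # False
print(image.get(f"{{{XLINK_NS}}}href")) # ../Images/Cover.jpg
```

An HTML5 tree has no equivalent of this namespace handling for attributes written in the source, which is consistent with the advice above to use an XML-aware tree builder when the input is XHTML.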