Open akshayphilar opened 5 years ago
It’s invalid HTML, nonetheless.
I wonder if any of the suggested alternative parsers support it…
@Gallaecio @akshayphilar Only lxml doesn't support it; both Python's html.parser and html5lib do. Still, I'm not sure how to fix it while still using lxml. They've been ignoring some similar bugs (like tag replacement) for ages :)
In [1]: from bs4 import BeautifulSoup
In [2]: html_1 = '<html><body><title><-Avengers-></title><div>Release Date</div></body></html>'
In [3]: html_2 = '<html><body><title><-Thor></title></body></html>'
In [4]: html_3 = '<html><body><title><-<span>Avengers</span>-></title><div>Release Date</div></body></html>'
...:
In [7]: soup1_hp = BeautifulSoup(html_1, "html.parser")
In [8]: soup1_lxml = BeautifulSoup(html_1, "lxml")
In [9]: soup1_html5 = BeautifulSoup(html_1, "html5lib")
In [10]: soup2_hp = BeautifulSoup(html_2, "html.parser")
In [11]: soup2_lxml = BeautifulSoup(html_2, "lxml")
In [12]: soup2_html5 = BeautifulSoup(html_2, "html5lib")
In [13]: soup3_hp = BeautifulSoup(html_3, "html.parser")
In [14]: soup3_lxml = BeautifulSoup(html_3, "lxml")
In [15]: soup3_html5 = BeautifulSoup(html_3, "html5lib")
In [16]: html_1
Out[16]: '<html><body><title><-Avengers-></title><div>Release Date</div></body></html>'
In [17]: soup1_hp
Out[17]: <html><body><title><-Avengers-></title><div>Release Date</div></body></html>
In [18]: soup1_lxml
Out[18]: <html><body><title></title><div>Release Date</div></body></html>
In [19]: soup1_html5
Out[19]: <html><head></head><body><title><-Avengers-></title><div>Release Date</div></body></html>
In [20]: html_2
Out[20]: '<html><body><title><-Thor></title></body></html>'
In [21]: soup2_hp
Out[21]: <html><body><title><-Thor></title></body></html>
In [22]: soup2_lxml
Out[22]: <html><body><title></title></body></html>
In [23]: soup2_html5
Out[23]: <html><head></head><body><title><-Thor></title></body></html>
In [24]: html_3
Out[24]: '<html><body><title><-<span>Avengers</span>-></title><div>Release Date</div></body></html>'
In [25]: soup3_hp
Out[25]: <html><body><title><-<span>Avengers</span>-></title><div>Release Date</div></body></html>
In [26]: soup3_lxml
Out[26]: <html><body><title>Avengers-></title><div>Release Date</div></body></html>
In [27]: soup3_html5
Out[27]: <html><head></head><body><title><-<span>Avengers</span>-></title><div>Release Date</div></body></html>
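One possible stopgap for anyone who has to stay on lxml is to escape stray "<" characters before the markup reaches the parser. The following is only a sketch: the helper name escape_stray_lt is hypothetical, the idea has not been checked against lxml in this thread, and it blindly rewrites every "<" that is not followed by a letter, "/", "!" or "?", so it would also touch such characters inside script or style blocks.

```python
import re

# A "<" that is not followed by a letter, "/", "!" or "?" cannot open a tag,
# a closing tag, a comment/doctype, or a processing instruction.
_STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(html: str) -> str:
    """Replace stray "<" characters with the &lt; entity (hypothetical helper)."""
    return _STRAY_LT.sub("&lt;", html)


html_1 = '<html><body><title><-Avengers-></title><div>Release Date</div></body></html>'
print(escape_stray_lt(html_1))
```

The assumption is that the parser then sees an ordinary character reference instead of something that looks like the start of a malformed tag.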
We’ll probably have to support alternative parsers. Doing so would solve a handful of issues currently reported.
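As a rough illustration of what one alternative parser gives, here is a minimal sketch that drives the standard library's html.parser directly and collects the text inside title. The class TitleText and the function title_text are made-up names for this example, not part of any existing API.

```python
from html.parser import HTMLParser


class TitleText(HTMLParser):
    """Collect the character data found inside <title> (illustrative only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_title = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # A stray "<" is reported as ordinary data rather than dropped.
        if self._in_title:
            self.chunks.append(data)


def title_text(html: str) -> str:
    parser = TitleText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


print(title_text('<html><body><title><-Thor></title></body></html>'))
```

With the first two sample documents from the session above, the text starting with "<-" survives intact.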
Text starting with "<-" within the HTML body is completely ignored. Examples follow.
Note: XML tag names starting with a hyphen are invalid as per the W3C XML spec.
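The note about XML names can be checked quickly with the standard library's XML parser. This is just a sketch; is_well_formed is a hypothetical helper name, and it only reports whether xml.etree.ElementTree accepts the input.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if ElementTree can parse the text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<a-b/>"))  # hyphen inside the name
print(is_well_formed("<-a/>"))   # hyphen as the first character of the name
```

A hyphen is allowed inside a name but not as its first character, which is why "<-Avengers->" can never be read as a valid XML tag.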
Example 1
Example 2
Example 3