scrapy / parsel

Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors
BSD 3-Clause "New" or "Revised" License

Text Starting with "<-" is Ignored #126

Open akshayphilar opened 5 years ago

akshayphilar commented 5 years ago

Text starting with "<-" within the HTML body is completely ignored. Examples follow.

Note: XML tag names starting with a hyphen are invalid as per the W3C XML spec.

Example 1

>>> from parsel import Selector
>>> html = '<html><body><title><-Avengers-></title><div>Release Date</div></body></html>'
>>> Selector(html).extract()
'<html><body><title></title><div>Release Date</div></body></html>'

Example 2

>>> html = '<html><body><title><-Thor></title></body></html>'
>>> Selector(html).extract()
'<html><body><title></title></body></html>'

Example 3

>>> html = '<html><body><title><-<span>Avengers</span>-></title><div>Release Date</div></body></html>'
>>> Selector(html).extract()
'<html><body><title>Avengers-&gt;</title><div>Release Date</div></body></html>'

Gallaecio commented 5 years ago

It’s invalid HTML, nonetheless.

I wonder if any of the suggested alternative parsers support it…

sortafreel commented 5 years ago

@Gallaecio @akshayphilar Only lxml doesn't support it; both Python's html.parser and html5lib do. Still, I'm not sure how to fix it while still using lxml. They've been ignoring some similar bugs (like tag replacement) for ages :)

In [1]: from bs4 import BeautifulSoup                                                                                                

In [2]: html_1 = '<html><body><title><-Avengers-></title><div>Release Date</div></body></html>'                                      

In [3]: html_2 = '<html><body><title><-Thor></title></body></html>'                                                                  

In [4]: html_3 = '<html><body><title><-<span>Avengers</span>-></title><div>Release Date</div></body></html>'

In [7]: soup1_hp = BeautifulSoup(html_1, "html.parser")                                                                              

In [8]: soup1_lxml = BeautifulSoup(html_1, "lxml")                                                                                   

In [9]: soup1_html5 = BeautifulSoup(html_1, "html5lib")                                                                              

In [10]: soup2_hp = BeautifulSoup(html_2, "html.parser")                                                                             

In [11]: soup2_lxml = BeautifulSoup(html_2, "lxml")                                                                                  

In [12]: soup2_html5 = BeautifulSoup(html_2, "html5lib")                                                                             

In [13]: soup3_hp = BeautifulSoup(html_3, "html.parser")                                                                             

In [14]: soup3_lxml = BeautifulSoup(html_3, "lxml")                                                                                  

In [15]: soup3_html5 = BeautifulSoup(html_3, "html5lib")                                                                             

In [16]: html_1                                                                                                                      
Out[16]: '<html><body><title><-Avengers-></title><div>Release Date</div></body></html>'

In [17]: soup1_hp                                                                                                                    
Out[17]: <html><body><title>&lt;-Avengers-&gt;</title><div>Release Date</div></body></html>

In [18]: soup1_lxml                                                                                                                  
Out[18]: <html><body><title></title><div>Release Date</div></body></html>

In [19]: soup1_html5                                                                                                                 
Out[19]: <html><head></head><body><title>&lt;-Avengers-&gt;</title><div>Release Date</div></body></html>

In [20]: html_2                                                                                                                      
Out[20]: '<html><body><title><-Thor></title></body></html>'

In [21]: soup2_hp                                                                                                                    
Out[21]: <html><body><title>&lt;-Thor&gt;</title></body></html>

In [22]: soup2_lxml                                                                                                                  
Out[22]: <html><body><title></title></body></html>

In [23]: soup2_html5                                                                                                                 
Out[23]: <html><head></head><body><title>&lt;-Thor&gt;</title></body></html>

In [24]: html_3                                                                                                                      
Out[24]: '<html><body><title><-<span>Avengers</span>-></title><div>Release Date</div></body></html>'

In [25]: soup3_hp                                                                                                                    
Out[25]: <html><body><title>&lt;-<span>Avengers</span>-&gt;</title><div>Release Date</div></body></html>

In [26]: soup3_lxml                                                                                                                  
Out[26]: <html><body><title>Avengers-&gt;</title><div>Release Date</div></body></html>

In [27]: soup3_html5                                                                                                                 
Out[27]: <html><head></head><body><title>&lt;-&lt;span&gt;Avengers&lt;/span&gt;-&gt;</title><div>Release Date</div></body></html>

Gallaecio commented 4 years ago

We’ll probably have to support alternative parsers. Doing so would solve a handful of issues currently reported.
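
Until alternative parsers are available, one possible stopgap for the "how to fix it if still using lxml" question is to escape stray "<" characters before the text reaches the parser. The sketch below is not part of parsel or lxml; `escape_stray_lt` and `_STRAY_LT` are hypothetical names, and it assumes that a "<" only opens markup when followed by an ASCII letter, "/", "!" or "?". It uses a plain regex, so it would also rewrite a "<" inside script or style content, which may not be what you want.

```python
import re

# A "<" that is NOT followed by a letter, "/", "!" or "?" cannot open a tag,
# end tag, comment, doctype or processing instruction, so treat it as text.
# (Assumption: this regex is a rough approximation, not a full tokenizer.)
_STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(html: str) -> str:
    """Replace stray "<" characters with "&lt;" (hypothetical helper)."""
    return _STRAY_LT.sub("&lt;", html)


html_1 = '<html><body><title><-Avengers-></title><div>Release Date</div></body></html>'
html_3 = '<html><body><title><-<span>Avengers</span>-></title><div>Release Date</div></body></html>'

escaped_1 = escape_stray_lt(html_1)
escaped_3 = escape_stray_lt(html_3)

print(escaped_1)
print(escaped_3)
```

The escaped string can then be passed on as usual, for example `Selector(text=escaped_1)`. Whether the result matches what html.parser or html5lib produce in every case has not been checked here.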