jhy / jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
https://jsoup.org
MIT License

Jsoup incorrectly parses self-closed tags inside a <noscript> tag in the <head> #927

Open husam-otri opened 7 years ago

husam-otri commented 7 years ago

Jsoup incorrectly parses self-closed tags (iframe, img) inside a <noscript> tag in the <head>.

For example:


<html>
<head>
<noscript><iframe src="//www.googletagmanager.com/ns.html?id=GTM-W3LK7G" height="0" width="0" style="display:none;visibility:hidden"/> </noscript>
<noscript><img src="http://example.com" /></noscript>
<meta />
</head>
<body>
<div>main</div>
</body>
</html>

The parsing result is:
<html>
 <head> 
  <noscript>
   &lt;iframe src="//www.googletagmanager.com/ns.html?id=GTM-W3LK7G" height="0" width="0" style="display:none;visibility:hidden"&gt; 
  </noscript> 
  <noscript>
   &lt;img src="http://example.com"&gt;
  </noscript> 
  <meta> 
 </head> 
 <body> 
  <div>
   main
  </div>  
 </body>
</html>
robsonpeixoto commented 7 years ago

@husam-otri it looks like intentional behavior. https://github.com/jhy/jsoup/blob/master/src/test/java/org/jsoup/parser/HtmlParserTest.java#L512-L516 https://github.com/jhy/jsoup/commit/ee5d4dffb382e362b52af67a53dffc92cc9ced49

Why was it necessary, @jhy?

MDN reference: https://developer.mozilla.org/en/docs/Web/HTML/Element/noscript

jhy commented 7 years ago

The spec has noscript just containing text content when scripting is disabled (which it is because jsoup doesn't support scripting). I could imagine disabling that check though and treating it as a regular tag.

blacelle commented 2 weeks ago

A related issue is that .text() on

<head>
<noscript>
  <img height="1" width="1" style="display:none" src="https://www.facebook.com/tr?id=387891508999354&ev=PageView&noscript=1"/>
</noscript>
</head>

returns <img height="1" width="1" style="display:none" src="https://www.facebook.com/tr?id=387891508999354&ev=PageView&noscript=1"/> while I would expect an empty String.
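
Until the parser treats noscript as a regular tag, one blunt workaround for the .text() case above is to drop <noscript> blocks from the markup before handing it to jsoup. The sketch below is not part of jsoup: the class name NoscriptStripper and the method stripNoscript are hypothetical, it uses only java.util.regex, and it assumes the <noscript> blocks are well formed, not nested, and not embedded in comments or scripts.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a jsoup API: removes <noscript> blocks from raw HTML
// so that their contents never reach the parser as text.
class NoscriptStripper {
    // Matches one <noscript ...> ... </noscript> block, case-insensitively.
    // DOTALL lets '.' span line breaks; the lazy '.*?' stops at the first closing tag.
    private static final Pattern NOSCRIPT_BLOCK = Pattern.compile(
            "<noscript\\b[^>]*>.*?</noscript\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the input with every matched <noscript> block removed.
    public static String stripNoscript(String html) {
        return NOSCRIPT_BLOCK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<head>\n"
                + "<noscript>\n"
                + "  <img height=\"1\" width=\"1\" style=\"display:none\" src=\"https://example.com/pixel\"/>\n"
                + "</noscript>\n"
                + "</head>";
        // The cleaned string can then be passed to Jsoup.parse(...) as usual.
        System.out.println(stripNoscript(html));
    }
}
```

This throws the noscript content away entirely, so it only fits cases where that content is not needed (tracking pixels, for instance). A regular expression is not an HTML parser, so treat it as a stopgap.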