jhy / jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
https://jsoup.org
MIT License

JSoup differs from browsers around commented HTML attributes #1938

Open panthony opened 1 year ago

panthony commented 1 year ago

Hi,

I encountered a case where jsoup differs from what browsers (Chrome, Firefox, Safari) do.

Using this piece of HTML on try jsoup:

<html>
<head>
<title>Try jsoup</title>
</head>
<body>
  <h1>before</h1>
  <div <!--="" id="hidden" --="">
      <h1>within</h1>
  </div>
   <h1>after</h1>
</body>
</html>

jsoup will produce:

<html>
 <head>
  <title>Try jsoup</title>
 </head>
 <body>
  <h1>before</h1>
  <div>
   <!--="" id="hidden" --="">
      <h1>within</h1>
  </div>
   <h1>after</h1>
</body>
</html>
-->
  </div>
 </body>
</html>

This comments out the rest of the body, whereas all major browsers escape the comment characters and show all three headings.

panthony commented 1 year ago

Probably a similar issue to https://github.com/jhy/jsoup/issues/1483, except that here it comments out pretty much all of the HTML.

jhy commented 1 year ago

Yes, I believe @panthony is right -- the browsers aren't treating this as a comment but as attributes on the div tag, like:

<div
  Attr: <!--
  Attr: id = hidden
  Attr: --
>

Will need to revisit #1483: either implement my idea, or scrap the attempt to handle a missing > and just strictly follow the spec.
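
The attribute reading described above (the browsers seeing `<!--`, `id`, and `--` as three attributes of the div) can be illustrated with a small, self-contained sketch. This is not jsoup code and not a full implementation of the HTML tokenizer: `TagAttributeSketch` and `parseAttributes` are hypothetical names, and the logic is a simplified take on the spec's attribute-name and attribute-value handling (no character references, no error reporting, no self-closing flag). It only assumes that once the tokenizer is inside a start tag, any character other than whitespace, `/`, `>`, or `=` is accepted as part of an attribute name.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of jsoup.
final class TagAttributeSketch {

    /** Reads the attributes of one start tag, e.g. {@code <div a="b">}, in source order. */
    static Map<String, String> parseAttributes(String tag) {
        Map<String, String> attrs = new LinkedHashMap<>();
        int n = tag.length();
        int i = 0;
        if (i < n && tag.charAt(i) == '<') i++;
        // Skip the tag name.
        while (i < n && !isNameEnd(tag.charAt(i))) i++;

        while (i < n) {
            char c = tag.charAt(i);
            if (c == '>') break;                             // end of the tag
            if (isSpace(c) || c == '/') { i++; continue; }

            // Attribute name: the first character is always consumed, so '<' or '='
            // can start a name; '<', '!' and '-' are ordinary name characters here.
            int start = i++;
            while (i < n && !isNameEnd(tag.charAt(i))) i++;
            String name = tag.substring(start, i).toLowerCase(Locale.ROOT);

            while (i < n && isSpace(tag.charAt(i))) i++;
            String value = "";
            if (i < n && tag.charAt(i) == '=') {
                i++;
                while (i < n && isSpace(tag.charAt(i))) i++;
                if (i < n && (tag.charAt(i) == '"' || tag.charAt(i) == '\'')) {
                    char quote = tag.charAt(i++);
                    int valueStart = i;
                    while (i < n && tag.charAt(i) != quote) i++;
                    value = tag.substring(valueStart, i);
                    if (i < n) i++;                          // skip the closing quote
                } else {
                    int valueStart = i;
                    while (i < n && !isSpace(tag.charAt(i)) && tag.charAt(i) != '>') i++;
                    value = tag.substring(valueStart, i);
                }
            }
            attrs.putIfAbsent(name, value);                  // the first occurrence of a name wins
        }
        return attrs;
    }

    private static boolean isSpace(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
    }

    private static boolean isNameEnd(char c) {
        return isSpace(c) || c == '/' || c == '>' || c == '=';
    }

    public static void main(String[] args) {
        // The div start tag exactly as it appears in the report.
        Map<String, String> attrs = parseAttributes("<div <!--=\"\" id=\"hidden\" --=\"\">");
        attrs.forEach((name, value) -> System.out.println("Attr: " + name + " = " + value));
    }
}
```

Under these assumptions the div ends up with the attribute names `<!--`, `id`, and `--`, and no comment token is ever started, which is why the three headings stay visible in browsers.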