jhy / jsoup

jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.
https://jsoup.org
MIT License

Improperly formatting text within a <pre> tag #1891

Closed: NiccoMlt closed this issue 1 year ago

NiccoMlt commented 1 year ago

Hi, jsoup apparently reformats the content inside a <pre> tag, which changes how the page renders. Given the following HTML:

<!DOCTYPE html>
<html lang="en">
<head><title>Test</title></head>
<body>
    <div>
        <pre><span><b><u><span>TEST</span></u></b></span></pre>
    </div>
</body>
</html>

and running the following Java code:

final String html =
        "<!DOCTYPE html>\n"
        + "<html lang=\"en\">\n"
        + "<head><title>Test</title></head>\n"
        + "<body>\n"
        + "    <div>\n"
        + "        <pre><span><b><u><span>TEST</span></u></b></span></pre>\n"
        + "    </div>\n"
        + "</body>\n"
        + "</html>";
final String parsed = Jsoup.parse(html).toString();
System.out.println(parsed);

the result is

<!DOCTYPE html> 
<html lang="en"> 
<head>
 <title>Test</title>
</head> 
<body> 
 <div> 
  <pre><span><b><u>
      <span>TEST</span>
     </u></b></span></pre> 
 </div>  
</body>
</html>

I'm using the latest version, 1.15.3.

jhy commented 1 year ago

I'm not able to repro this in 1.15.4. See this example, which returns:

  <div>
   <pre><span><b><u><span>TEST</span></u></b></span></pre>
  </div>

jsoup does test whether an element is in a <pre> (in Element#preserveWhitespace()) and will preserve text node formatting, and it should not otherwise be formatting elements. There is a limit on how far up the stack it looks (6 levels), as an optimization for serialization time, but that wouldn't have an impact in this instance. I guess this issue was resolved by one of the pretty-print fixes in 1.15.4, but I haven't checked yet.
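To make the depth-limited check concrete, here is a minimal sketch of the idea, not jsoup's actual source. The `PreserveWhitespaceSketch` class, its nested `Node` type, and the `MAX_LEVELS` constant are hypothetical names introduced for illustration, and the exact counting (the element itself plus its nearest ancestors, six nodes in total) is an assumption based on the "6 levels" remark above.

```java
// Illustrative only: a stand-in for the ancestor walk, not jsoup's Element class.
final class PreserveWhitespaceSketch {

    // Hypothetical minimal node: a tag name and a parent (null for the root).
    static final class Node {
        final String tagName;
        final Node parent;

        Node(String tagName, Node parent) {
            this.tagName = tagName;
            this.parent = parent;
        }
    }

    // Assumed cap on how many nodes are inspected, starting at the element itself.
    static final int MAX_LEVELS = 6;

    // Returns true if the node, or one of its nearest ancestors within the cap, is a <pre>.
    static boolean preserveWhitespace(Node node) {
        Node current = node;
        for (int level = 0; current != null && level < MAX_LEVELS; level++) {
            if ("pre".equalsIgnoreCase(current.tagName)) {
                return true;
            }
            current = current.parent;
        }
        return false;
    }

    public static void main(String[] args) {
        // Mirrors <div><pre><span><b><u><span>TEST</span></u></b></span></pre></div>
        Node div = new Node("div", null);
        Node pre = new Node("pre", div);
        Node span = new Node("span", pre);
        Node b = new Node("b", span);
        Node u = new Node("u", b);
        Node inner = new Node("span", u);

        System.out.println(preserveWhitespace(inner)); // prints "true": <pre> is 4 levels up
        System.out.println(preserveWhitespace(div));   // prints "false": no <pre> ancestor
    }
}
```

In this sketch the innermost <span> of the reported markup sits four levels below the <pre>, well inside the cap, which is consistent with the remark that the limit would not matter for this case.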

Can you review with 1.15.4? If you find other cases where it's not working as desired, happy to take a look.

NiccoMlt commented 1 year ago

Hi, thank you for your answer. You are right about the minimal example; it seems to be fixed.

Sadly, I'm still experiencing the problem with my actual document. I cannot provide the full document, but I can provide another example:

<div>
    <pre><span><b><u><o:p>TEST</o:p></u></b></span></pre>
</div>

Under jsoup 1.15.4, the HTML above is formatted as:

<html>
 <head></head>
 <body>
  <div>
   <pre><span><b><u>
       <o:p>TEST
       </o:p></u></b></span></pre>
  </div>
 </body>
</html>

Note that I replaced the <span> tag with an Office-namespaced paragraph tag, <o:p>.

HTML documents with these tags are usually produced by tools like Microsoft Word and Microsoft Outlook.
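One way to show that the two renderings really differ is to compare the text inside the <pre> before and after serialization. The sketch below uses only the Java standard library and does not call jsoup. The `PreTextCheck` class and its `preText` helper are hypothetical names, and the regex-based tag stripping is a rough assumption that is good enough for this small sample but is not a general HTML parser (it does not decode entities or handle nested <pre> blocks).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: compares the visible text of the first <pre> block in two HTML strings.
final class PreTextCheck {

    // Captures everything between the first <pre ...> and the next </pre>.
    private static final Pattern PRE =
            Pattern.compile("<pre\\b[^>]*>(.*?)</pre>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Matches any tag, including namespaced ones such as <o:p>.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Returns the content of the first <pre> with tags removed, or null if there is none.
    static String preText(String html) {
        Matcher m = PRE.matcher(html);
        if (!m.find()) {
            return null;
        }
        return TAG.matcher(m.group(1)).replaceAll("");
    }

    public static void main(String[] args) {
        // Input and output fragments copied from the report above.
        String input = "<div>\n    <pre><span><b><u><o:p>TEST</o:p></u></b></span></pre>\n</div>";
        String output = "<div>\n   <pre><span><b><u>\n       <o:p>TEST\n       </o:p></u></b></span></pre>\n  </div>";

        // The input <pre> holds just "TEST"; the output adds newlines and indentation around it.
        System.out.println(preText(input).equals(preText(output))); // prints "false"
    }
}
```

Because whitespace inside a <pre> is rendered literally, the added newlines and indentation show up on the page, which is why the output is not equivalent to the input.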

jhy commented 1 year ago

Thanks for the updated detail -- fixed