htacg / tidy-html5

The granddaddy of HTML tools, with support for modern standards
http://www.html-tidy.org

Extra newline inserted inside inlined script element #802

Open da2x opened 5 years ago

da2x commented 5 years ago

Steps to reproduce:

echo -e "<body>inline<script>\n();</script></body>" | tidy -q

Test results:

<!DOCTYPE html>
<html>
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Linux version 5.6.0">
<title></title>
</head>
<body>
inline
<script>

();
</script>
</body>
</html>

The issue doesn’t happen when the element isn’t inlined:

echo -e "<body><script>\n();</script></body>" | tidy -q

Test results:

<!DOCTYPE html>
<html>
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Linux version 5.6.0">
<title></title>
</head>
<body>
<script>
();
</script>
</body>
</html>

Version: HTML Tidy for Linux version 5.6.0
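The two cases above can be compared with a small script. This is only a minimal sketch: it assumes a `tidy` binary is on the PATH (otherwise `run_tidy` returns None), and the helper names `run_tidy` and `script_body_lines` are made up for illustration and are not part of Tidy.

```python
import shutil
import subprocess

# The two inputs from the report; they differ only by the text "inline".
INLINE_INPUT = "<body>inline<script>\n();</script></body>"
BLOCK_INPUT = "<body><script>\n();</script></body>"


def run_tidy(html):
    """Pipe html through `tidy -q`; return stdout, or None if tidy is missing."""
    if shutil.which("tidy") is None:
        return None
    # tidy can exit non-zero when it only emits warnings, so no check=True.
    proc = subprocess.run(
        ["tidy", "-q"], input=html, capture_output=True, text=True
    )
    return proc.stdout


def script_body_lines(output):
    """Return the lines between the <script> line and the </script> line.

    Assumes each tag sits on its own line, as in the output shown above.
    """
    lines = output.splitlines()
    start = lines.index("<script>")
    end = lines.index("</script>")
    return lines[start + 1:end]
```

Applied to the results reported above, `script_body_lines` gives `['', '();']` for the inlined case and `['();']` for the other, so the difference is exactly one empty line at the start of the script body.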

geoffmcl commented 5 years ago

@da2x in quick testing, it seems maybe you add the newline, in echo \n...

That is, echo -e "<body>inline<script>();</script></body>" | tidy -q produces none, I think...

But the input interpretation does seem to vary on the parent node... text or body in this case... need to debug... understand...

And could be related to #801... not sure... will try to investigate...

Meantime, look forward to feedback, comments, patches, PR, etc... thanks...

da2x commented 5 years ago

I'm indeed adding one newline. However, note that the only difference between the two example documents is the string inline. The number of newlines in the input doesn't change, yet the outputs differ in that regard.

geoffmcl commented 5 years ago

@da2x it is not the exact string inline, but that you have created, in libTidy, a text node, parent body, for this <script...> block... and as agreed do not yet understand the different input string interpretations... before we get to output...

Well I do understand libTidy collection of bytes, from the stream, depends on the mode... and this can change, like when the node accepts preformatted text, so \n becomes significant... ie is faithfully stored for output... so still could be related to #801... not sure...

As stated, need to explore, in debug mode, etc... to understand, or fix something... we agree there is a different output... why?

Look forward to feedback, comments, patches, PR, etc... thanks...

geoffmcl commented 5 years ago

@da2x done a little bit of debugging... now agree not related to #801...

But probably is related to the mode active during the input stream's token collection...

Can see what seems to be the problem in the debug output... some lines deleted...

Case 1: in_802.html, with "<body>any <script>\n();</script></body>", get
Enter ParseBody...
R=1 C=7: Returning text_node TextNode [any ]4 stream # can be ANY text, w/wout space
R=1 C=11: Returning starttag node <script> stream
R=1 C=19: Returning lex-cdata TextNode [\n();]4 lexer
...
Case 2: in_802-1.html, with "<body><script>\n();</script></body>", get
Enter ParseBody...
R=1 C=7: Returning starttag node <script> stream
R=2 C=1: Returning lex-cdata TextNode [();]3 lexer
...

Note in case 1, the text stored, correctly, includes the newline [\n();]4, and it will be added to the output... while in case 2, the newline has been lost! [();]3 - UGH! Why?

Maybe it is a little misleading to say extra newline inserted ..., as in this title... but it looks a little like that in output... but...

The first newline in a script tag's CDATA like text is lost in case 2:... BUG

It is preserved in case 1:, and faithfully added to the output...
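The one-character difference between the two stored text nodes can be checked directly. This is a rough sketch, assuming the number after each bracketed text in the trace is the node's length in characters (an inference from the values shown, not something the trace states); `leading_newline_lost` is a hypothetical helper, not a libTidy function.

```python
# Script body as written in both input files.
SOURCE_BODY = "\n();"

# Text stored by the lexer in each case, copied from the trace above.
CASE1_STORED = "\n();"  # [\n();]4
CASE2_STORED = "();"    # [();]3


def leading_newline_lost(source, stored):
    """True when stored equals source minus exactly one leading newline."""
    return source.startswith("\n") and stored == source[1:]


# Lengths match the counts printed in the trace (assumed to be lengths).
assert len(CASE1_STORED) == 4
assert len(CASE2_STORED) == 3

print(leading_newline_lost(SOURCE_BODY, CASE1_STORED))  # False: newline kept
print(leading_newline_lost(SOURCE_BODY, CASE2_STORED))  # True: newline dropped
```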

Still to completely understand why... am on its tail...

Look forward to feedback, comments, patches, PR, etc, etc... thanks...

pvdosev commented 1 year ago

Hello! This is also an issue I've been having, and I didn't realize it was related to inline elements. For me, it is happening because of placing a