htacg / tidy-html5

The granddaddy of HTML tools, with support for modern standards
http://www.html-tidy.org

tidy removes line breaks inside a pre element containing another element #1006

Open David-Apps opened 3 years ago

David-Apps commented 3 years ago

Tidy removes the line breaks from the contents of a pre element when that element contains another element such as the code element. I expected tidy to preserve the line breaks inside the pre element (except perhaps a line break that immediately follows the <pre> tag).

I used HTML Tidy for Linux/x86 version 5.9.17.

Input:

<!DOCTYPE html>
<html lang="en">
<head>
<title>code inside pre</title>
<meta charset="UTF-8">
</head>
<body>
<h1>code inside pre</h1>
<p>This example uses an example from the <a href="https://html.spec.whatwg.org/#the-pre-element">HTML specification</a>.</p>
<pre><code>function Panel(element, canClose, closeHandler) {
  this.element = element;
  this.canClose = canClose;
  this.closeHandler = function () { if (closeHandler) closeHandler() };
}</code></pre>
</body>
</html>

Output:

Info: Document content looks like HTML5
No warnings or errors were found.

<!DOCTYPE html>
<html lang="en">
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Linux/x86 version 5.9.17">
<title>code inside pre</title>
<meta charset="UTF-8">
</head>
<body>
<h1>code inside pre</h1>
<p>This example uses an example from the <a href=
"https://html.spec.whatwg.org/#the-pre-element">HTML
specification</a>.</p>
<pre><code>function Panel(element, canClose, closeHandler) { this.element = element; this.canClose = canClose; this.closeHandler = function () { if (closeHandler) closeHandler() }; }</code></pre>
</body>
</html>

About HTML Tidy: https://github.com/htacg/tidy-html5
Bug reports and comments: https://github.com/htacg/tidy-html5/issues
Official mailing list: https://lists.w3.org/Archives/Public/public-htacg/
Latest HTML specification: https://html.spec.whatwg.org/multipage/
Validate your HTML documents: https://validator.w3.org/nu/
Lobby your company to join the W3C: https://www.w3.org/Consortium

Would you like to see Tidy in proper, British English? Please consider 
helping us to localise HTML Tidy. For details please see 
https://github.com/htacg/tidy-html5/blob/master/README/LOCALIZE.md
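
The difference between the input and output above can be checked mechanically. The following is a minimal sketch, not part of Tidy: it uses Python's standard `html.parser` to pull the text out of each pre element and compares the number of line breaks before and after tidying. The names `PreTextCollector`, `pre_blocks`, and `line_breaks_preserved` are made up for this illustration, and the sample strings are a shortened stand-in for the Panel example.

```python
from html.parser import HTMLParser


class PreTextCollector(HTMLParser):
    """Collect the text content of each outermost <pre> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0      # how many <pre> elements are currently open
        self.blocks = []    # finished text, one string per <pre>
        self._parts = []    # text pieces of the <pre> being read

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.blocks.append("".join(self._parts))
                self._parts = []

    def handle_data(self, data):
        if self.depth > 0:
            self._parts.append(data)


def pre_blocks(html):
    """Return the text of every <pre> block, ignoring nested tags such as <code>."""
    collector = PreTextCollector()
    collector.feed(html)
    collector.close()
    # A line break directly after the <pre> start tag is not significant.
    return [b[1:] if b.startswith("\n") else b for b in collector.blocks]


def line_breaks_preserved(original, tidied):
    """True if every <pre> block keeps the same number of line breaks."""
    count = lambda doc: [block.count("\n") for block in pre_blocks(doc)]
    return count(original) == count(tidied)


# Shortened stand-ins for the input and the 5.9.17 output shown above.
ORIGINAL = "<pre><code>function Panel() {\n  this.x = 1;\n}</code></pre>"
TIDIED = "<pre><code>function Panel() { this.x = 1; }</code></pre>"
print(line_breaks_preserved(ORIGINAL, TIDIED))  # False
```

Counting line breaks is a deliberately coarse test. It ignores changes to indentation, which a stricter check would compare as well.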

ler762 commented 3 years ago

On 10/24/21, David-Apps @.***> wrote:

Tidy removes the line breaks from the contents of a pre element when that element contains another element such as the code element. I expected tidy to preserve the line breaks inside the pre element (except perhaps a line break that immediately follows the <pre> tag).

I used HTML Tidy for Linux/x86 version 5.9.17.

looks like a regression -- 5.7.54 does the right thing:

$ tidy /tmp/x.htm
Info: Document content looks like HTML5
No warnings or errors were found.

<!DOCTYPE html>

Test

This is the Panel constructor:

function Panel(element, canClose, closeHandler) {
  this.element = element;
  this.canClose = canClose;
  this.closeHandler = function () { if (closeHandler) closeHandler() };
}

5.9.17 doesn't:

$ ./tidy-html5/build/cmake/tidy.exe /tmp/x.htm
Info: Document content looks like HTML5
No warnings or errors were found.

<!DOCTYPE html>

Test

This is the Panel constructor:

function Panel(element, canClose, closeHandler) {
this.element = element; this.canClose = canClose; this.closeHandler =
function () { if (closeHandler) closeHandler() }; }

ncaq commented 2 years ago

I searched for the commit that introduced the bug with git bisect.

Here is the result:

91f29ea7b88a0f3a810d011f958ea9dd935bd65b

Head:     91f29ea HTML Tidy now parses HTML non-recursively.
Tags:     5.9.8-next (1), 5.9.9-next (2)

91f29ea7b88a0f3a810d011f958ea9dd935bd65b is the first bad commit
commit 91f29ea7b88a0f3a810d011f958ea9dd935bd65b
Author: Jim Derry <balthisar@gmail.com>
Date:   Thu Aug 5 08:18:30 2021 -0400
    HTML Tidy now parses HTML non-recursively.

    Instead of recursive calls for each nested level of HTML, the next level is
    pushed to a stack on the heap, and returned to the main loop. This prevents
    stack overflow at _n_ depth (where _n_ is operating-system dependent). It's
    probably still possible to use all of the heap memory, but Tidy's allocators
    already fail gracefully in this circumstance.

    Please report any regressions of your own HTML!

    NOTE: the XML parser is not affected, and is probably still highly recursive.
 regression_testing/cases/dev-cases/case-001.conf   |    4 +
 regression_testing/cases/dev-cases/case-001@0.html |   26 +
 regression_testing/cases/dev-cases/case-002.conf   |    4 +
 regression_testing/cases/dev-cases/case-002@1.html |   33 +
 regression_testing/cases/dev-cases/case-003.conf   |    4 +
 regression_testing/cases/dev-cases/case-003@1.html |   27 +
 regression_testing/cases/dev-cases/case-004.conf   |    4 +
 regression_testing/cases/dev-cases/case-004@1.html |   41 +
 regression_testing/cases/dev-expects/case-001.html |   41 +
 regression_testing/cases/dev-expects/case-001.txt  |   14 +
 regression_testing/cases/dev-expects/case-002.html |   39 +
 regression_testing/cases/dev-expects/case-002.txt  |   16 +
 regression_testing/cases/dev-expects/case-003.html |   30 +
 regression_testing/cases/dev-expects/case-003.txt  |   26 +
 regression_testing/cases/dev-expects/case-004.html |   61 +
 regression_testing/cases/dev-expects/case-004.txt  |   14 +
 regression_testing/cases/special-cases/README.txt  |   15 +
 .../cases/special-cases/case-evil.conf             |    4 +
 .../cases/special-cases/case-evil@1.html           |    6 +
 src/parser.c                                       | 7482 +++++++++-----------
 src/parser.h                                       |   33 +-
 src/tags.h                                         |    2 +-
 22 files changed, 3890 insertions(+), 4036 deletions(-)
 create mode 100755 regression_testing/cases/dev-cases/case-001.conf
 create mode 100755 regression_testing/cases/dev-cases/case-001@0.html
 create mode 100755 regression_testing/cases/dev-cases/case-002.conf
 create mode 100755 regression_testing/cases/dev-cases/case-002@1.html
 create mode 100755 regression_testing/cases/dev-cases/case-003.conf
 create mode 100644 regression_testing/cases/dev-cases/case-003@1.html
 create mode 100755 regression_testing/cases/dev-cases/case-004.conf
 create mode 100644 regression_testing/cases/dev-cases/case-004@1.html
 create mode 100644 regression_testing/cases/dev-expects/case-001.html
 create mode 100644 regression_testing/cases/dev-expects/case-001.txt
 create mode 100644 regression_testing/cases/dev-expects/case-002.html
 create mode 100644 regression_testing/cases/dev-expects/case-002.txt
 create mode 100644 regression_testing/cases/dev-expects/case-003.html
 create mode 100644 regression_testing/cases/dev-expects/case-003.txt
 create mode 100644 regression_testing/cases/dev-expects/case-004.html
 create mode 100644 regression_testing/cases/dev-expects/case-004.txt
 create mode 100644 regression_testing/cases/special-cases/README.txt
 create mode 100755 regression_testing/cases/special-cases/case-evil.conf
 create mode 100644 regression_testing/cases/special-cases/case-evil@1.html

Bisect Rest (1)
91f29ea * bad @ HTML Tidy now parses HTML non-recursively.

Bisect Log (9)
git bisect start 'next' 'bed8efb'
d08ddc2 bad Bump version. No binary change, but does affect environment.
bed8efb good Bump to 5.7.54 based on settings fix.

git bisect good db847e6e1c632c7bf361f7d82daf6736fa43b246
db847e6 good Merge pull request #981 from htacg/iterate

git bisect bad a46949f46a4cc32ed23303d456ad9c20beac3866
a46949f bad Bump to version 5.9.12.

git bisect good c22c37b5a473d4a4b0bbd23cb3051f820b3ff026
c22c37b good Add license to .github

git bisect bad 28068b1273c85d2a4b7c9441530b32d71951b24e
28068b1 bad Fixes #816.

git bisect good b6f7e4384295dd28a3eb1edcd5ee3bed23f08ea5
b6f7e43 good Merge pull request #984 from htacg/issue_946

git bisect bad 2e7ec117fdd3ed5c20e9e92ff4b282239bb7bdcd
2e7ec11 bad Bump version.

git bisect bad 91f29ea7b88a0f3a810d011f958ea9dd935bd65b
91f29ea bad HTML Tidy now parses HTML non-recursively.

91f29ea7b88a0f3a810d011f958ea9dd935bd65b is the first bad commit

Untracked files (1)
.ccls-cache/

Recent commits
91f29ea bad @ HTML Tidy now parses HTML non-recursively.
b6f7e43 good-b6f7e4384295dd28a3eb1edcd5ee3bed23f08ea5 5.9.8-next Merge pull request #984 from htacg/issue_946
efa6152 Fixes #946 by refactoring the recursion into a loop with a heap-based stack.
c055b71 Deleted LICENSE again. Enough is enough.
c22c37b good-c22c37b5a473d4a4b0bbd23cb3051f820b3ff026 Add license to .github
e11dba9 Removed docs.
995c20e Doc folder.
1213047 More static analyser fixes; version bump to 5.9.7.
5f98ccd Static analyzer fixes.
bd751a8 Fix allocation error; fix some static analyzer suggestions.
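
A bisect like the one logged above can also be driven by `git bisect run`, which marks each commit from a script's exit code: 0 for good, 125 for "cannot test, skip", and any other code from 1 to 127 for bad. The following is a rough sketch of such a check, not the script used for the log above. The function names, the sample document, and the idea of passing in the path of a freshly built binary are all assumptions made for illustration. Rebuilding Tidy at each step is left to the caller.

```python
import subprocess

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`

# A <pre> block with a nested element, reduced from the original report.
SAMPLE = "<!DOCTYPE html>\n<title>t</title>\n<pre><code>a {\n  b;\n}</code></pre>\n"


def verdict(tidy_stdout):
    """GOOD if the line breaks inside <pre> survived, BAD otherwise."""
    return GOOD if "a {\n  b;\n}" in tidy_stdout else BAD


def check_commit(tidy_path):
    """Feed SAMPLE to the given tidy binary on stdin and classify its output."""
    try:
        result = subprocess.run(
            [tidy_path, "-quiet"],
            input=SAMPLE,
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return SKIP  # binary missing or hung: this commit cannot be judged
    return verdict(result.stdout)
```

A wrapper script would rebuild Tidy, call `sys.exit(check_commit(path_to_binary))`, and be handed to `git bisect run` after `git bisect start next bed8efb`.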

src/parser.c | 7482 +++++++++-----------

I am afraid of the contents of this commit. A diff this large is hard to read.
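
For readers unfamiliar with the kind of change the commit message quoted above describes, here is a generic illustration of replacing recursion with an explicit stack kept on the heap. It is written in Python and is not Tidy's code (Tidy is written in C). The `Node` type and the function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)


def walk_recursive(node, depth=0, out=None):
    """Visit nodes in document order, one call frame per nesting level."""
    out = [] if out is None else out
    out.append((depth, node.name))
    for child in node.children:
        walk_recursive(child, depth + 1, out)
    return out


def walk_iterative(root):
    """Same visiting order, but pending work lives in a list on the heap."""
    out = []
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        out.append((depth, node.name))
        # Push children in reverse so the leftmost child is popped first.
        for child in reversed(node.children):
            stack.append((child, depth + 1))
    return out


def nested_chain(levels):
    """Build a document-like chain in which every node has exactly one child."""
    root = current = Node("n0")
    for i in range(1, levels):
        child = Node(f"n{i}")
        current.children.append(child)
        current = child
    return root


tree = Node("html", [Node("head"), Node("body", [Node("pre", [Node("code")])])])
print(walk_iterative(tree) == walk_recursive(tree))  # True
print(len(walk_iterative(nested_chain(20000))))      # 20000
```

With Python's default recursion limit, `walk_recursive` fails on the 20000-level chain while `walk_iterative` handles it. The price is that state which the call stack used to carry implicitly must now be tracked by hand, and a rewrite of that kind can easily lose something like "we are inside a pre element". Whether that is what happened in this commit is not established in this thread.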

Kristinita commented 2 years ago

+1

1. Summary

Newest HTML Tidy versions remove line breaks inside <pre>.

I can’t find a way to prevent this behavior.

2. Examples

HTML Tidy transforms blocks of code like this:

[Image: Correct HTML-Tidy real example]

into this:

[Image: Incorrect HTML-Tidy real example]

For languages and markup formats like YAML, where indentation is significant, HTML Tidy makes the code invalid:

[Image: Correct HTML-Tidy MCVE]

[Image: Incorrect HTML-Tidy MCVE]

3. MCVE

  1. KiraTidyPygments.html:

    <!doctype html>
    <html lang="en">
        <head>
            <meta charset="utf-8">
            <meta name="viewport" content="width=device-width, initial-scale=1">
            <title>Pygments code block MCVE</title>
        </head>
        <body>
            <!-- [INFO] This code is automatically generated by the “SuperFences” extension for Python Markdown:
            https://facelessuser.github.io/pymdown-extensions/extensions/superfences/#code-highlighting
    
            From:
            ```yaml
            kira:
                goddess: true
        -->
        <div class="SashaBlockHighlight"><pre><span></span><code><span class="nt">kira</span><span class="p">:</span><span class="w"></span>
        <span class="err">  </span><span class="nt">goddess</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span><span class="w"></span>
        </code></pre></div>
    </body>

  2. tidy.conf:

    quiet: yes
    tidy-mark: no
    wrap: 0
    

4. Steps to reproduce

tidy -config tidy.conf -m KiraTidyPygments.html

5. Behavior

5.1. Desired

Preserve line breaks inside <pre>

5.2. Current

-   <div class="SashaBlockHighlight"><pre><span></span><code><span class="nt">kira</span><span class="p">:</span><span class="w"></span>
-   <span class="err">  </span><span class="nt">goddess</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span><span class="w"></span>
-   </code></pre></div>
+   <div class="SashaBlockHighlight">
+   <pre><code><span class="nt">kira</span><span class="p">:</span> <span class="err"> </span><span class="nt">goddess</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span> </code></pre>
+   </div>

HTML Tidy removes the line breaks inside <pre>, so the YAML in the code block becomes invalid, like this:

[Image: Incorrect HTML-Tidy MCVE]

6. Environment

  1. Microsoft Windows [Version 10.0.19041.1415]
  2. HTML Tidy for Windows version 5.9.14

Thanks.

Kristinita commented 2 years ago

Type: Question

@ler762, @ncaq

Do you have any ideas how to fix this problem without downgrading HTML Tidy?

Thanks.

ncaq commented 2 years ago

@Kristinita I have given up on using HTML Tidy for indentation.

Kristinita commented 2 years ago

Type: Question

@ncaq, excuse me, could you elaborate on what you did?

I tried disabling HTML Tidy indentation by changing the indent settings in my configuration file to:

indent: no
indent-spaces: 0

This has no effect. HTML Tidy still removes line breaks inside <pre>.

Thanks.

ncaq commented 2 years ago

@Kristinita I decided to look only at HTML Tidy's status code and ignore the indentation results. I did this in the following commit: changed: indentHtml -> tidyHtml · ncaq/www.ncaq.net@706dad8
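
The idea of using Tidy only as a checker can be sketched as follows. This is not the code from the commit linked above, which is not reproduced in this thread. It is a hedged Python illustration with made-up function names. It relies on the exit statuses of the tidy command-line tool (0 for success, 1 for warnings, 2 for errors) and throws away the tidied markup, so the broken pre formatting never reaches the source file.

```python
import subprocess

# Exit statuses of the tidy command-line tool.
STATUS = {0: "ok", 1: "warnings", 2: "errors"}


def describe_status(returncode):
    """Map a tidy exit status to a short label."""
    return STATUS.get(returncode, "unknown")


def check_html(path, tidy="tidy"):
    """Run Tidy on a file as a validator only.

    The tidied markup written to stdout is discarded and the file is not
    modified, because -m is not passed. Returns the status label together
    with the diagnostics Tidy printed on stderr.
    """
    result = subprocess.run(
        [tidy, "-quiet", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
    )
    return describe_status(result.returncode), result.stderr
```

A build script could then fail on "errors" (and optionally on "warnings") while leaving the formatting of the HTML to another tool.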

perette commented 2 years ago

@Kristinita To retrieve the last-working version:

git clone https://github.com/htacg/tidy-html5
cd tidy-html5
git checkout b6f7e4384295dd28a3eb1edcd5ee3bed23f08ea5

And now build as usual.

Kristinita commented 10 months ago

This issue is still relevant as of January 2024. Because of this bug, users cannot use the features of newer versions of Tidy and are forced to downgrade to a previous version.

Thanks.