htacg / tidy-html5

The granddaddy of HTML tools, with support for modern standards
http://www.html-tidy.org

do not remove leading comment (body only) #803

Open haraldschilly opened 5 years ago

haraldschilly commented 5 years ago

I want tidy to keep a leading comment (documentation and licensing info) when formatting a partial HTML file. A minimal example is:

<!-- foo -->

<div>bar</div>

but after running `tidy -modify --show-body-only auto --indent yes test.html` the file contains only

<div>
bar
</div>

and the comment is gone. The equivalent run on a "full document" keeps the comment. I tested this with tidy version 5.2.0 and also with a version 5.7.22 that I built from source myself.

geoffmcl commented 5 years ago

@haraldschilly, thank you for the feature request... I think ;=))

Wow, while this sounds like a simple feature request, I started to look at how easy this would be... theoretically...

Running your snippet sample, tidy has already built a node tree, seen best in the debug output - see ENABLE_DEBUG_LOG cmake build option -

All nodes AFTER clean and repair
Root
 Comment # this is your `<!-- foo -->`
 DocType   PUBLIC
 StartTag html implicit
  StartTag head implicit
   StartTag meta implicit  name="generator" content="HTML Tidy for HTML5 for Windows version 5.7.22"
   StartTag title implicit
  StartTag body implicit
   StartTag div
    Text   (3) 'bar'

Notice it has already built nodes, before <body>, most marked as implicit... regardless of the TidyBodyOnly state... your comment is stored first in this node list/tree...
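To make the shape of that tree concrete, here is a minimal, self-contained sketch in C. It does not use libTidy's real `Node` struct: the `Node` type, the `NodeType` enum, and the helpers `sample_tree`, `count_implicit` and `first_root_comment` are all hypothetical names invented for illustration. It only mirrors the debug dump above, where the comment is the first child of the root and most of the nodes before `<body>` are implicit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified node model; libTidy's real Node struct differs. */
typedef enum { ROOT, COMMENT, DOCTYPE, START_TAG, TEXT } NodeType;

typedef struct Node {
    NodeType type;
    const char *name;      /* element name, or comment/text content */
    int implicit;          /* 1 if the parser inferred the node */
    struct Node *content;  /* first child */
    struct Node *next;     /* next sibling */
} Node;

/* Build the tree from the debug dump above (attributes omitted). */
static Node *sample_tree(void)
{
    static Node text    = { TEXT,      "bar",   0, NULL,     NULL     };
    static Node div     = { START_TAG, "div",   0, &text,    NULL     };
    static Node body    = { START_TAG, "body",  1, &div,     NULL     };
    static Node title   = { START_TAG, "title", 1, NULL,     NULL     };
    static Node meta    = { START_TAG, "meta",  1, NULL,     &title   };
    static Node head    = { START_TAG, "head",  1, &meta,    &body    };
    static Node html    = { START_TAG, "html",  1, &head,    NULL     };
    static Node doctype = { DOCTYPE,   "",      0, NULL,     &html    };
    static Node comment = { COMMENT,   " foo ", 0, NULL,     &doctype };
    static Node root    = { ROOT,      "",      0, &comment, NULL     };
    return &root;
}

/* Count the nodes, in this sibling list and below, that were inferred. */
static int count_implicit(const Node *n)
{
    int total = 0;
    for (; n != NULL; n = n->next)
        total += n->implicit + count_implicit(n->content);
    return total;
}

/* Return the first comment that is a direct child of the root, or NULL. */
static const Node *first_root_comment(const Node *root)
{
    assert(root != NULL);
    for (const Node *n = root->content; n != NULL; n = n->next)
        if (n->type == COMMENT)
            return n;
    return NULL;
}
```

With this model, `first_root_comment(sample_tree())` returns the ` foo ` comment node, which sits outside the `body` subtree entirely. That is why printing only the body never reaches it.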

Then, only at output, is this simple what-to-print logic applied -

        if ( xmlOut && !xhtmlOut )
            TY_(PPrintXMLTree)( doc, NORMAL, 0, &doc->root );
        else if ( showBodyOnly( doc, bodyOnly ) )
            TY_(PrintBody)( doc );
        else
            TY_(PPrintTree)( doc, NORMAL, 0, &doc->root );

And, of course, in PrintBody, it just finds the body, and pprints that... all done...

So what this request sort of asks is that the PrintBody service search back to the root, checking that all the nodes on the way are implicit (though it seems the implicit mark is not presently added to the DocType node?!), and check for any comment at the root... or something...

This is due to the basic way libTidy is structured... and since a comment is seemingly allowed anywhere - my in_803-3.html passes nu... need ref on that?

But maybe others can see a code way forward, making it easy... I stress code way...
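One possible shape for such a code way forward, purely as a sketch: the block below is not libTidy code and does not touch `TY_(PrintBody)`. The `Node` model and the functions `find_child`, `append`, `print_nodes` and `print_body_keep_comments` are hypothetical, and the serializer is deliberately naive (no indentation, no attributes). It assumes the snippet case can be recognised by `html` and `body` both being implicit, and only then emits comments found at the root before printing the body content.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified node model; libTidy's real Node struct differs. */
typedef enum { ROOT, COMMENT, DOCTYPE, START_TAG, TEXT } NodeType;

typedef struct Node {
    NodeType type;
    const char *name;      /* element name, or comment/text content */
    int implicit;          /* 1 if the parser inferred the node */
    struct Node *content;  /* first child */
    struct Node *next;     /* next sibling */
} Node;

/* Same tree as in the debug dump (attributes omitted). */
static Node *sample_tree(void)
{
    static Node text    = { TEXT,      "bar",   0, NULL,     NULL     };
    static Node div     = { START_TAG, "div",   0, &text,    NULL     };
    static Node body    = { START_TAG, "body",  1, &div,     NULL     };
    static Node title   = { START_TAG, "title", 1, NULL,     NULL     };
    static Node meta    = { START_TAG, "meta",  1, NULL,     &title   };
    static Node head    = { START_TAG, "head",  1, &meta,    &body    };
    static Node html    = { START_TAG, "html",  1, &head,    NULL     };
    static Node doctype = { DOCTYPE,   "",      0, NULL,     &html    };
    static Node comment = { COMMENT,   " foo ", 0, NULL,     &doctype };
    static Node root    = { ROOT,      "",      0, &comment, NULL     };
    return &root;
}

/* Find a direct child element by tag name. */
static const Node *find_child(const Node *parent, const char *name)
{
    for (const Node *n = parent->content; n != NULL; n = n->next)
        if (n->type == START_TAG && strcmp(n->name, name) == 0)
            return n;
    return NULL;
}

/* Append src to the NUL-terminated buffer dst, truncating if needed. */
static void append(char *dst, size_t cap, const char *src)
{
    size_t used = strlen(dst);
    if (used + 1 < cap)
        snprintf(dst + used, cap - used, "%s", src);
}

/* Naive serializer for a sibling list: comments, text and elements only. */
static void print_nodes(const Node *n, char *out, size_t cap)
{
    for (; n != NULL; n = n->next) {
        if (n->type == COMMENT) {
            append(out, cap, "<!--");
            append(out, cap, n->name);
            append(out, cap, "-->\n");
        } else if (n->type == TEXT) {
            append(out, cap, n->name);
        } else if (n->type == START_TAG) {
            append(out, cap, "<");  append(out, cap, n->name);
            append(out, cap, ">");
            print_nodes(n->content, out, cap);
            append(out, cap, "</"); append(out, cap, n->name);
            append(out, cap, ">\n");
        }
    }
}

/* Body-only output that keeps root-level comments appearing before <html>,
 * but only when html and body were inferred (the "snippet" case).
 * Returns 1 on success, 0 if no body was found. */
static int print_body_keep_comments(const Node *root, char *out, size_t cap)
{
    assert(root != NULL && out != NULL && cap > 0);
    out[0] = '\0';
    const Node *html = find_child(root, "html");
    const Node *body = html ? find_child(html, "body") : NULL;
    if (body == NULL)
        return 0;
    if (html->implicit && body->implicit) {
        for (const Node *n = root->content; n != NULL && n != html; n = n->next) {
            if (n->type == COMMENT) {
                append(out, cap, "<!--");
                append(out, cap, n->name);
                append(out, cap, "-->\n");
            }
        }
    }
    print_nodes(body->content, out, cap);
    return 1;
}
```

Gating on the implicit flags is only one possible design choice. It keeps the current body-only output unchanged for full documents whose author wrote `<html>` and `<body>` explicitly, at the cost of treating those two cases differently.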

And then there is the use case? Except for a snippet case, like you have shown, are there others who think this is a valid feature?

Look forward to feedback, comments, patches, PR, etc, etc... thanks...

haraldschilly commented 5 years ago

I can't comment on any technical details, but besides the use case of retaining a license/comment message at the top when editing a partial HTML file, the current behavior is also inconsistent with formatting a "whole" HTML page, e.g.

<!-- license: MIT -->
<html>
<body>
<div>
bar
</div>

after `tidy -modify <filename>` ends up as this, retaining the comment at the top:

<!-- license: MIT -->
<!DOCTYPE html>
<html>
<head>
<meta name="generator" content=
"HTML Tidy for HTML5 for Linux version 5.7.22">
<title></title>
</head>
<body>
<div>bar</div>
</body>
</html>