htacg / tidy-html5

The granddaddy of HTML tools, with support for modern standards
http://www.html-tidy.org

Feature request: trim whitespace from self-closing tags #806

Open ghost opened 5 years ago

ghost commented 5 years ago

Replicate

Create a new file (index.html):

<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8"/>
  <title>Title</title>
  <meta name="description" content="Description"/>
  <meta name="author" content="Author"/>
</head>
<body>
  <p>Hello, world!</p>
</body>
</html>

Then run:

tidy -asxml -m -q --wrap 0 --tidy-mark no --vertical-space auto index.html

Expected Results

The self-closing tags resemble:

<meta name="description" content="Description"/>

Actual Results

There's an extraneous space at the end of the HTML element that can be eliminated.

<meta name="description" content="Description" />

Additional Details

An option to trim self-closing tags of unnecessary whitespace would help reduce HTML file sizes.

geoffmcl commented 5 years ago

@DaveJarvis thank you for the feature request... to reduce output spaces even more...

Since this is already quite a special config case, namely --vertical-space auto, maybe we do not need an additional option for this...

I think I would be ok with an output change to further compress the stream, in this specific case...

And maybe there is not a big use case for it otherwise... that is, create a new option... but...

Look forward to feedback, patches, PR... to explore this... maybe simple... thanks...

geoffmcl commented 5 years ago

@DaveJarvis on commencing some research into this, I note I may have overlooked a second, fairly unusual option you have used, -asxml (syn. -asxhtml)...

In my tests, if this option is removed, there is no " />" in the output... hmmmm...

So I ask: is this a difference between the html/html5 and xhtml specs... in which case the space may be needed, to conform?

Need to explore that... any references, pointers, ... very welcome... thanks...

ghost commented 5 years ago

See:

EmptyElemTag ::= '<' Name (S Attribute)* S? '/>'

The grammar shows that the white space character (S) before the closing tag token is optional (?).
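Because S? is optional, dropping it leaves a well-formed empty-element tag. As a rough illustration only, here is a minimal C sketch that strips the white space before each "/>" in a string. The function trim_empty_tag_space is hypothetical and not part of libTidy, and it is deliberately naive: it does not track quoting, so a "/>" inside an attribute value or text content would be trimmed too.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* XML production S: space, tab, carriage return, line feed. */
static int is_xml_space(char c)
{
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

/* Hypothetical helper (not a libTidy function): removes the optional
 * white space that precedes each "/>" in a NUL-terminated buffer,
 * editing the buffer in place. Returns the new string length. */
size_t trim_empty_tag_space(char *markup)
{
    size_t in, out = 0;

    assert(markup != NULL);
    for (in = 0; markup[in] != '\0'; ++in) {
        if (markup[in] == '/' && markup[in + 1] == '>') {
            /* Back up over white space that was already copied. */
            while (out > 0 && is_xml_space(markup[out - 1]))
                --out;
        }
        markup[out++] = markup[in];
    }
    markup[out] = '\0';
    return out;
}
```

Passing a writable buffer holding `<meta name="description" content="Description" />` should leave it holding `<meta name="description" content="Description"/>`, one byte shorter per tag.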

geoffmcl commented 5 years ago

@DaveJarvis, thank you for the references, particularly the W3C, since that is our affiliation, where it is optional... and libTidy chose to add it, if xhtml output...

So I am back to maybe suppressing it, IFF options -asxhtml AND --vertical-space auto are both active... should be a simple pprint patch...

Or does it need an additional new option? Then it could be applied to all xhtml output... regardless of the vertical-space option... then what name? specs? docs? etc...

Look forward to further feedback, comments, patches, even PR, to test it... thanks...

ghost commented 5 years ago

I'd leave it out by default. If someone needs that particular space, they can pretty-print the XML in any number of ways.

geoffmcl commented 5 years ago

@DaveJarvis thanks for the feedback...

Some code research into this...

In printing the meta tag, in PPrintTag, this is the code that ADDS the space...

    if ( (xmlOut || xhtmlOut) &&
         (node->type == StartEndTag || TY_(nodeCMIsEmpty)(node)) )
    {
        AddChar( pprint, ' ' );   /* Space is NS compatibility hack <br /> */
        AddChar( pprint, '/' );   /* Required end tag marker */
    }

So it seems the optional space was added as a Netscape compatibility hack for <br />... can that be true... hmmmm, that does not seem sufficient reason today to continue adding this space... maybe...

There is a macro, TidyAddVS, which returns no, if current TidyVertSpace is TidyAutoState... or could add a clearer test, like cfgAutoBool( doc, TidyVertSpace ) == TidyAutoState... the xhtmlOut can be tested for on ... so...

So could easily kludge together a condition to avoid adding that space... need to work on that...
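A rough sketch of what such a condition might look like, written as a stand-alone C program fragment rather than a patch to libTidy. Everything here is an assumption for illustration: OutBuf, out_init, add_char and print_empty_tag_end are hypothetical stand-ins, the boolean parameters stand in for the xmlOut/xhtmlOut test, the node-type test, and the TidyVertSpace check, and the closing '>' is assumed to be written right after this point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the pretty-printer's output buffer. */
typedef struct {
    char   text[64];
    size_t len;
} OutBuf;

/* Start the buffer with the part of the tag already printed. */
static void out_init(OutBuf *out, const char *prefix)
{
    size_t n = strlen(prefix);

    assert(out != NULL && n < sizeof out->text);
    memcpy(out->text, prefix, n + 1);
    out->len = n;
}

/* Append one character, keeping the buffer NUL-terminated. */
static void add_char(OutBuf *out, char c)
{
    assert(out != NULL && out->len + 1 < sizeof out->text);
    out->text[out->len++] = c;
    out->text[out->len] = '\0';
}

/* Finish a tag. For XML/XHTML output of an empty element, the '/'
 * marker is always written; the space before it is skipped when
 * vertical-space is set to auto (assumed policy from this thread). */
void print_empty_tag_end(OutBuf *out, bool xml_or_xhtml_out,
                         bool is_empty_elem, bool vert_space_auto)
{
    if (xml_or_xhtml_out && is_empty_elem) {
        if (!vert_space_auto)
            add_char(out, ' ');   /* legacy compatibility space */
        add_char(out, '/');       /* required end tag marker */
    }
    add_char(out, '>');
}
```

With the buffer initialised to `<br`, the auto case would yield `<br/>` and the non-auto case `<br />`, so existing output stays unchanged unless --vertical-space auto is in effect.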

As stated, look forward to further feedback, comments, patches, even PR, to test it... thanks...

ghost commented 5 years ago

It appears it was required by Netscape 4, possibly IE 4, and earlier:

A separate option to enable the space could be useful, but given that those versions of NS and IE have bit the big ol' bit bucket in the sky, there's probably no need. (Again, someone could always run the output through a beautifier that puts it back.)

geoffmcl commented 5 years ago

@DaveJarvis thank you for the great research... certainly adds a historic perspective... no doubt tidy follows that W3C 2000/2002 REC...

But that is marked superseded... I tried to find the 2018? replacement... found some things... but no mention of Empty Elements... especially related to xhtml ... help...

Look forward to feedback, patches, PR, ... testing this space removal... under certain conditions, or always, or what? ... thanks...

ghost commented 5 years ago

The <br/> element is deprecated, but for an overview of the superseded syntax, see the XHTML 2 specification:

The space can be safely removed according to the specifications.