htacg / tidy-html5

The granddaddy of HTML tools, with support for modern standards
http://www.html-tidy.org

On empty list, tidy transforms a valid XHTML file into an invalid one #768

Open hosiet opened 5 years ago

hosiet commented 5 years ago

I'm forwarding some longstanding downstream issues here, one of which concerns empty lists. Previous reports:

Tidy transforms some valid XHTML files into invalid ones. For instance, the source has:

<ul class="ul"><li class="li"></li></ul>

which is valid. Tidy removes the empty li, but not the ul (this doesn't happen if one removes the class attribute), so that one gets:

<ul class="ul"></ul>

which is invalid (there must be at least one li).
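The rule in question (a ul or ol needs at least one li in XHTML 1.0 Strict) can be checked mechanically. Below is a minimal sketch using only the Python standard library; the function name `empty_lists` and the shortened sample documents are illustrative assumptions, and this checks just this one rule, it is not a full validator.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def empty_lists(xhtml_text):
    """Return every ul/ol element that has no li child.

    Hypothetical helper: it only tests the "at least one li" rule.
    """
    root = ET.fromstring(xhtml_text)
    offenders = []
    for tag in ("ul", "ol"):
        # ElementTree spells namespaced tags as "{namespace}localname".
        for lst in root.iter(f"{{{XHTML_NS}}}{tag}"):
            if lst.find(f"{{{XHTML_NS}}}li") is None:
                offenders.append(lst)
    return offenders


# Shortened stand-ins for the input and for tidy's output.
before = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<ul class="ul"><li class="li"></li></ul></body></html>'
)
after = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<ul class="ul"></ul></body></html>'
)

print(len(empty_lists(before)))  # 0
print(len(empty_lists(after)))   # 1
```

Running such a check on tidy's output would flag the ul that was left without any li.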

Sample test case:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<!-- $Id: tidy-empty-list.html 43963 2011-05-26 12:08:28Z vinc17/ypig $ -->
<head>
<title>Test of tidy on an empty list</title>
</head>
<body>
<p>Debian's <cite>Tidy</cite> 20091223cvs-1 transforms this valid XHTML
file into an invalid one: it removes the empty <samp>li</samp> but keeps
the <samp>ul</samp> element due to its <samp>class</samp> attribute!</p>
<ul class="ul"><li class="li"></li></ul>
</body>
</html>

geoffmcl commented 5 years ago

@hosiet thank you for cross posting this here... and the sample xhtml...

I can confirm that even the current tidy 5.7.16 drops the empty <li>, as does that old 20091223cvs-1 version...

In the current version you can add the --drop-empty-elements no option to the config to avoid this...

But this ref - https://www.w3.org/2010/04/xhtml10-strict.html#elem_ul - says "At least one of li", so, as you suggest, an empty list is invalid in XHTML. This needs more W3C references, and libtidy needs a fix... it should not be difficult...

Appreciate further feedback, patches or PR... thanks...

geoffmcl commented 5 years ago

@hosiet looking further into this... at first I thought it might be an HTML4 (one or more li) versus HTML5 (zero or more li) difference, something addressed in #396... but I now think this is maybe a configuration issue...

If you tell tidy to treat the input as well-formed XML, with either -xml or --input-xml yes, then TY_(ParseXMLDocument)(TidyDocImpl* doc) is used. It does not end the parsing with TY_(DropEmptyElements)(doc, &doc->root); so I think you will get the desired output...

F:\Projects\tidy-test\test>tidy5 -v
HTML Tidy for Windows version 5.7.16
F:\Projects\tidy-test\test>tidy5 -xml input5\in_768.html
No warnings or errors were found.

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<!-- $Id: tidy-empty-list.html 43963 2011-05-26 12:08:28Z vinc17/ypig $ -->
<head>
<title>Test of tidy on an empty list</title>
</head>
<body>
<p>Debian's
<cite>Tidy</cite>20091223cvs-1 transforms this valid XHTML file
into an invalid one: it removes the empty
<samp>li</samp>but keeps the
<samp>ul</samp>element due to its
<samp>class</samp>attribute!</p>
<ul class="ul">
<li class="li"></li>
</ul>
</body>
</html>
F:\Projects\tidy-test\test>tidy-2009 -v
HTML Tidy for Windows released on 25 March 2009
**same output**

As can be seen, this also works for the tidy-2009 release...

To repeat, this only happens if tidy is allowed to default to using its HTML parser... where, at least in HTML5, such a deletion is not a problem... and can be overridden with the option --drop-empty-elements no, as a user choice...
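For anyone scripting this, the two workarounds mentioned in this thread can be wrapped as below. This is only a sketch: the helper names `tidy_command` and `run_tidy`, the `workaround` labels, and the executable name `tidy` are assumptions (the transcript above uses tidy5); the flags themselves are the ones quoted in this thread.

```python
import shutil
import subprocess


def tidy_command(path, workaround="keep-empty", exe="tidy"):
    """Build a tidy command line for one of the two workarounds.

    Hypothetical helper; `exe` is assumed to be on PATH.
    """
    if workaround == "keep-empty":
        # HTML parser, but empty elements are not dropped.
        flags = ["--drop-empty-elements", "no"]
    elif workaround == "xml":
        # Treat the input as well-formed XML.
        flags = ["-xml"]
    else:
        raise ValueError(f"unknown workaround: {workaround!r}")
    return [exe, *flags, path]


def run_tidy(path, workaround="keep-empty", exe="tidy"):
    """Run tidy and return its stdout, or None if tidy is not installed."""
    if shutil.which(exe) is None:
        return None
    cmd = tidy_command(path, workaround, exe)
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout


print(tidy_command("input5/in_768.html"))
# ['tidy', '--drop-empty-elements', 'no', 'input5/in_768.html']
print(tidy_command("input5/in_768.html", workaround="xml"))
# ['tidy', '-xml', 'input5/in_768.html']
```

Note that tidy's exit status is deliberately not checked here, since a non-zero status can simply mean that warnings were reported.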

The static Bool CanPrune(...) service could be enhanced to do some check on the tidy mode, if this problem needs to be addressed in HTML4 documents... but maybe that could be addressed as a separate new issue... thanks...
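To make the idea of a mode-aware prune check concrete, here is a toy model in Python. It is not libtidy's CanPrune: the `Node` class, the `can_prune` signature, and the `html5` flag are all hypothetical, and the real function applies other conditions that are ignored here. The sketch only shows the proposed extra rule, namely that outside HTML5 the last remaining li of a list is not pruned.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a parsed element."""
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def can_prune(node, parent, html5):
    """Toy rule: may this empty node be dropped?

    Outside HTML5 (one or more li required), an empty li is kept
    when it is the only li left in its ul/ol.
    """
    if node.children:
        return False  # not empty, never pruned in this model
    is_list_item = (
        node.tag == "li"
        and parent is not None
        and parent.tag in ("ul", "ol")
    )
    if not html5 and is_list_item:
        remaining = [c for c in parent.children if c.tag == "li"]
        return len(remaining) > 1
    return True


ul = Node("ul", {"class": "ul"}, [Node("li", {"class": "li"})])
li = ul.children[0]

print(can_prune(li, ul, html5=True))   # True
print(can_prune(li, ul, html5=False))  # False
```

With a rule like this, the sample from this issue would keep its single empty li in HTML4/XHTML 1.0 mode, while HTML5 documents would behave as they do today.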

Does this solve the problem of deleting the empty <li>... in valid xhtml... thanks...