htacg / tidy-html5

The granddaddy of HTML tools, with support for modern standards
http://www.html-tidy.org
2.7k stars 415 forks source link

XHTML5 wrongly recognized as XHTML 1.0 Transitional #657

Open dechamps opened 6 years ago

dechamps commented 6 years ago

On current HEAD (f0438bd):

$ tidy <<EOF
<?xml version="1.0" ?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"><head><title></title></head>
<body>
<img src="foo.jpg" alt="Foo" />
<br />
</body></html>
EOF

Reports:

Info: Document content looks like XHTML 1.0 Transitional

Even though the document is quite obviously XHTML5, with the XML preamble combined with <!DOCTYPE html> and the xmlns attribute being dead giveaways.

Weirdly enough, the document is correctly recognized as XHTML5 if the following is added:

<meta charset="utf-8" />
balthisar commented 6 years ago

Yeah, that looks true. Tidy didn't detect any features specific to XHTML5, so it defaults to the lowest possible version. We should look at the missing DTD and realize that it can't be less than XHTML5. When the meta is added, as you indicate, a flag is set indicating (X)HTML5, and so the lowest possible version becomes HTML5 and reports correctly.

On the other hand, that message is an "informational message" (--show-info no can hide them), and doesn't affect processing of the document.

PR's to fix are welcome.

dechamps commented 6 years ago

On the other hand, that message is an "informational message" (--show-info no can hide them), and doesn't affect processing of the document.

Well it does affect warnings at the very least, especially when you take #377 into account:

$ tidy <<EOF
<?xml version="1.0" ?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"><head><title></title></head><body>
<table><tr><td></td></tr></table>
</body></html>
EOF
line 9 column 1 - Warning: <table> lacks "summary" attribute
Info: Document content looks like XHTML 1.0 Transitional
Tidy found 1 warning and 0 errors!

Whereas:

$ tidy <<EOF
<?xml version="1.0" ?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta charset="utf-8" />
<title></title>
</head><body>
<table><tr><td></td></tr></table>
</body></html>
EOF
Info: Document content looks like XHTML5
No warnings or errors were found.

(As a matter of fact, I came across this bug while struggling with #377. It took me a while to figure out why the <meta charset="utf-8" /> tag somehow had an effect on the <table> lacks "summary" attribute warning - this was a very confusing moment.)

geoffmcl commented 6 years ago

@dechamps thanks for the issue... as #377 and @balthisar point out, tidy's guess of the type of document is exactly that, a guess, and is done by eliminating bits as certain features are found...

There is a cmake option, -DENABLE_DEBUG_LOG:BOOL=ON, which will add a lot of debug output. It is automatically enabled for a Windows MSVC Debug configuration build, but was recently added to the linux build...

And one of these outputs, also written to a log file for later review, is showing how certain features elimitate bits, leading to the final choice in the TY_(HTMLVersion)(doc) function...

It will show, a diminishing bit list, as each feature is found, like -

Before: HT20|HT32|H40S|H40T|H40F|H41S|H41T|H41F|X10S|X10T|X10F|XH11|XB10|----|HT50|XH50
After : HT20|HT32|H40S|H40T|----|H41S|H41T|----|X10S|X10T|----|XH11|XB10|----|HT50|XH50
After : ----|HT32|H40S|H40T|----|H41S|H41T|----|X10S|X10T|----|XH11|XB10|----|HT50|XH50
After : ----|----|H40S|H40T|----|H41S|H41T|----|X10S|X10T|----|XH11|----|----|HT50|XH50
After : ----|----|----|----|----|----|----|----|X10S|X10T|----|XH11|----|----|----|XH50
After : ----|----|----|----|----|----|----|----|X10S|X10T|----|----|----|----|----|XH50
After : ----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|XH50

Now using your sample, and running my Debug tidy5d.exe, copy the log temptidy.txt to temptidy1,txt, then adding the <meta charset="utf-8" /> to the sample, run again, and doing a diff between the two debug logs shows -

 After : ----|----|----|----|----|----|----|----|X10S|X10T|----|XH11|XB10|----|HT50|XH50
+After : ----|----|----|----|----|----|----|----|----|----|----|----|----|----|HT50|XH50

Note the addition of the meta tag eliminated lots more bits, and that effectively changed the Info: output -

Info: Document content looks like XHTML 1.0 Strict
-- to --
Info: Document content looks like XHTML5

And this can then change the warnings, and/or errors, found and reported, as you have discovered...

Now that TY_(HTMLVersion) function already has a lot of changes added, it is quite a mess! already...

But more or less as you suggest, it could be improved to take account of the xml preamble, and especially the xmlns attribute, which could be used to eliminate the pure XML bits, and even the HT50 bit, leaving XHTML5 as the only choice...

Look forward to further feedback, patches, or a PR to achieve this... thanks...

dechamps commented 6 years ago

Thanks for the explanation. I think what would make the most sense is if tidy sees <!DOCTYPE html> (which is a dead giveaway that the document is either HTML5 or XHTML5), it should drop every bit except for HT50 and XH50. Presumably that should do the trick.

geoffmcl commented 6 years ago

@dechamps thanks for the further feedback on this... did some more testing...

... if tidy sees <!DOCTYPE html> ...

Some of this makes good sense, and going back into the past legacy DOCTYPES, that is what tidy used to do... but maybe it is not a question of eliminating bits...

The following is the results of running tidy on a simple document as follows -

<!DOCTYPE "various">
<html>
<head>
<title>Issue #657-6</title>
</head>
<body>
<p>HTML 5</p>
</body>
</html>

Each with the doctype header shown, the Info: output, and the last debug bits output -

Input: input5\in_657-2.html

Info: Doctype given is "-//W3C//DTD XHTML Basic 1.0//EN"
Info: Document content looks like XHTML Basic 1.0
No warnings or errors were found.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN"
  "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd">
After : HT20|HT32|H40S|H40T|----|H41S|H41T|----|X10S|X10T|----|XH11|XB10|----|HT50|XH50

Input: input5\in_657-3.html

Info: Doctype given is "-//W3C//DTD HTML 3.2//EN"
Info: Document content looks like HTML 3.2
No warnings or errors were found.
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
After : HT20|HT32|H40S|H40T|----|H41S|H41T|----|X10S|X10T|----|XH11|XB10|----|HT50|XH50

Input: input5\in_657-4.html

Info: Doctype given is "-//W3C//DTD HTML 4.01//EN"
Info: Document content looks like HTML 4.01 Strict
No warnings or errors were found.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
After : HT20|HT32|H40S|H40T|----|H41S|H41T|----|X10S|X10T|----|XH11|XB10|----|HT50|XH50

Input: input5\in_657-5.html

Info: Doctype given is "-//W3C//DTD XHTML 1.0 Strict//EN"
Info: Document content looks like XHTML 1.0 Strict
No warnings or errors were found.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
After : HT20|HT32|H40S|H40T|----|H41S|H41T|----|X10S|X10T|----|XH11|XB10|----|HT50|XH50    

So we can see in every case, tidy used to show the given doctype, and baring no contra features found in the document, would output the Info: ...looks like... of that doctype... and there has been very little elimination of the bits...

When it came to using <!DOCTYPE html> tidy stopped doing one of the Info: messages???

Input: input5\in_657-6.html

Info: Document content looks like HTML5
No warnings or errors were found.
<!DOCTYPE html>
After : HT20|HT32|H40S|H40T|----|H41S|H41T|----|X10S|X10T|----|XH11|XB10|----|HT50|XH50

Note each of those input test files are in my test repo...

Now what this shows is that the choice of the looks like info is not solely based on the bits remaining, and then choosing among those... so some of my earlier comments are quite misleading in this regard...

Right now I am way back at I do not fully understand TY_(HTMLVersion)(doc)! It is called many times, but it seems the most important call is from TY_(ReportMarkupVersion)(doc), which is where the looks like info message is generated...

One of the most frequent calls is from GetTokenFromStream, just to set fixComments = (TY_(HTMLVersion)(doc) & HT50) == 0;...

First the Doctype given is ... message is only emitted IFF doc->givenDoctype is set. So then why is this not set for <!DOCTYPE html>? It seems it should be set even in this case - a doctype has been given...

Then it calls TY_(ApparentVersion)( doc );, but only if ( ! cfgBool(doc, TidyXmlTags) ). This can be user set by --input-xml, and there are a few places in the code where it can be set... that needs to be explored... but leaving xml parsing out of this for the moment...

But the results of TY_(ApparentVersion)( doc ); are passed to TY_(HTMLVersionNameFromCode)(...), and if this results in a string then that is output... if not then tidyLocalizedString(STRING_HTML_PROPRIETARY); string is output...

As indicated I have tried many times in the past to get my head around all this, and am still coming up short...

So I am not sure it is as simple as using the existence of <!DOCTYPE html> to eliminate all bits except HT50|XH50, while as you state "that should do the trick"! You will note in the 5 examples, essentially all bits remain...

But I am sure doc->givenDoctype should be set, even for <!DOCTYPE html>, but that too is not the whole story...

But do agree with the presence of <!DOCTYPE html> is "a dead giveaway that the document is either HTML5 or XHTML5"...

Sorry I seem to be raising more questions than answers here! Even bringing into question exactly what does this Info: ... looks like ... really mean... but agree in this simple sample given case tidy should not wrongly report XHTML 1.0... just how to solve it, without breaking backward legacy docs...

Will continue to look at this, explore, experiment, but meantime seek further feedback, ideas, comments, even test patches or PR... thanks...