False positive 0008:ol-ul/valid when <li/> element includes a EmptyElemTag

christianhujer commented 4 years ago

Given the following well-formed and valid XHTML 5.2 file:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
    <title>Contact Us</title>
</head>
<body>
    <ul class="mainMenu nav nav-pills">
        <li><i/> Home</li>
    </ul>
</body>
</html>

Milo reports a false positive:

false-positives/ul-ol-li-children.xhtml:0:1: UL and OL must only have LI direct children (0008:ol-ul/valid)

paganotoni commented 4 years ago

Ha! This one is interesting. There is a problem with tag parity there as i is not a void element. But definitely something to review on the ul|ol/li rule. Thanks @christianhujer

christianhujer commented 4 years ago

Of course  is valid element. And that it's not a void element is irrelevant, as that distinction is only important for the HTML syntax of HTML documents, not the XML syntax for HTML documents. As the file suffix and the use of the xmlns attribute indicate, this file is an HTML document in XML syntax.  is empty italics in EmptyElemTag syntax. And both, Nu Valiator and xmllint agree on the well-formedness and validity of above sample document. The use case is typically in combination with something like Font Awesome, like .

But even if we assume HTML syntax, the error is wrong because still the  is a child of <li> and not <ul>.

paganotoni commented 4 years ago

🤔 Thanks for sharing that. We initially were not considering the XML syntax for HTML documents, and to be honest I may not have much experience with that as we have been focusing on the HTML side of it. Maybe we should be explicit about this and focus on that small portion of the specification, ensure it works well to then expand to the larger spec that includes XML for HTML.

May I get your advise on documents/specs that we could read to understand deeply the spec? Appreciate your help and guidance with all this.

christianhujer commented 4 years ago

The XML Syntax is actually much simpler than the HTML Syntax. When it comes to the syntax itself, you can take a shortcut: Simply use a non-validating XML parser and report whatever errors it throws. In XML-language, parsing an XML document implies a check for well-formedness, which is XML-speak for syntax-check.

Then use the same semantics as those for HTML for semantical validation.

That means the checks would need to be grouped into syntactical checks and semantical checks. This is probably a good idea anyway.

To have nice error messages, you would need an XML parser that provides meta-information for nodes such as line and column. I don't know if such an XML parser exists, but it should be easy to create.

The relevant specifications are these, sorted by importance for linting HTML (most important first):

Extensible Markup Language (XML) 1.0 (Fifth Edition) https://www.w3.org/TR/2008/REC-xml-20081126/
Namespaces in XML 1.0 (Third Edition) https://www.w3.org/TR/2009/REC-xml-names-20091208/
HTML 5.2 https://www.w3.org/TR/2017/REC-html52-20171214/
Scalable Vector Graphics (SVG) 1.1 (Second Edition) https://www.w3.org/TR/2011/REC-SVG11-20110816/
HTML+RDFa 1.1 - Second Edition https://www.w3.org/TR/2015/REC-html-rdfa-20150317/
XHTML+RDFa 1.1 - Third Edition https://www.w3.org/TR/2015/REC-xhtml-rdfa-20150317/
RDFa Lite 1.1 - Second Edition https://www.w3.org/TR/2015/REC-rdfa-lite-20150317/
RDFa Core 1.1 - Third Edition https://www.w3.org/TR/2015/REC-rdfa-core-20150317/
Scalable Vector Graphics (SVG) 2 https://www.w3.org/TR/2018/CR-SVG2-20181004/
RDFa 1.1 Primer - Third Edition https://www.w3.org/TR/2015/NOTE-rdfa-primer-20150317/
XML Inclusions (XInclude) Version 1.0 (Second Edition) https://www.w3.org/TR/2006/REC-xinclude-20061115/
XML Inclusions (XInclude) Version 1.1 https://www.w3.org/TR/2016/NOTE-xinclude-11-20160721/
Extensible Markup Language (XML) 1.1 (Second Edition) https://www.w3.org/TR/2006/REC-xml11-20060816/
Namespaces in XML 1.1 (Second Edition) https://www.w3.org/TR/2006/REC-xml-names11-20060816/

Don't let this list overwhelm you. It's easy. And half of the documents are relevant even if you ignore XML (for example, everything related to RDFa and SVG remains relevant for linting HTML even if the XML Syntax variant of HTML is ignored).

The way how XML works makes stuff easy: 95% of linting can be categorized as:

Syntactic linting performed by the Parser (and in case of XML, if this fails, you're allowed to stop).
Semantic linting using XML DTD or XML Schema

I haven't looked into its source code, but I bet that the Nu Validator might give a lot of inspiration about how to do this on various levels from architecture to implementation.

larrymjordan commented 4 years ago

Of course  is valid element. And that it's not a void element is irrelevant, as that distinction is only important for the HTML syntax of HTML documents, not the XML syntax for HTML documents. As the file suffix and the use of the xmlns attribute indicate, this file is an HTML document in XML syntax.  is empty italics in EmptyElemTag syntax. And both, Nu Valiator and xmllint agree on the well-formedness and validity of above sample document. The use case is typically in combination with something like Font Awesome, like .

But even if we assume HTML syntax, the error is wrong because still the  is a child of <li> and not <ul>.

@christianhujer Thanks for bringing this case to discussion. So this is a case of a misnested tag. Misnested tag is considered an error handling and strange case when parsing a HTML file. I have to clarify that Milo in this moment does not support XML parsing or processing. XHTML files are processed as HTML files and follow all process for parsing and tokenization expressed here

What is interesting about Nu validator is that "The parser is designed to work as a drop-in replacement for the XML parser in applications that already support XHTML 1.x content with an XML parser and use SAX, DOM or XOM to interface with the parser". In some ways this can inspire Milo to a broader scope in a future.

christianhujer commented 4 years ago

It's not a misnested tag from the perspective of the file. The file is correct. It's only a misnested tag from the perspective of Milo because Milo misinterprets the file as HTML in HTML Syntax instead of HTML in XML Syntax.

Still there is an issue from the perspective of the HTML Syntax, the error message is misleading. The error message should be that a close tag  was expected, but a close tag </li> was received.

paganotoni commented 4 years ago

I agree that this reviewers should not fire on this case. There is another reviewer in charge of unpaired tags which should point out the  issue.

The HTML in XML Syntax topic is one that will need more thought. Being Milo a side project and something we want to find useful initially I trend to think we will keep doing the assumption of HTML in HTML Syntax. One thing that we could do is being more explicit about this topic in the docs (README for now).

@christianhujer I really appreciate that you gave us a light with regards this topic of HTML in XML Syntax.

christianhujer commented 4 years ago

If I had time, I would love to help you with the implementation of the XML Parser. Alas, I'm quite occupied these days. But Milo will certainly be a long-term project, I can see that from what's there and from the changes. So, I'm sure one day it will get an XML Parser.

paganotoni commented 4 years ago

@larrymjordan I'm covering this one while removing the go query dependency.

wawandco / milo

False positive 0008:ol-ul/valid when <li/> element includes a EmptyElemTag #45