qt4cg / qtspecs

QT4 specifications
https://qt4cg.org/

fn:parse-html: Finalization #850

Open ChristianGruen opened 8 months ago

ChristianGruen commented 8 months ago

Now that fn:parse-html has been added to the specification, we need test cases for all provided options and input types (including binary input).

Looking at the current set of test cases, it seems unrealistic to use older libraries such as TagSoup for this function. I wonder if we should support ·implementation-defined· parsing algorithms at all. What do others think?

Next, is there any implementation available that supports all given method/html-version variants?

michaelhkay commented 8 months ago

I think we've gone over the top in terms of the number of options provided. I'd like to see implementations given a bit more flexibility and users given a bit less (spurious) control.

rhdunn commented 8 months ago

I'd be happy to simplify the option set to {"method": "html", "html-version": "5"} plus any other vendor- or implementation-supported values, with the description saying that 5 can refer to any of the W3C HTML 5.x RECs or to the WHATWG HTML LS spec. This still allows a vendor to provide any other HTML parser they have (TagSoup, HTML Tidy, etc.), or additional support for things like Microsoft Word flavoured HTML if they want.

Note: A conforming HTML 5/LS parser will be able to parse XHTML 1.0/1.1 documents into an XMLDocument per the https://html.spec.whatwg.org/#parsing-xhtml-documents section, including detecting those from the DOCTYPE/DTD.

I think it's useful to keep the list of older HTML specs in some way for reference.

The encoding option can be useful, and is easy to integrate into the HTML parsing pipeline, as there is specific HTML spec language referenced for this behaviour.

The include-template-content option is also useful/necessary for interpreting the template element content. The HTML spec provides different rules for XSLT and XPath, which this option supports.
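
As a rough illustration of the simplified option set described above, here is a minimal Python sketch of how an implementation might resolve the options map. It is not taken from the spec: the function name `resolve_options`, the `vendor_methods` parameter, the default values (in particular `False` for include-template-content), and the error behaviour are all assumptions made for the example.

```python
# Hypothetical sketch: names, defaults and error handling are illustrative
# assumptions, not taken from the fn:parse-html specification text.
STANDARD_DEFAULTS = {
    "method": "html",
    "html-version": "5",
    "encoding": None,                   # None: leave encoding detection to the parser
    "include-template-content": False,  # assumed default
}


def resolve_options(options, vendor_methods=frozenset()):
    """Merge caller options over the defaults and check the method/version pair."""
    unknown = set(options) - set(STANDARD_DEFAULTS)
    if unknown:
        raise ValueError(f"unsupported option(s): {sorted(unknown)}")

    resolved = {**STANDARD_DEFAULTS, **options}
    method = resolved["method"]

    if method == "html":
        # The one standardized combination in the simplified proposal.
        if resolved["html-version"] != "5":
            raise ValueError("method 'html' only supports html-version '5' here")
    elif method not in vendor_methods:
        # Anything else must be a parser the vendor explicitly provides.
        raise ValueError(f"unsupported method: {method!r}")

    return resolved
```

With this shape, `resolve_options({})` yields the standard HTML5 configuration, while a processor that ships an extra parser could call `resolve_options({"method": "tidy"}, vendor_methods={"tidy"})` without the standard option names changing.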

ChristianGruen commented 7 months ago

If we keep support for non-conforming parsers like TagSoup, I have some concerns that the result of the function will not be comparable to the output of other processors. It is also not testable via the test suite.

Is this something we want to live with, or shouldn’t we rather enforce conformance? For other results, people could still use vendor-specific extensions.

rhdunn commented 7 months ago

Parsers like TagSoup and HTML Tidy are intended to be vendor-specific extensions. The mentions in the spec are non-normative notes/examples. I can update the wording to make the relevant sections clearer.

michaelhkay commented 7 months ago

I think we should stick to the tradition (cf. regular expressions) where our specifications set high expectations for conformance. That won't always ensure that implementors achieve the high standards we set, but that's their choice.

michaelhkay commented 6 months ago

I've been looking at this again. I think it's very unlikely that implementations will offer multiple options on how to parse the HTML, or that users will select the right options if they do.

The choice between HTML and XHTML is real, but reading the spec carefully it's not actually clear what method="xhtml" is expected to do.

I'm submitting a PR that does some editorial tidying up but it's not making any substantive changes.

rhdunn commented 6 months ago

Given we are simplifying this, I suggest just using an implementation-defined version of HTML5: any of HTML 5 through 5.2, or the WHATWG HTML Living Standard. That will then take care of parsing the different older versions of HTML, including XML variants. It also allows implementations to use more recent versions of the living standard.

Having a method/parser selection is still useful for implementations that also provide their own HTML parsers -- e.g. MarkLogic support for HTML Tidy -- as it gives a standardized API to access those.

michaelhkay commented 6 months ago

So long as we have an options parameter, if we define it to follow the option parameter conventions (see issue #955), we don't need to mention any vendor-specific options because they're covered by the general rules.

kosek commented 5 months ago

I think that supporting parsing algorithms other than HTML5 doesn't bring any additional value. Browsers currently support only this algorithm, and use it to parse all older versions of HTML as well as XHTML served with the wrong media type. So supporting anything other than the HTML5 parsing algorithm would just be an additional burden on implementers.

michaelhkay commented 4 weeks ago

Dropping the "PR Pending" tag. PR850 has been accepted, but it claimed that it didn't entirely close this issue.