microsoft / tsdoc

A doc comment standard for TypeScript
https://tsdoc.org/
MIT License

Add XML Parsing with `DocXmlElement` #331

Closed suneettipirneni closed 1 month ago

suneettipirneni commented 2 years ago

Closes #138

This PR adds the ability for tsdoc to parse XML elements. For example, the following input:

/**
 * <warning>
 *  This is a <bold>destructive</bold> operation.
 * </warning>
 */

Would yield this AST:

- Comment
  - Section
    - Paragraph
      - XmlElement
        * XmlStartTag_OpeningDelimiter="<"
        * XmlElement_Name="warning"
        * XmlStartTag_ClosingDelimiter=">"
        - PlainText
          * PlainText="This is a "
        - XmlElement
          * XmlStartTag_OpeningDelimiter="<"
          * XmlElement_Name="bold"
          * XmlStartTag_ClosingDelimiter=">"
          - PlainText
            * PlainText="destructive"
          * XmlEndTag_OpeningDelimiter="</"
          * XmlEndTag_Name="bold"
          * XmlEndTag_ClosingDelimiter=">"
        - PlainText
          * PlainText="operation."
        * XmlEndTag_OpeningDelimiter="</"
        * XmlEndTag_Name="warning"
        * XmlEndTag_ClosingDelimiter=">"

Behavior

Unlike the current version of the parser, start and end tags alone aren't considered valid components. Take the following input:

<a>b</c>

Currently this input will result in a DocHtmlStartTag for <a>, DocPlainText for b, and DocHtmlEndTag for </c>. Each of these nodes, viewed in isolation, is valid syntax; however, when parsed together as a unit they represent invalid XML. Thus, with the new XML parser this example would be considered erroneous:

Expecting closing tag name to match opening tag name, got "c" but expected "a"

As for its representation within the AST, the given example would not produce anything resembling a DocXmlElement; instead it would just be DocErrorText. Once again, this is because the input as a whole cannot be considered valid XML, so it is not constructed as an XML node in any way.

That being said, I think my current implementation can be improved to recover more of the text even in erroneous code.
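The matching rule described above can be illustrated with a small stack-based sketch. This is hypothetical illustration code, not the actual NodeParser logic; the `TagToken` shape and `checkTagNesting` name are invented for the example.

```typescript
// Hypothetical sketch of the tag-matching rule (not the real NodeParser):
// each end tag must match the most recently opened, still-unclosed start tag.
interface TagToken {
  kind: 'start' | 'end';
  name: string;
}

function checkTagNesting(tags: TagToken[]): string | undefined {
  const stack: string[] = [];
  for (const tag of tags) {
    if (tag.kind === 'start') {
      stack.push(tag.name);
    } else {
      const expected: string | undefined = stack.pop();
      if (expected !== tag.name) {
        return `Expecting closing tag name to match opening tag name, got "${tag.name}" but expected "${expected}"`;
      }
    }
  }
  // A leftover start tag means the element was never closed.
  return stack.length > 0 ? `Expecting a closing tag for "${stack[stack.length - 1]}"` : undefined;
}
```

Under this rule, `<a>b</c>` produces the error quoted above, while `<a>b</a>` passes.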

Some Possible Changes

To align with the XML spec, XML nodes can have one of two kinds of child nodes (plain text or nested elements).

However, in writing the parser I was unsure about a possible third case: SoftBreak.

Soft breaks are valid children within XML elements. Currently soft breaks aren't parsed as individual nodes, but as part of a PlainText node.

For example, the input:

<a>b
</a>

Would generate the following AST:

- XmlElement
  - PlainText
    * PlainText="b\n"

Alternatively the AST could be:

- XmlElement
  - PlainText
    * PlainText="b"
  - SoftBreak
    * SoftBreak="\n"

I'd love to hear feedback on which approach is preferred here.
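The second (SoftBreak-as-node) approach could be sketched as a split over the raw text. This is a hypothetical sketch, not parser code from this PR; the `TextNode` shape and `splitSoftBreaks` name are invented for illustration.

```typescript
// Hypothetical sketch: split raw text so that line endings become
// SoftBreak nodes and the remaining runs become PlainText nodes.
type TextNode = { kind: 'PlainText' | 'SoftBreak'; text: string };

function splitSoftBreaks(raw: string): TextNode[] {
  const nodes: TextNode[] = [];
  // The capturing group keeps the "\n" / "\r\n" separators in the result.
  for (const piece of raw.split(/(\r?\n)/)) {
    if (piece === '') continue;
    nodes.push(
      /\r?\n/.test(piece)
        ? { kind: 'SoftBreak', text: piece }
        : { kind: 'PlainText', text: piece }
    );
  }
  return nodes;
}
```

For the input above, `splitSoftBreaks("b\n")` yields a PlainText node for `"b"` followed by a SoftBreak node for `"\n"`.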

Additionally, XML has special escape characters. Should we parse something like the following:

<a>1 is &lt; 100</a>

As

- XmlElement
  - PlainText
    * PlainText="1 is < 100"

I would appreciate feedback on this as well.
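If the answer is yes, decoding the five predefined XML entities into plain text could look like the following sketch (hypothetical; tsdoc's actual behavior here is exactly what's being asked):

```typescript
// Hypothetical sketch: decode the five predefined XML entities
// (XML 1.0 defines lt, gt, amp, apos, quot) into their literal characters.
const XML_ENTITIES: ReadonlyMap<string, string> = new Map([
  ['&lt;', '<'],
  ['&gt;', '>'],
  ['&amp;', '&'],
  ['&apos;', "'"],
  ['&quot;', '"'],
]);

function decodeXmlEntities(text: string): string {
  return text.replace(/&(?:lt|gt|amp|apos|quot);/g, (m) => XML_ENTITIES.get(m) as string);
}
```

With this, the element content `1 is &lt; 100` would become the PlainText value `1 is < 100`.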


octogonz commented 2 years ago

@suneettipirneni I finally got some time to look at this PR tonight -- it is really good! 👍👍 Very cool that you added lots of test coverage. I need a little more time to study the logic in NodeParser.ts, but overall this PR is on the right track.

octogonz commented 2 years ago

However, in writing the parser I was unsure about a possible third case: SoftBreak.

According to the current design of the AST, I believe that \n is mostly forbidden in DocPlainText: https://github.com/microsoft/tsdoc/blob/d8ce4aedcf64b5c948e7152ba02ac6c7bdedc9be/tsdoc/src/nodes/DocPlainText.ts#L49-L52

(I forget why this constraint was introduced. Maybe so that visitors that process DocPlainText do not have to worry about CRLF vs LF line ending differences?)

octogonz commented 2 years ago

Additionally, XML has special escape characters. Should we parse something like the following:

Yes, in fact HTML entities are an important feature for representing arbitrary Unicode characters in an ASCII source file. CommonMark allows them basically anywhere except for code spans, not just inside an XML element.

But the problem is bigger than this PR. For example, how best to represent an HTML entity inside an attribute value? And should TSDoc allow entities inside other contexts such as {@link ____}?

The problem is not super hard, but a rigorous specification of escaping syntaxes was one of the details that I thought we should figure out before writing the TSDoc spec.

suneettipirneni commented 2 years ago

When I was fixing the duplicate attribute problem in the parser an idea popped into my head. I think we could improve the way we store attributes by using maps instead of arrays. From a developer perspective, if I want to access an attribute I'd usually already know the key for the value I want. Storing attributes in a map would allow developers to skip the boilerplate of using Array.find()

octogonz commented 2 years ago

When I was fixing the duplicate attribute problem in the parser an idea popped into my head. I think we could improve the way we store attributes by using maps instead of arrays. From a developer perspective, if I want to access an attribute I'd usually already know the key for the value I want. Storing attributes in a map would allow developers to skip the boilerplate of using Array.find()

It's a helpful API, but I would suggest to expose it as a secondary API, something like DocXmlElement.tryGetXmlAttribute(name): XmlAttribute | undefined.

We could convert the main model to be xmlAttributes: Map<string, DocXmlAttribute>, but this has some downsides:

In other words, Map is the best way to implement lookups, but it is often not the best class to expose as a public API contract.
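The suggested shape could look like this sketch. The class and field names here are simplified stand-ins (the real DocXmlElement in this PR may differ); only `tryGetXmlAttribute` follows the name proposed above.

```typescript
// Hypothetical sketch of the suggested API: a Map used internally for
// fast lookups, while the public contract exposes an array snapshot
// plus a tryGetXmlAttribute() convenience method.
interface DocXmlAttributeModel {
  name: string;
  value: string;
}

class XmlElementModel {
  private readonly _attributesByName: Map<string, DocXmlAttributeModel>;

  public constructor(attributes: DocXmlAttributeModel[]) {
    // Later duplicates overwrite earlier ones; a real parser would
    // report duplicate attributes as an error instead.
    this._attributesByName = new Map(attributes.map((a) => [a.name, a]));
  }

  // Primary API: an immutable snapshot, not the Map itself.
  public get xmlAttributes(): readonly DocXmlAttributeModel[] {
    return [...this._attributesByName.values()];
  }

  // Secondary lookup API, as suggested above.
  public tryGetXmlAttribute(name: string): DocXmlAttributeModel | undefined {
    return this._attributesByName.get(name);
  }
}
```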

octogonz commented 2 years ago

(let us know when this PR is ready for another look)

octogonz commented 1 year ago

@suneettipirneni I greatly appreciate your patience with this long review. 😇🙏

I think this feature is ready to merge, except for a couple final questions:

1. Backwards compatibility

  1. In testing, I realized that the upgraded tsdoc-config library cannot read old tsdoc.json files. This means that if several different tools are in use, mixing tsdoc-config versions would make it impossible to parse the file. The tools would all need to be upgraded together. I think we can make the upgrade experience a little easier by relaxing the schema/loader to accept the old setting names. (Then we can remove this backwards compatibility later, for example when we publish the 1.0.0 version of the package.)

I will do this work and push it to your branch.

2. Nested content

The handling of nested HTML content has changed with this PR.

Old behavior: (screenshot)

New behavior: (screenshot)

The behavior makes sense, because the meaning of a custom tag is user-defined. The content may be suitable to render verbatim (<b>, <blockquote>) but maybe not (<style>, <meta>). A reasonable conservative policy is to not render the content unless the tag is recognized.

Nonetheless, the example in the above screenshots is probably counterintuitive for end users. I want to think a bit about how to handle this without adding nontrivial additional work to your PR. If you have any ideas, let me know.

@zelliott @adventure-yunfei @AlexJerabek @dzearing @chengcyber

octogonz commented 1 year ago

The handling of nested HTML content has changed with this PR.

๐Ÿค” Maybe what we need is a tsdoc.json setting that can specify:

"If your tool doesn't implement this XML element name, it's safe to process the element's content as if the element's tags were removed."

Here's a more complicated example:

/**
 * <notebox>
 *   This is some <b>sample text</b>.
 *   <table-of-contents>
 *     <nav-node-name>My API Item</nav-node-name>
 *     <category>ui-controls</category>
 *   </table-of-contents>
 * </notebox>
 */

...which ideally might be rendered as:


This is some sample text.


In this example, <notebox> and <b> can be safely ignored, whereas <table-of-contents>, <nav-node-name>, and <category> are data properties whose contents should NOT be rendered.

In tsdoc.json we would somehow define these element names and classify them as safe (e.g. contentContainer=true). And then our default settings could apply this classification to common HTML elements such as <a>, <b>, <i>, etc.
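One possible shape for such a setting is sketched below. This is entirely hypothetical: the `xmlElements` key and `contentContainer` flag do not exist in today's tsdoc.json schema; they merely illustrate the classification described above.

```json
{
  "$schema": "https://developer.microsoft.com/json-schemas/tsdoc/v0/tsdoc.schema.json",

  "xmlElements": {
    "notebox": { "contentContainer": true },
    "b": { "contentContainer": true },
    "table-of-contents": { "contentContainer": false },
    "nav-node-name": { "contentContainer": false },
    "category": { "contentContainer": false }
  }
}
```

A tool that doesn't recognize `notebox` could then safely render its children as if the tags were removed, while leaving `table-of-contents` and its data properties unrendered.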

zelliott commented 1 year ago

The behavior makes sense, because the meaning of a custom tag is user-defined. The content may be suitable to render verbatim (<b>, <blockquote>) but maybe not (<style>, <meta>). A reasonable conservative policy is to not render the content unless the tag is recognized.

Am I correct in understanding that this is the behavior of the TSDoc website (i.e. whatever's actually rendering TSDoc), not TSDoc? I'm under the impression that TSDoc doesn't contain any rendering logic itself. So if the above behavior is unintuitive, wouldn't that mean fixing how the TSDoc website handles certain XML elements?

With the tsdoc.json setting proposal, I think you're suggesting that TSDoc can help inform a consumer which XML tags can have their contents rendered. There can be reasonable default settings (i.e. common HTML elements), but it can also be configurable. I suppose this is useful if the author of the TSDoc comments and the consumer rendering them are different people... is this a common use case?

suneettipirneni commented 1 year ago

In tsdoc.json we would somehow define these element names and classify them as safe (e.g. contentContainer=true). And then our default settings could apply this classification to common HTML elements such as <a>, <b>, <i>, etc.

To some degree this option feels a bit out of place to me. My main dilemma is whether tsdoc should be making these distinctions or the renderers themselves. Renderers already make many implementation-specific decisions on the rendered representation of a given doc node. Shouldn't they also be responsible for making the distinction of what's a contentContainer node and what isn't?

I also fear this field may be confusing for consumers of TSDoc especially in cases where additional data like attributes play a key role. If an <a> element is marked as a contentContainer and my custom markdown emitter automatically outputs the plaintext of any content-container element I could inadvertently skip over the href attribute.

for (const node of nodes) {
    switch (node.kind) {
        case DocNodeKind.XmlElement: {
            const xmlNode = node as DocXmlElement;
            if (xmlNode.contentContainer) {
                // Logic to resolve the inner plain text(s) and convert them to a string ...
                const text = resolveInnerText(xmlNode); // (hypothetical helper)
                output.append(text);
                break;
            }

            // Otherwise render recognized XML elements
            break;
        }
    }
}

While this code ultimately represents a logic error by the programmer (you could argue that custom-rendered elements should be checked first), it may be caused by the somewhat plausible assumption that <a> is a contentContainer, since it only contains text as child elements. This could also become an issue if there's a lack of shared knowledge between the programmers writing the docs and the programmers creating emitters.

As for the playground, I think having it render common XML tags would be ideal. Or it could do what Monaco does and render any XML inner text as plain text, with the exception of elements like <style>.

octogonz commented 1 year ago

Am I correct in understanding that this is the behavior of the TSDoc website (i.e. whatever's actually rendering TSDoc), not TSDoc?

Yes. Here are some examples of various possible "renderers" of TSDoc:

Importantly, multiple different tools may need to process the same code comment, which is why interoperability between tools is a key requirement for TSDoc.

I'm under the impression that TSDoc doesn't contain any rendering logic itself.

Yes. But TSDoc does provide a config file that allows users to define custom tags (both JSDoc tags and XML tags). This configuration enables the parser to report syntax errors about undefined tags, and (for JSDoc tags at least) it distinguishes "undefined" tags versus "unsupported" tags:

TSDoc's own standard tags are all "defined" by default, whereas most tools will only "support" a subset of them.

My main dilemma is whether tsdoc should be making these distinctions or the renderers themselves. Renderers already make many implementation-specific decisions on the representation of a given doc node. Shouldn't they also be responsible for making the distinction of what's a contentContainer node and what isn't?

Well, the original question was:

"If a tool doesn't implement an XML element, what should it do?"

The simple, safe answer is to ignore the element and all of its content.

But suppose I invent a custom tag <glossary-word> and use it to mark up sentences like This is a <glossary-word>book</glossary-word>. If a tool doesn't support this tag, deleting the content would be a bad experience. Do we really need to release new versions of every tool to ensure that <glossary-word>'s content can be rendered safely?