inikulin / parse5

HTML parsing/serialization toolset for Node.js. WHATWG HTML Living Standard (aka HTML5)-compliant.
MIT License
3.64k stars 232 forks source link

`TreeAdapter` interface types are unsound #1253

Open kwshi opened 2 weeks ago

kwshi commented 2 weeks ago

summary

The current API allows customizing parsing behavior using the TreeAdapter and TreeAdapterTypeMap interfaces, which I think is very cool. However the type declarations on these interfaces are overly relaxed; in particular, the parsing logic implicitly expects certain node types to be related to each other (e.g., every element is a parentNode), but these relationships are not at all enforced by the TreeAdapter type declarations. Consequently, the types of arguments passed by the parse5 internal parsing code to the custom TreeAdapter may have completely inconsistent types with those expected by the TreeAdapter's own type declarations.

extreme example

To illustrate my point with an extreme example, consider the following, which is semantically not a faithful/good tree adapter but nevertheless type-checks without error (on parse5 version 7.1.2, TS 5.4.5). In particular, each "node type" is just a literal string type.

import * as P5 from "parse5";

const minimalAdapter: P5.TreeAdapter<{
  childNode: "child";
  commentNode: "comment";
  document: "document";
  documentFragment: "fragment";
  documentType: "doctype";
  element: `element/${string}`;
  node: "comment" | "text" | "doctype" | `element/${string}`;
  parentNode: "parent";
  template: "template";
  textNode: "text";
}> = {
  adoptAttributes() {},
  appendChild() {},
  createCommentNode() {
    return "comment";
  },
  createDocument() {
    return "document";
  },
  createDocumentFragment() {
    return "fragment";
  },
  createElement(tag) {
    console.log("create element", tag);
    return `element/${tag}`;
  },
  // `createTextNode` is part of the interface in parse5 master branch
  // but not yet in 7.1.2 release, which is what i'm testing on
  //   createTextNode(value: string){ return "text"; },
  detachNode() {},
  getAttrList() {
    return [];
  },
  getChildNodes() {
    return [];
  },
  getCommentNodeContent() {
    return "(commentContent)";
  },
  getDocumentMode() {
    return P5.html.DOCUMENT_MODE.NO_QUIRKS;
  },
  getDocumentTypeNodeName() {
    return "(doctypeName)";
  },
  getDocumentTypeNodePublicId() {
    return "(docTypePublicId)";
  },
  getDocumentTypeNodeSystemId() {
    return "(docTypeSystemId)";
  },
  getFirstChild() {
    return null;
  },
  getNamespaceURI() {
    return P5.html.NS.HTML;
  },
  getNodeSourceCodeLocation() {
    return null;
  },
  getParentNode() {
    return null;
  },
  getTagName() {
    return "(tagName)";
  },
  getTemplateContent() {
    return "fragment";
  },
  getTextNodeContent() {
    return "(textContent)";
  },
  insertBefore() {},
  insertText(parent: "parent", text) {
    console.log("insert text:", { parent, text });
  },
  insertTextBefore() {},
  isCommentNode(node): node is "comment" {
    return node === "comment";
  },
  isDocumentTypeNode(node): node is "doctype" {
    return node === "doctype";
  },
  isElementNode(node): node is `element/${string}` {
    return node.startsWith("element/");
  },
  isTextNode(node): node is "text" {
    return node === "text";
  },
  setDocumentMode() {},
  setDocumentType() {},
  setNodeSourceCodeLocation() {},
  setTemplateContent() {},
  updateNodeSourceCodeLocation() {},
};

P5.parseFragment("foobar", { treeAdapter: minimalAdapter });

extreme example explanation

Pay attention specifically to:

However, running the code results in the following output:

insert text: { parent: 'element/html', text: 'foobar' }

which is inconsistent with the type declarations, even though there is no type-checking error. This happens because the actual way that this parsing behind the scenes looks something like this (not exactly; I'm omitting some details to keep this discussion a little simpler, but I'm pretty sure the basic idea is right):

The problem here is that createElement produces a node of type element, while insertText expects a node of type parentNode, but there is no requirement that all elements are actually parentNodes (in technical lingo, element needs to be a subtype of a.k.a. extend parentNode), and indeed, that is the requirement we constructed this example to violate.

related problems

The example above showcases just one of many unspoken type requirements. There are probably many, but here are just a few more I can think of:

who cares?

You could argue that my example is pathological and unrealistic, and that's true. Indeed, most of these unspoken constraints are satisfied just fine by the default, built-in tree adapter, so for most people these type unsoundness errors are moot. Still, I think that if we're going to go to these lengths (custom tree adapters & adapter types) to make parse5 customizable anyway, then we might as well do it right, and currently the types are not right.

how do we fix it?

I'm not totally sure! Some of the issues are partially addressed by adding extends constraints to ensure that, e.g., Element extends ParentNode. However, adding these constraints to TreeAdapterTypeMap's type parameters is insufficient because this way they can be bypassed simply by directly constructing the type map via a literal, which is already what my example above does. So to really enforce these constraints they have to occur directly on TreeAdapter itself, which would require a very annoying rewrite that lifts all the node type parameters directly onto the TreeAdaptertype itself. That would make the code a lot more verbose and break backwards-compatibility.

On top of that, it also doesn't address the issue of templates, which arise from technically unsound use of createElement to create a template node type, given specific input arguments. Simply constraining Element extends Template can address the type errors but doesn't make sense and conceptually isn't what we want; the more correct fix would be instead to have a completely separate createTemplateNode method and use that in the parsing logic instead when constructing templates. Again, API-breaking change.

Anyway, I don't have a simple fix that doesn't break stability in lots of ways, but I'm raising this issue anyway for discussion because there is a real typing issue, and I do care a lot about types. If we ultimately don't implement/enforce these type constraints/requirements in TypeScript because of stability or other technical reasons, I think they deserve at least a mention in the documentation so other folks trying to write their own TreeAdapter know what compatibility requirements they need in order to avoid running into unexpected type errors. :)

43081j commented 1 week ago

The types aren't quite right anyway, outside of all this. The concept of a type map has some problems and weakens types in some situations

A parent node is meant to be a union of all the possible types which have parents. That shouldn't really change but we could possibly introduce a base type (like WithParent)

Meanwhile, template inherits element currently (and probably should)

Ultimately I think the actual fix one day is to make tree adapters less dynamic/flexible and require them to implement certain fundamental types

Id be open to contributions here as I won't have much time over the next couple of months