jsdom / whatwg-url

An implementation of the WHATWG URL Standard in JavaScript
https://jsdom.github.io/whatwg-url/
MIT License
371 stars 94 forks source link

Track specific validation errors #260

Open karwa opened 1 year ago

karwa commented 1 year ago

From the table, I've only done the "URL parsing" section, not the IDNA or host parsing sections.

Having this in the reference implementation has already helped me confirm a couple of issues with the validation errors reported by the standard (bugs and PRs to WHATWG/url incoming). It can also be useful for other implementations to check the errors they report. Hopefully we can add expected validation errors to the WPT test suite in due course, as well.

That said, these results are not exposed to the live viewer yet. I've hacked something together locally, but I'm not sure of the cleanest way to thread that through to the viewer.

karwa commented 1 year ago

Related to #156 and #61

As for performance: we could use a bit-mask rather than an array of strings (since the standard only has a handful of unique validation errors). The cost is that we would lose some information, namely:

It would also be more limiting, as we would be unable to store associated information with each error (e.g. cursor position). But, in return, we would be able to store all errors that were encountered in a fixed amount of memory.

Another alternative would be for the parser to accept a validation error callback closure. Callers could supply a closure which collects all errors in an array, or which ignores them entirely, and the overhead in the parser should not be worse than any other virtual function call; the parser would just call the user-supplied function saying "hey, this error happened". Importantly, the caller would provide the memory to store errors, if they indeed want to store them.

karwa commented 1 year ago

I've implemented the closure approach on a branch (https://github.com/karwa/whatwg-url/commits/validation-error-callback) and hooked it up roughly to the live viewer.

The constructor for the URL type is generated from IDL or something and I'm not sure how to go around it, so you'll need to forward that argument manually by editing the generated constructor in lib/URL.js (line 115) and adding:

{
  let curArg = arguments[2];
  if (curArg !== undefined && typeof curArg === "function") {
    args.push(curArg);
  }
}

But yeah, it illustrates the approach. Since it's a callback function we can also add supplementary information about the error such as the pointer position, without having to worry about the memory overhead of storing all this stuff.

domenic commented 1 year ago

This is exciting work!

I think we only want to expose this capability through the low-level URL Standard API, not the new URL() API. That should bypass any worries about conflicting with the generated code. It does mean the live viewer would need to switch which API it uses, but that should be fine.

Other things we'd need to do before merging this:

It may also be worth benchmarking the results of a no-op default onValidationError, versus a null/undefined default onValidationError that is always called with this.onValidationError.?(...).

domenic commented 1 year ago

I also like the onValidationError approach because we could avoid the performance cost of #59 by doing something like

if (this.onValidationError && !isURLCodePoint(c) && c !== p("%")) {
   this.onValidationError("invalid-URL-unit");
}

instead of just

if (!isURLCodePoint(c) && c !== p("%")) {
   this.onValidationError.?("invalid-URL-unit");
}

There are a few other instances of that pattern as well that might be worth optimizing out, e.g.

    if (c === p("%") &&
        (!infra.isASCIIHex(this.input[this.pointer + 1]) ||
         !infra.isASCIIHex(this.input[this.pointer + 2]))) {
      this.onValidationError("invalid-URL-unit");
    }

could remove collapse the three checks down into one.