validator / htmlparser

The Validator.nu HTML parser https://about.validator.nu/htmlparser/
Other
56 stars 26 forks source link

Conform ambiguous-ampersand reporting to HTML spec #30

Closed sideshowbarker closed 3 years ago

sideshowbarker commented 4 years ago

This is another replacement change for the previous ambiguous-ampersand attempts. Unlike the previous attempts, I’ve actually run this one against the html5lib entity tests, and confirmed that it passes all.

Specifically, I ran the following:

java nu.validator.htmlparser.test.TokenizerTester ./html5lib-tests/tokenizer/entities.test
java nu.validator.htmlparser.test.TokenizerTester ./html5lib-tests/tokenizer/namedEntities.test

java nu.validator.htmlparser.test.TreeTester ./html5lib-tests/tree-construction/entities01.dat
java nu.validator.htmlparser.test.TreeTester ./html5lib-tests/tree-construction/entities02.dat

I also manually confirmed that this passes the failing tests from the trybot run at https://treeherder.mozilla.org/#/jobs?repo=try&revision=439fd78375fe4d83a5999e2e0c9d30075da4efb3&selectedTaskRun=dKhZOkS3SAqKuh4HMwuHfw.0 (which I think are actually just a subset of the html5lib tests).

So I think I can now assert with confidence that the patch in this PR:

hsivonen commented 4 years ago

Thanks! I triggered a new try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=216de62190b7d034640d584bfcf40c5a738b7da3

hsivonen commented 4 years ago

Still fails the entities02 test from html5lib.

hsivonen commented 4 years ago

Hmm. entities02 works in the Java-only case. I'll try to figure out what's going on.

sideshowbarker commented 4 years ago

Hmm. entities02 works in the Java-only case.

Right — I also can’t reproduce it with the Java parser

I'll try to figure out what's going on.

I know you can see the same thing at https://treeherder.mozilla.org/#/jobs?repo=try&revision=216de62190b7d034640d584bfcf40c5a738b7da3&selectedTaskRun=IiLxYbm0RKmR3wPSEDAE9A.0 which I’m looking at now — but for the record here, I see four failures:

expected "#document\n| <html>\n| <head>\n| <body>\n| <div>\n| bar=\"ZZ&gt0YY\""
    but got "#document\n| <html>\n| <head>\n| <body>\n| <div>\n| bar=\"ZZ&gt\ufffdYY\""
expected "#document\n| <html>\n| <head>\n| <body>\n| <div>\n| bar=\"ZZ&gt9YY\""
    but got "#document\n| <html>\n| <head>\n| <body>\n| <div>\n| bar=\"ZZ&gt\ufffdYY\""
expected "#document\n| <html>\n| <head>\n| <body>\n| <div>\n| bar=\"ZZ&gtaYY\""
    but got "#document\n| <html>\n| <head>\n| <body>\n| <div>\n| bar=\"ZZ&gt\ufffdYY\""
expected "#document\n| <html>\n| <head>\n| <body>\n| <div>\n| bar=\"ZZ&gtZYY\""
    but got "#document\n| <html>\n| <head>\n| <body>\n| <div>\n| bar=\"ZZ&gt\ufffdYY\""

So they have a common problem: An unexpected U+fffd is getting added to the DOM after the &gt.

hsivonen commented 4 years ago

This happens when the tokenizer is fed one UTF-16 code unit at a time. That's why the Java harness doesn't catch this.

When there is a buffer boundary between &gt0 and YY, the ambiguous ampersand state drops the current input character (digit zero) on the floor such that it doesn't get restored in the variable c when tokenization of the next buffer continues.

sideshowbarker commented 4 years ago

When there is a buffer boundary between &gt0 and YY, the ambiguous ampersand state drops the current input character (digit zero) on the floor such that it doesn't get restored in the variable c when tokenization of the next buffer continues.

So do you have a change in mind that would fix that? Or will it instead require more rewriting?

This happens when the tokenizer is fed one UTF-16 code unit at a time. That's why the Java harness doesn't catch this.

It seems like — for testing purposes going forward — it’d be good if we could replicate that one-UTF-16-code-unit-at-a-time behavior in the Java harness.

sideshowbarker commented 4 years ago

This happens when the tokenizer is fed one UTF-16 code unit at a time. That's why the Java harness doesn't catch this.

OK, I hacked https://github.com/validator/htmlparser/blob/master/src/nu/validator/htmlparser/io/Driver.java#L274 to manually force it, so that I could replicate that condition.

And with that in place I can, with the Java parser, reproduce the same test failures reported from the try run.

So at least now I’ve got a way to confirm whether any changes I test have actually fixed the problem.

sideshowbarker commented 4 years ago

When there is a buffer boundary between &gt0 and YY, the ambiguous ampersand state drops the current input character (digit zero) on the floor such that it doesn't get restored in the variable c when tokenization of the next buffer continues.

OK — I’ll try to see what rewrite I can do to avoid that.

But outside of this particular fix, I would also like to figure out if there’s some kind of guard we could add that would prevent me from making this same kind of mistake again — an assert I guess? (Though at this point I don’t have any idea what we’d need to assert to avoid this.)

It just seems like we’d really want to avoid it being possible to write some code that will behave in one way when we run it with the Java parser, but then behave in a different way when it’s run within Firefox.

sideshowbarker commented 4 years ago

A further comment just FYI and FWIW: Even before finding out this code had caused passed-with-the-Java-parser test cases to fail in Firefox I was worried there might be some assumptions/invariants I was breaking with how I’d written it. There’s just code smell that stands out with it. I realized even when writing it that it breaks significantly from the normal patterns that are used in multiple places in the existing code. But I wrote it that way because it was the only way I understood to get it to work.

However, I’ll admit that I yet don’t understand well the overall part of the existing code where this needs to be fit in — I mean the CONSUME_CHARACTER_REFERENCE, CHARACTER_REFERENCE_HILO_LOOKUP, and CHARACTER_REFERENCE_TAIL states. Among the aspects that lend to my confusion is the fact that while the spec defines only a single state from which to transfer into the AMBIGUOUS_AMPERSAND state, our current implementation ends up requiring us to manage four different places where we need to handle transitioning into the AMBIGUOUS_AMPERSAND state.

I understand, at high level, why the code takes a different — and more complicated — approach than the processing steps described in the spec. But because of that spec-vs-code difference in approach, I’ve found it a bit challenging to figure out how to reconcile the code with the spec description well enough to be able squeeze this patch in there and make it work.

sideshowbarker commented 4 years ago

When there is a buffer boundary between &gt0 and YY, the ambiguous ampersand state drops the current input character (digit zero) on the floor such that it doesn't get restored in the variable c when tokenization of the next buffer continues.

OK, thanks for pointing that out. I can see clearly now why what I’d written would break for that across-buffer-boundary case.

So in aa9a905 I reordered the handling to fix that. And that new change does fix all the cases that had failed during the try run at https://treeherder.mozilla.org/#/jobs?repo=try&revision=216de62190b7d034640d584bfcf40c5a738b7da3&selectedTaskRun=IiLxYbm0RKmR3wPSEDAE9A.0 — that is, all the cases in html5lib-tests entities02 that had failed. So that’s great.

But what’s not great is that it unfortunately seems to have now caused new regressions in the following test cases:

Well, to be more precise: Those tests still pass if I keep the tokenizer buffer size set to the current default (2048) — but with aa9a905, they now fail when I test with the tokenizer buffer size set to 1.

And to rule out the (slim) possibility that those test failures might be caused by our Java test-harness code, I also tested with the nu.validator.servlet.ParseTreePrinter code — and got the same results. So they’re clearly not due to a harness bug.

So at this point I don’t understand why those tests would be failing now — and why only when testing with the tokenizer buffer size set to 1 — but because those look like the kind of failures that happen if we’re not reconsuming somewhere where we should be, I tried changing the code such that we always do reconsume=true at the end of ampersandloop when transitioning back to returnState (for the case where the character being checked is something other than an alphanumeric or ;)…

…and sure enough, doing reconsume=true there causes those three test cases above to pass again.

Unfortunately, though, doing reconsume=true there causes a regression of all the entities02 tests that had failed during the try run. So that would kind of put things back to where we started…

So anyway, I’m now stuck again, at trying to figure out to fix those new regressions.

And in the meantime, it seems like if you were to do a try run with the updated patch, it’d end up failing those same cases. So I guess you shouldn’t bother doing a try run with it — unless you think there’s some possibility that those failures could be caused by a bug somewhere else in the Java code that wouldn’t be a problem for the translated code. (Which if so would still be a bug that we’d need to fix of course — but it wouldn’t need to prevent this patch from being landed.)

sideshowbarker commented 4 years ago

OK, I pushed 5b720a9 to this branch, and that fixes all the failing tree-construction tests from my previous comment, without introducing any regressions.

So I think the current state of the code in this branch is now sufficient for another try run.

If it works better for you to have single commit to trigger the try run with (rather than the 3 commits now in the branch), please let me know, and I can squash the commits down to one commit. Or even feel free to do that yourself.


I arrived at the 5b720a9 change after stepping through the rest of the current code and realizing that when leaving the AMBIGUOUS_AMPERSAND state, we do actually want to reconsume if we’re not within an attribute value — but we don’t want to reconsume if we’re within an attribute value.

Oddly however, even with the 5b720a9 change, I find that with the tokenizer buffer size set to 1, we actually have some tests in html5lib-tests tokenizer/entities.test that are failing:

Given that those tests are essentially functionally identical to tests in html5lib-tests tree-construction that are passing, I don’t understand why those tests would be failing — especially in the way they are.

Those tokenizer-test failures look like the kind of failures that happen if we’re not reconsuming somewhere where we should be — but with the 5b720a9 change, the code now is handling the reconsuming correctly that it had not been previously (as attested to by the failed tree-construction tests now passing). So I don’t understand what might be causing those failures.

Additionally, I tested with the nu.validator.servlet.ParseTreePrinter code — with the tokenizer buffer size set to 1 — and I (so far) have not been able to reproduce the failures with that.

So given that:

…I’m thinking it’s possible the failures may be due to some quirk in the nu.validator.htmlparser.test.TokenizerTester code.

So that’s why I think it’d now be useful to do a new try run. If that try run ends up hitting the same failures, then we know I still have a real problem to fix in this patch for the parser code. But if the try run doesn’t hit the same failures, then I guess that would indicate a problem in the nu.validator.htmlparser.test.TokenizerTester code to plan to investigate.

sideshowbarker commented 4 years ago

OK, I’ve now pushed 9814070 to this branch, and with that, the patch in this branch now passes all tests — including the tokenizer tests I mentioned in my previous comment.

So this patch is definitely now ready for another try run.

So please take a look — and I can squash the commits down to one commit after you review. For now, I’ve just kept the changes as a series commits, unsquashed, so that you can look at the incremental changes if you prefer.


I arrived at the 9814070 change after stepping back and just thinking more about how this state fits with the rest of the existing code — and in doing that, it finally dawned on me that we never want to stay in this AMBIGUOUS_AMPERSAND state (that is, in our as-implemented version of that state) and loop. Instead, we always want to transition back to returnState.

(I now actually remember that, a few iterations back, I had already realized that kind of had to be the case — but at that time I somehow neglected to take time to try rewriting it that way. I guess I was tripped up by having assumed it needed to loop.)

Anyway, so 9814070 sets the state to returnState at the beginning, and drops the looping — and all now works as expected.

I also pushed one more change, 16878b6, which just adds a comment that restates what I wrote above, and is otherwise a whitespace-only change (to re-ident the code after having dropped the ampersandloop loop that had been wrapping it.

hsivonen commented 3 years ago

Thanks. I now see a Firefox crash with the patch if I view source for:

<!DOCTYPE html>
<a href='http://example.org/demo?id=hello&copy=1&world=fun'>Link</a>
<p>Text: http://example.org/demo?id=hello&copy=1&world=fun</p>

I haven't yet figured out what goes wrong, but I intend to find out next.

hsivonen commented 3 years ago

Fortunately, the crash isn't in the portable Java code but in the assertions of the C++ View Source syntax highlighter. This patch causes a state transition that the highlighter logic doesn't anticipate. I expect to fix this in the highlighter.

hsivonen commented 3 years ago

Currently, I'm trying to understand if this formulation, which would fit the general structure of the tokenizer better, is equivalent:

                case AMBIGUOUS_AMPERSAND:
                    /*
                     * Unlike the definition is the spec, we don't consume the
                     * next input character when entering this state;
                     * that's because our current implementation differs from
                     * the spec in that we've already consumed the relevant
                     * character *before* entering this state.
                     * Also, our implementation of this state has no looping.
                     * So we never stay in this state; instead, we always
                     * transition out from it back to returnState.
                     */
                    if (c == ';') {
                        errNoNamedCharacterMatch();
                    } else if ((c >= '0' && c <= '9')
                            || (c >= 'A' && c <= 'Z')
                            || (c >= 'a' && c <= 'z')) {
                        appendCharRefBuf(c);
                        emitOrAppendCharRefBuf(returnState);
                        if ((returnState & DATA_AND_RCDATA_MASK) == 0) {
                            cstart = pos + 1;
                        }
                        reconsume = false;
                    }
                    state = transition(state, returnState, reconsume, pos);
                    continue stateloop;
hsivonen commented 3 years ago

I tried to understand how the AMBIGUOUS_AMPERSAND state worked, so I started simplifying the parts that looked like they could be simplified. Then, the only thing left was reporting the error, if c == ';'. Then I inlined that back to the states that transition to AMBIGUOUS_AMPERSAND.

The result does only two things:

  1. Changes error reporting (for sure doesn't break the DOM)
  2. Sends the semicolon like less-than in CONSUME_CHARACTER_REFERENCE.

See #50 for the result. Did I break anything?

sideshowbarker commented 3 years ago

See #50 for the result. Did I break anything?

Nothing broken. mvn test shows Failures: 0, Errors: 0, Skipped: 0, and integration with the HTML checker is working expected.

So, much thanks — I’m going ahead and closing this.

nschonni commented 1 year ago

Maybe naive, but would the ampersand check really only apply in phrasing content, and not in other places? That would ignore some attribute values though, where I think mainly the edge case this introduced was around query stings separating ampersands in attribute URL values

sideshowbarker commented 1 year ago

@nschonni Thanks for the suggestion and nudge. I now actually have a fix for this already. And it’s not complicated at all: It is really just, reverting https://github.com/validator/htmlparser/commit/9814070b66f5d48f7798f97820fac14d36950243. And that’s it.

That said, reverting https://github.com/validator/htmlparser/commit/9814070b66f5d48f7798f97820fac14d36950243 only fixes the Java version of the parser, which is what the HTML checker uses. It does not fix the parser used in Gecko/Firefox — that is to say, the C++ version of the parser, the source of which is generated from the Java sources.

To fix that C++ code, we’d need to go to back to where we were before https://github.com/validator/htmlparser/commit/9814070b66f5d48f7798f97820fac14d36950243 can start over with trying to fix it in way that doesn’t also regress the error-reporting behavior in the way I (unintentionally) regressed it with my https://github.com/validator/htmlparser/commit/9814070b66f5d48f7798f97820fac14d36950243 change.

But we don’t need to block on having the C++ code fixed. The main branch can continue to use the known-to-otherwise-be-working-as-expected code that it already has. And for the HTML checker, I can revert https://github.com/validator/htmlparser/commit/9814070b66f5d48f7798f97820fac14d36950243 on the validator-nu branch, and we can (re)switch back again to having the HTML checker be built from the validator-nu branch.

I aim to push the revert to the validator-nu by some time tomorrow.