tc39 / test262

Official ECMAScript Conformance Test Suite

Discussion thread about how to structure tests that rely on the ordering of various steps #3421

Open ptomato opened 2 years ago

ptomato commented 2 years ago

This was raised in the last test262 maintainers meeting by @gibson042. We agreed to start a thread about it on GitHub since in the maintainers meeting we are focusing on other topics first.

Richard pointed out that due to the "one test per file" convention that consumers of test262 expect, it's difficult (and costly in terms of developer time) to test a sequence of steps, each of which may result in an abrupt completion. The example Richard gave was https://github.com/tc39/test262/pull/2746

To distill my understanding of the problem into a toy example (Richard, correct me if I got it wrong) — let's assume you have spec text like this:

Foo.bar ( a, b, c )

The following steps are performed:

  1. Let one be ? ToNumber(a).
  2. Let two be ? ToNumber(b).
  3. Let three be ? ToString(c).
  4. If ℝ(one) > ℝ(two), throw a RangeError exception.

The problem is to test not only that all the appropriate exceptions are thrown in each of steps 1 through 4, but also that the correct one is thrown in the case where multiple conditions are present that would cause any of these steps to throw.
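As a minimal sketch (using the test262 harness's Test262Error and assert.throws, and the toy Foo.bar above), such an ordering test arranges for several steps to be able to throw and asserts that the error from step 1 wins:

// Both _a_ and _b_ would throw, but step 1 (ToNumber(a)) runs first,
// so its error must be the one observed.
const a = { valueOf() { throw new Test262Error(); } };
const b = { valueOf() { throw new RangeError('unreachable if the order is correct'); } };
const c = { toString() { throw new RangeError('unreachable if the order is correct'); } };

assert.throws(Test262Error, () => Foo.bar(a, b, c));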

Questions that I can think of:

- How high a priority is testing this ordering?
- Is coverage considered incomplete if each abrupt completion is tested individually, but not the order in which they occur?

ljharb commented 2 years ago

The priority is very high imo, and yes, coverage is incomplete unless the order of all observable exceptions is covered.

leobalter commented 2 years ago

One thing I've always wanted, and that would help unblock this issue, is adapting the assertions and test262 runners to not fail fast on the first failure.

Today, all assertions are designed to throw an exception on the first failure, as @ptomato pointed out.

@gibson042 is familiar with QUnit's assertion library, in which failures are only registered rather than thrown.

There are some challenges to resolving this: most runners capture limited output from the engines, and the thrown assertion output is often the signal used to determine whether a test file is passing. Some tests need to run in the top-level scope (the suite is meant for a language, not a library), which means grouping might be limited mostly to built-ins. Tests for property aspects might still need to be separated, and grouping tests with async behavior might require some extra manual coordination.

Tests should remain atomic, meaning they should not have effects on other tests. File separation makes this easy to guarantee, since each file is supposed to run with a clean engine state. Grouping tests would require some care to avoid side effects that could affect results.

All of this is possible to resolve and work around.

That said, I think execution order should never be a concern. It helps more if a runner can count how many assertions are present and how many pass or fail. The assertions themselves are responsible for capturing that results happen in the expected spec (not test) order.

Cons aside, a great pro of grouped tests is having some in-file coverage checkpoints.

I hope this works for everyone! I've always wanted this to happen, but I've gotten pushback for many reasons, while also realizing it would take me some extra time to complete.

ljharb commented 2 years ago

Tests should remain atomic, meaning they should not have effects on other tests.

This is not currently the case, since tests don't clean up after themselves. I'd love it to be the case, though.

fwiw, I've found that test runners that do not throw exceptions on failures are much easier to work with, precisely because one failure doesn't prevent you from receiving further failure output.

leobalter commented 2 years ago

Tests should remain atomic, meaning they should not have effects on other tests.

This is not currently the case, since tests don't clean up after themselves. I'd love it to be the case, though.

Tests don't clean up and should not clean up. The runner does the clean up in the most aggressive way: reloads the whole engine.

Cleanup within a test might be somewhat dangerous, such as restoring a value to a property that is being tested. It seems we still agree that tests need to be atomic, even if we disagree about whether they currently are.

fwiw, I've found that test runners that do not throw exceptions on failures are much easier to work with, precisely because one failure doesn't prevent you from receiving further failure output.

I 100% agree, for the reason you pointed out, and I'm admittedly biased from working for so long with QUnit and libraries using it. I believe @gibson042 will have the same opinion here.

ljharb commented 2 years ago

"should not clean up" is the position of test262, but it's not the way tests work literally anywhere else. Tests should clean up after themselves in every ecosystem and language and test framework I'm familiar with.

leobalter commented 2 years ago

My opinion does not represent the position of Test262; I'm also not setting any blockers here.

rwaldron commented 2 years ago

fwiw, I've found that test runners that do not throw exceptions on failures are much easier to work with, precisely because one failure doesn't prevent you from receiving further failure output.

A given Test262 runner doesn't have to fail the way you describe here. Runners shell out to engines and capture stdout/stderr; what a runner does with the captured results afterward is entirely up to its author. Tests have to be run this way because all test code must run in a fresh global context, and shelling out to engines is the easiest (and only universal) way to do that, since not all engines ship mechanisms to create and destroy top-level global objects from within the execution environment itself.
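A minimal sketch of that shell-out pattern (assuming Node.js as the runner host; a real runner like eshost handles far more engine-specific detail, and the paths below are illustrative):

// Each invocation spawns a fresh engine process, i.e. a clean global state.
const { execFile } = require('node:child_process');

function runTestFile(enginePath, testFile, callback) {
  execFile(enginePath, [testFile], (error, stdout, stderr) => {
    // A nonzero exit status or output on stderr (e.g. from a thrown
    // Test262Error) is the signal a runner typically treats as failure;
    // what to do with it afterward is up to the runner author.
    callback({ passed: !error && stderr === '', stdout, stderr });
  });
}

// Hypothetical engine binary and test path, for illustration only.
runTestFile('/usr/local/bin/some-engine', 'test/built-ins/Foo/bar/order.js',
  (result) => console.log(result.passed ? 'PASS' : 'FAIL'));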

ljharb commented 2 years ago

@rwaldron thanks, that's helpful context. can you elaborate on which engines do not ship a mechanism for that?

rwaldron commented 2 years ago

@ljharb since it's not a normally specified mechanism with an exposed API surface, it could very well be all of them. Of all the engines I've written support for in eshost, only Hermes has no mechanism at all, i.e. it cannot create a realm, cannot create a script load object, cannot create a global object, and cannot evaluate script via some non-eval mechanism. The rest have varying support (some subset of the capabilities I just listed) that I work around to make eshost function correctly across all engines.

For what it's worth, I'm currently writing test generators that use Test262 material as a source, then consolidating the desired tests into a single file, each wrapped in a sandboxed invocation (which is effectively the same as shelling out to an engine, since a fresh sandbox is created each time), and it's been very effective.
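A sketch of what such a sandboxed invocation could look like with Node's vm module (an assumption about the mechanism; the actual generators may differ):

// Each test body runs against its own global object, approximating the
// fresh realm that shelling out to an engine would provide.
const vm = require('node:vm');

function runInFreshSandbox(testSource) {
  const sandbox = vm.createContext({});
  vm.runInContext(testSource, sandbox);
}

runInFreshSandbox('var x = 1;'); // `x` does not leak into other sandboxes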

leobalter commented 2 years ago

I was re-reading the thread here and there is a thing bugging me:

Richard pointed out that due to the "one test per file" convention that consumers of test262 expect, it's difficult (and costly in terms of developer time) to test a sequence of steps

...

The problem is to test not only that all the appropriate exceptions are thrown in each of steps 1 through 4, but also that the correct one is thrown in the case where multiple conditions are present that would cause any of these steps to throw.

This seems wrong. Many tests already observe this today, and grouping many tests into one file neither helps with nor prevents the challenge of asserting the correct sequence of steps.

For a quick example, if I want to assert the order of the first two steps:

1. Let one be ? ToNumber(a).
2. Let two be ? ToNumber(b).

let called = 0;
const a = { valueOf() { called += 1; return 42; } };
const b = { valueOf() { throw new Test262Error(); } };
const c = { valueOf() { throw 'something else'; } };

assert.throws(Test262Error, () => Foo.bar(a, b, c));
assert.sameValue(called, 1);

This test by itself asserts an abrupt completion ending in step 2, happening after step 1. This is already an established practice in Test262.

I wonder how exactly it benefits from being grouped into the same test file as an assertion that tests a different exit of Foo.bar?

let called = 0;
const a = { valueOf() { throw new Test262Error(); } };
const b = { valueOf() { called += 1; throw 'something else'; } };
const c = { valueOf() { called += 1; throw 'something else'; } };

assert.throws(Test262Error, () => Foo.bar(a, b, c));
assert.sameValue(called, 0);

This new test asserts an abrupt completion from the first step, and additionally asserts that no other ToNumber conversions are called.

Once again, they are independent.

There is a benefit of grouping tests in a single file, but I don't see any relating to verifying the order of spec steps.

ljharb commented 2 years ago

@rwaldron Isn't it just delete for globals? For realms and tests covering the global object directly etc, that is a fair point, but it seems like the majority of tests don't need that.

rwaldron commented 2 years ago

@ljharb tests regularly "monkey patch" parts of built-ins to catch specific observable behaviors occurring. A really good example of where that happens is in the Promise tests.
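A sketch of that monkey-patching shape (illustrative, not a specific test262 file):

// Record invocations of the built-in `then` while exercising an API
// that is specified to call it internally.
const originalThen = Promise.prototype.then;
let thenCallCount = 0;
Promise.prototype.then = function (...args) {
  thenCallCount += 1;
  return originalThen.apply(this, args);
};

// ...run the code under test, then assert on thenCallCount.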

ljharb commented 2 years ago

@rwaldron sure, but I'm not aware of any engine that fails to permit such monkey patching, and the ability to monkey patch is (typically) also the ability to reverse the monkey patches (although, to be fair, making things non-configurable wouldn't be reversible). I frequently do this sort of thing in tests, and test/mocking frameworks all tend to have an "after" or "cleanup" mechanism to queue up the restoration code.
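A framework-agnostic sketch of that queue-up-the-restoration pattern (names are illustrative):

// Patch a property and remember how to undo it; an "after"/"cleanup"
// hook would drain this queue between tests.
const cleanups = [];

function patch(obj, key, replacement) {
  const original = obj[key];
  obj[key] = replacement;
  cleanups.push(() => { obj[key] = original; });
}

function runCleanups() {
  while (cleanups.length) cleanups.pop()();
}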

rwaldron commented 2 years ago

@ptomato I have to respectfully disagree with your claim in the original post. Here are 5 tests that exercise steps 1, 2, 3, and 4, where step 4 is tested in two ways:

/* 

Foo.bar ( a, b, c )
The following steps are performed:

Let one be ? ToNumber(a).
Let two be ? ToNumber(b).
Let three be ? ToString(c).
If ℝ(one) > ℝ(two), throw a RangeError exception.

*/

// Given...
const Foo = {
  bar(a, b, c) {
    let one = Number(a);
    let two = Number(b);
    let three = String(c);
    if (one > two) {
      throw new RangeError();
    }
  }
};

// Test 1
{
  let valueOfCount = 0;
  const badPathError = new Error('This should never be reached.');
  const a = {
    toString() {
      throw badPathError;
    }, 
    valueOf() {
      valueOfCount++;
      throw new Test262Error('This is the only path to reach');
    }
  }
  const b = {
    toString() {
      throw badPathError;
    }, 
    valueOf() {
      throw badPathError;
    }
  }
  const c = {
    toString() {
      throw badPathError;
    }, 
    valueOf() {
      throw badPathError;
    }
  }
  assert.throws(Test262Error, () => {
    Foo.bar(a, b, c);
  });
  assert.sameValue(valueOfCount, 1);
}

// Test 2
{
  let valueOfCount = 0;
  const badPathError = new Error('This should never be reached.');
  const a = {
    toString() {
      throw badPathError;
    }, 
    valueOf() {
      valueOfCount++;
    }
  }
  const b = {
    toString() {
      throw badPathError;
    }, 
    valueOf() {
      valueOfCount++;
      throw new Test262Error('This is the only path to reach');
    }
  }
  const c = {
    toString() {
      throw badPathError;
    }, 
    valueOf() {
      throw badPathError;
    }
  }
  assert.throws(Test262Error, () => {
    Foo.bar(a, b, c);
  });
  assert.sameValue(valueOfCount, 2);
}

// Test 3
{
  let toStringCount = 0;
  let valueOfCount = 0;
  const badPathError = new Error('This should never be reached.');
  const a = {
    toString() {
      throw badPathError;
    }, 
    valueOf() {
      valueOfCount++;
    }
  }
  const b = {
    toString() {
      throw badPathError;
    }, 
    valueOf() {
      valueOfCount++;
    }
  }
  const c = {
    toString() {
      toStringCount++;
      throw new Test262Error('This is the only path to reach');
    }, 
    valueOf() {
      throw badPathError;
    }
  }
  assert.throws(Test262Error, () => {
    Foo.bar(a, b, c);
  });
  assert.sameValue(toStringCount, 1);
  assert.sameValue(valueOfCount, 2);
}

// Test 4
{
  let toStringCount = 0;
  let valueOfCount = 0;
  const badPathError = new Error('This should never be reached.');
  const a = {
    toString() {
      throw badPathError;
    }, 
    valueOf() {
      valueOfCount++;
      return 1;
    }
  }
  const b = {
    toString() {
      throw badPathError;
    }, 
    valueOf() {
      valueOfCount++;
      return 2;
    }
  }
  const c = {
    toString() {
      toStringCount++;
    }, 
    valueOf() {
      throw badPathError;
    }
  }
  assert.sameValue(Foo.bar(a, b, c), undefined);
  assert.sameValue(toStringCount, 1);
  assert.sameValue(valueOfCount, 2);
}

// Test 5
{
  let toStringCount = 0;
  let valueOfCount = 0;
  const badPathError = new Error('This should never be reached.');
  const a = {
    toString() {
      throw badPathError;
    }, 
    valueOf() {
      valueOfCount++;
      return 2;
    }
  }
  const b = {
    toString() {
      throw badPathError;
    }, 
    valueOf() {
      valueOfCount++;
      return 1;
    }
  }
  const c = {
    toString() {
      toStringCount++;
    }, 
    valueOf() {
      throw badPathError;
    }
  }
  assert.throws(RangeError, () => {
    Foo.bar(a, b, c);
  });
  assert.sameValue(toStringCount, 1);
  assert.sameValue(valueOfCount, 2);
}

Imagine debugging that if those were all in the same file. What if an implementation had a bug in only step 2? Implementations usually put failing test file paths into "skip lists" until they can get around to fixing the bug. Grouping would mean coverage for all 4 steps lands in a skip list somewhere, instead of just "Test 2" for step 2. Also, I don't know about any of you, but I get lost somewhere around Test 3.

ptomato commented 2 years ago

I don't have any claim here; I was trying to express Richard's point from our last meeting. It's possible I haven't done that correctly.

rwaldron commented 2 years ago

@ptomato then I guess I'm saying: given all specs I've read, with the exception of WeakMap and FinalizationRegistry, there is no specification that cannot have its steps tested in sequence with isolation.

rwaldron commented 2 years ago

@ljharb

I frequently do this sort of thing in tests, and test/mocking frameworks all tend to have an "after" or "cleanup" mechanism to queue up the restoration code.

Yes, I'm sure everyone here is deeply and broadly versed in the use of before and after mechanisms. What I'm saying is that implementors of JavaScript engines were the first (and remain the most important) consumers of Test262, and all of them have made clear that tests must be directly runnable as standalone files, as they are in their own test systems. (Additionally, they want to be able to substitute the harness files with their own implementations, whether those are other JS implementations or something built directly into their engine's host environment.) In order to make that work for all engines, each of which has its own runner tailored to its org's specific needs, the most reasonable approach has been and remains: providing as granular a breakdown of specification algorithm steps into individual files as is reasonably possible.
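For context, that standalone-file shape looks like the following (frontmatter keys per test262's documentation; the esid and description are illustrative, reusing the toy Foo.bar from above):

/*---
esid: sec-foo.bar
description: >
  Abrupt completion from ToNumber(a) in step 1 is observed before ToNumber(b).
---*/

const a = { valueOf() { throw new Test262Error(); } };
assert.throws(Test262Error, () => Foo.bar(a, {}, {}));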

ptomato commented 2 years ago

@rwaldron My understanding of Richard's point was not that he was saying it was impossible, but that it was costly. I'm just the rando who volunteered to start a thread about it at last week's meeting :shrug: I think it's worth spending some time on brainstorming solutions that could mitigate that cost, as long as we're clear that they have to fit within the known constraints of test262.

It was definitely not my intention to stage a rehash of the "multiple files vs. monolithic files" argument, and I apologize if I phrased something in a way that contributed to that.

leobalter commented 2 years ago

Just throwing an idea for many tests in a single file:

// this should be in a helper file
function test262Test /* bikeshed the name later */ (name, testFn) {
  const t = globalThis.assert;
  const report = {
    name,
    failures: [],
    assertions: 0
  };

  const assertObj = {
    sameValue(actual, expected, message) {
      report.assertions += 1;
      try {
        t.sameValue(actual, expected, message);
      } catch (err) {
        report.failures.push(err);
      }
    },
    throws(expectedErrorConstructor, fn, message) {
      report.assertions += 1;
      try {
        t.throws(expectedErrorConstructor, fn, message);
      } catch (err) {
        report.failures.push(err);
      }
    },
    /* repeat something similar for the other assertions here */
  };

  testFn(assertObj);

  return report;
}

function captureMultiple(results) {
  // Assert: results is an Array object

  return function test(name, testFn) {
    results.push({
      name,
      testFn: testFn.toString(),
      report: test262Test(name, testFn)
    });
  };
}

function Test262Report(results) {
  // a lot of room for improvement here
  for (const { report: { failures } } of results) {
    if (failures.length) {
      throw new Test262Error(failures.map(String).join('\n'));
    }
  }
}

// a test file that preloads the helper file above

// I create my own results object
const results = [];

// And I create my own test function using the results array.
const test = captureMultiple(results);

test("foo bar step 1", (assert) => {
  let called = 0;
  const a = { valueOf() { called += 1; return 42; } };
  const b = { valueOf() { throw new Test262Error(); } };
  const c = { valueOf() { throw 'something else'; } };

  assert.throws(Test262Error, () => Foo.bar(a, b, c));
});

// At the end I can capture all the results and filter for any failures. 

Test262Report(results);

// TODO: missing support for async
// limits: it only works for built-in tests that don't modify any global property or otherwise mess with the global object.

rwaldron commented 2 years ago

@ptomato

(Quick tone check on myself: I want to make sure you know I'm replying in good faith and by no means intend for any of this to be interpreted as hostile. ❤️ )

Costly in what sense? Is it more costly for authors to be more precise about how tests are written, or for implementations to have ignored, skiplisted (https://github.com/v8/v8/blob/main/test/test262/test262.status or https://github.com/WebKit/WebKit/blob/main/JSTests/test262/expectations.yaml ... maybe don't look for Temporal 😵‍💫 ) failures for years?

I think it's worth spending some time on brainstorming solutions that could mitigate that cost, as long as we're clear that they have to fit within the known constraints of test262.

You mentioned not wanting to rehash "multiple files vs. monolithic files", and I do appreciate that, but from my perspective, spending any more time discussing changes to how Test262 tests are written is just rehashing.

rwaldron commented 2 years ago

With regard to the skiplists I provided: imagine if each of those listed files contained tests covering a whole bunch of steps in an algorithm, where maybe only one step was wrong. Much larger swaths of coverage would be lost as a result.

leobalter commented 2 years ago

It was definitely not my intention to stage a rehash of the "multiple files vs. monolithic files" argument, and I apologize if I phrased something in a way that contributed to that.

@ptomato I'm now confused about what your intention here is regarding multiple files.

In the first message you pointed to an issue (#2746) where we discussed breaking a test file into smaller files, which is related not to spec order but to report granularity. And then I also see:

"one test per file" convention that consumers of test262 expect, it's difficult (and costly in terms of developer time) to test a sequence of steps, each of which may result in an abrupt completion.

So I tried to show an example as a reminder that Test262 already has this practice for capturing sequences of steps, but now it seems that's not the original intention.

I'm really trying to help with what I've experienced in this project, but if I can't understand the goal, it seems like I'm wasting everyone's time.

sarahghp commented 2 years ago

Hi everyone!

I think the intention of this thread was to capture an issue one of the project stakeholders is experiencing, so we can discuss it in the future as a maintainer group, rather than trying to solve the battle of appropriate file granularity once and for all here.

On our group agenda, we still have items about the best way to consider requests for changes and how we want to make decisions about them. File granularity comes up often, and I think that tells us the current solution is not the optimal one, or, if it is, we should make sure we are all on the same page about the problem it solves and why it is not up for change. I think we can do that as a group, but I think the way to get there is to continue collaborating on our processes and values.

I would like to suggest that we leave this here for now, as a reminder, and keep focusing on building a robust process.

leobalter commented 2 years ago

an issue one of the project stakeholders is experiencing

All I'm asking for is clarity on what the issue is here.

I gave an example of how this problem of capturing step order is solved today, regardless of multiple files.

so we can discuss it in the future as a maintainer group

Who can participate in this discussion?

gibson042 commented 2 years ago

Costly in what sense? Is it more costly for authors to be more precise about how tests are written, or for implementations to have ignored, skiplisted (https://github.com/v8/v8/blob/main/test/test262/test262.status or https://github.com/WebKit/WebKit/blob/main/JSTests/test262/expectations.yaml ... maybe don't look for Temporal 😵‍💫 ) failures for years?

Costly in the sense that there's no good way for authors to organize files for testing that algorithm steps are executed in the right order or for maintainers to verify complete coverage for the sequence of steps, and therefore those things tend not to happen outside of trivial cases.

Look at the motivating example, String.prototype.split:

(the value responsible for the requirement of each row is in bold, «-» represents any value, and «"» represents the value in the same column of the preceding row)

| step | receiver | separator | limit | requirement |
| --- | --- | --- | --- | --- |
| 1. Let _O_ be ? RequireObjectCoercible(this value). | **undefined or null** | - | - | TypeError |
| 2.a. Let _splitter_ be ? GetMethod(_separator_, @@split). | object-coercible | **object-coercible for which getting property Symbol.split a) throws or b) returns a non-callable object-coercible** | - | a) thrown value or b) TypeError |
| 2.b.i. Return ? Call(_splitter_, _separator_, « _O_, _limit_ »). | " | **object-coercible for which property Symbol.split is callable** | - | [thrown] value from _separator_[Symbol.split](receiver, _limit_) |
| 3. Let _S_ be ? ToString(_O_). | **a) Symbol or b) object for which coercion to primitive throws or c) object for which coercion to primitive returns a Symbol** | non-object-coercible or object-coercible for which property Symbol.split is null or undefined | - | a) TypeError or b) thrown value or c) TypeError |
| 4. …let _lim_ be ℝ(? ToUint32(_limit_)). | string-coercible | " | **a) Symbol or BigInt or b) object for which coercion to primitive throws or c) object for which coercion to primitive returns a Symbol or BigInt** | a) TypeError or b) thrown value or c) TypeError |
| 5. Let _R_ be ? ToString(_separator_). | " | **a) Symbol or b) object for which coercion to primitive throws or c) object for which coercion to primitive returns a Symbol** | number-coercible | a) TypeError or b) thrown value or c) TypeError |

Note how the source of errors bounces around... e.g., a null receiver and a Symbol receiver can both produce a TypeError, but the latter must not be checked unless the separator is not object-coercible or has a null/undefined Symbol.split property. It is possible to check such things now, but sufficiently cumbersome that it isn't happening—String.prototype.split hasn't materially changed since ES6, but attempting to introduce this more rigorous pattern five years later still uncovered independent bugs in two implementations.
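For a flavor of what such a hand-written check looks like today (a sketch using the same harness conventions as earlier examples in this thread):

// The separator's Symbol.split lookup (row 2.a) must be observed before
// the Symbol receiver would produce its TypeError in step 3.
const separator = {
  get [Symbol.split]() { throw new Test262Error('step 2.a observed first'); }
};
assert.throws(Test262Error,
  () => String.prototype.split.call(Symbol('s'), separator));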

And further note that even errors from the same step can take a different form. Wouldn't it be great to eventually follow a varying but deterministic generative pattern that could cover those nested branches?

// Verify the expected error type from an arbitrary non-stringable receiver
// with independently-bad separator and limit.
let receiver = generateNonStringable();
let separator = generateNonStringable(new UnexpectedError("_separator_ access"));
let limit = generateNonNumberable(new UnexpectedError("_limit_ access"));
assert.throws(
    receiver.errorType,
    () => String.prototype.split.call(receiver.value, separator.value, limit.value),
    "Abrupt completions are propagated from step 3: ? ToString(_O_).");

// Verify the specific code-generated error from a non-stringable receiver
// with independently-bad separator and limit.
receiver = generateNonStringable(new ExpectedError("_receiver_ access"));
separator = generateNonStringable();
limit = generateNonNumberable();
assert.throws(
    receiver.errorInstance,
    () => String.prototype.split.call(receiver.value, separator.value, limit.value),
    "Code-generated errors are propagated from step 3: ? ToString(_O_).");