tc39 / test262

Official ECMAScript Conformance Test Suite

Preparing test262 tests is difficult for proposals under development #2756

Open ptomato opened 4 years ago

ptomato commented 4 years ago

In https://github.com/tc39/proposal-temporal we are developing a polyfill in tandem with the proposal text, and using the polyfill to write the test262 tests that are intended to eventually be merged into this repository when the proposal reaches the appropriate stage. The polyfill also has a Mocha- or Jasmine-style test suite which we are intending to rewrite as test262 tests over time.

I've found writing tests for the test262 suite to be much more time-consuming than writing equivalent tests in the Mocha style. The way test262 is set up, with one test per file, (as far as I can tell) no good tools for eliminating code duplicated between files, and fewer custom-assertion helpers than many other test harnesses offer, is a big contrast with the Mocha style, whose describe/it/beforeEach/afterEach hooks make it easy to split tests into suites and perform setup and teardown actions. This is especially apparent while the proposal is under heavy development and it's common to move things around. For example, I've been doing some API renames lately, and I've found that when we rename an API that already has test262 tests written for it, something that ought to take minutes ends up taking hours of my time.
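
To illustrate the contrast (the Temporal calls below are only for the sake of example, and the exact frontmatter is a sketch): in the Mocha style, several related assertions share one describe block and a beforeEach, whereas in test262 each assertion typically lives in its own file with its own metadata.

```js
// Mocha/Jasmine style: shared setup, several assertions in one file
describe('Temporal.PlainDate.prototype.add', () => {
  let date;
  beforeEach(() => {
    date = new Temporal.PlainDate(2020, 9, 8);
  });
  it('adds days', () => {
    assert.strictEqual(date.add({ days: 1 }).day, 9);
  });
  it('adds months', () => {
    assert.strictEqual(date.add({ months: 1 }).month, 10);
  });
});
```

```js
// test262 style: each assertion above becomes its own file, each repeating
// the setup, e.g. test/built-ins/Temporal/PlainDate/prototype/add/days.js
/*---
esid: sec-temporal.plaindate.prototype.add
description: add() advances the day
features: [Temporal]
---*/
const date = new Temporal.PlainDate(2020, 9, 8);
assert.sameValue(date.add({ days: 1 }).day, 9);
```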

Another problem is that there is no good way to preload our Temporal polyfill into the environment, other than feeding it in as one giant bundle via the --prelude switch to test262-harness. (I've tried a few other things, including importing the module sources directly, but found that this was the least bad way to load the polyfill.) Since the Temporal proposal is quite large, the polyfill is as well. I suspect that parsing and executing the giant polyfill bundle over and over again for each single test is the main reason why running our test262 suite takes many minutes, while running the much more extensive Mocha-style suite finishes in under half a second.

Any ideas on how to make things easier for proposals under development?

tl;dr Here are the main points that I've run into:

leobalter commented 4 years ago

The concerns are valid and some ideas already exist. The biggest challenge is time, IMO.

Test262 doesn't have any official test runner, but offers documentation on how to interpret tests when executing them. This is because, since before I got familiar with this project in 2014, there have been many different venues running Test262, each creating its own reports and its own mechanisms to acknowledge failures, skip tests, compare results, etc.

I'll try to simplify this into the bullet items you described:

My personal goal remains for implementations to be able to frequently pull new tests and report execution.

I also have some further considerations:

Most tests need isolation in several respects, and it's just too hard to mitigate the too-many-files problem. One of those respects is the way test failures are reported. As I mentioned above, each forbidden-syntax test requires its own file. A similar case arises when there are multiple different valid syntaxes for the same feature, or different ways to call a built-in. We have some tests that do combine some of those, but this also means that if a single assertion fails, the whole file is reported as a failure.
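
For example, a forbidden-syntax test has to be able to fail at parse time, so it cannot share a file with anything else; a sketch of such a file (the path and details here are only illustrative) looks like this:

```js
// test/language/statements/let/example-forbidden-syntax.js (illustrative path)
/*---
description: let is not a valid binding name in a let declaration
negative:
  phase: parse
  type: SyntaxError
---*/
// The harness call below throws if the file is ever evaluated, guaranteeing
// that the expected failure really came from the parser.
$DONOTEVALUATE();

let let = 1;
```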

The current documentation and test harness do not offer tools to prevent an assertion failure from taking down the other tests. Separate files are required to keep tests atomic, not conflicting with each other. JS engines won't offer a way to load multiple files as independent processes, and, by the way, this is one of the reasons why it takes so long to run all the tests. Test262 is meant to be fully run in CIs and other automated processes, and for that I believe correct and non-flaky tests are much more valuable than execution time.

This creates a problem where the tests for a built-in method, if combined into a single file, might fail on a single assertion, sometimes for the most trivial reason. E.g. Array#concat: there are currently 135 tests for it, and today (2020-09-08) one of those files fails in V8 while 134 pass. If you joined them into one file, test262-harness would tell you that V8 is not compliant, like 100% not compliant, because the single combined file would fail. Keeping the cases in separate files is what lets you list exactly which implementations pass and fail each part. Remember we try our best not to favor a single implementation but to report as tests what we read from the specs.

There are other kinds of tests that would make a big mess of today's test reports if we simply combined files, not limited to built-ins but also language features; for example, JSC and V8 are approximately 95% and 98% compliant with eval, respectively.


With that said, from my years of experience with tests, I'm strongly against using spy methods in tests and it makes zero sense to me; by "full-featured assertions" I see a request to change the API style of the assertions. This is a bit like offering me a huge burn to adapt over 35k test files to match what some - not all - developers are used to on a daily basis. I only wish I could ever see this happen in any other project.

We can improve the API, and we should have that as a goal, but we need to stay pragmatic and keep a pace that allows everyone else to keep consuming this project.


If I'm not mistaken, I remember someone once created an adapter for Jest to stream the test files from Test262 and run each assertion. I believe this would be a really nice approach for proposals. I'd be very interested if we could find volunteers to create adapters that read Test262 and execute the tests within their favorite test runners. This would also help generate a good report.

rwaldron commented 4 years ago

That's a lot to read. The only important thing I want to say is that you're not solely responsible for writing the tests. You can lean on the maintainers here to help you. That also means that everything Leo said holds.

ptomato commented 4 years ago

This is a bit like offering me a huge burn to adapt over 35k test files to match what some - not all - developers are used to on a daily basis. I only wish I could ever see this happen in any other project.

Hold on, I did not mean this as a personal attack in any way! Please understand that. I see problems that are preventing me from achieving what I want to achieve with test262 and I'm willing to help solve them, that's all. I will maybe give this thread some time to cool off and then respond in more detail.

rwaldron commented 4 years ago

I think the issue is that you see what you believe to be a problem, but which is not a problem—in fact it's all very intentional, to serve a very specific purpose: provide the granularity and resolution that engine implementers expect.

leobalter commented 4 years ago

There are 3 different things described in the issue I can summarize here:

  1. Test262 structure could have improvements.
  2. You're facing challenges to write tests for a proposal.
  3. There is a suggestion of multiple approaches, including combining files, adopting a new API based on existing frameworks, and introducing mocking APIs.

My answer tried to address each item. I understand the structure can be massive and feel overwhelming, but there are more challenges if we want to make room for improvements. As I mentioned, understanding the ecosystem of multiple runners is one of them.

As for the challenges of writing new tests, it is unfortunately the case that one needs to get up to speed with the particulars of Test262 to solve an occasional problem. As Rick has mentioned, we can try our best to help. Honestly, I was just waiting for Temporal to get to stage 3 before going through the existing specs and tests.

The last item is the one where I see personal preferences that are really not constructive, and it causes me exhaustion. There are many ways to solve a problem, but introducing a new style based on a niche choice, or expanding our approach with mocking APIs, seems like going to a project that uses, e.g., Ember and suggesting they use React because it would solve a challenge I currently have.

Replacing the framework isn't really the solution, and whoever accepts this is assuming the maintainers will be OK with working through it.

ljharb commented 4 years ago

@rwaldron the problem is that the definition of "implementers" has greatly expanded since test262's inception - from "4 well-funded large companies" to "those, plus infinite users who are intimately familiar with npm's conventions". Anything that only works for employees of the 4 browser vendors no longer fully meets the needs of implementers.

littledan commented 4 years ago

The experience of developers trying to upstream tests into test262 is an important problem. This applies to proposal authors, polyfill authors, engine implementers of any kind of engine (all of which Igalia does), and anyone else who wants to upstream test262 tests. I think it would be valid to deprioritize improvements here due to limited maintainer capacity, but I don't understand the argument that there's no problem to solve.

I'm having trouble understanding what it is about the experience of engine authors @rwaldron is getting at. Isn't it possible to have multiple tests in the same file which keep running even if an earlier test fails?

However, upstreaming tests has been a sort of pain point. I tried to improve the workflow of developing test262 tests locally in V8, for later upstreaming, so that V8 team members would be able to contribute to test262 more easily, but this code was recently removed as it was unused. I think that improving the test contributor experience could help us get more tests into test262, improving coverage further, and @ptomato 's feedback is an important part of that.

leobalter commented 4 years ago

but I don't understand the argument that there's no problem to solve.

Who said that?

leobalter commented 4 years ago

Ok, I have asked to be removed from the admin group for Test262, and I no longer intend to maintain this project or to keep repeating the same answers about things we once set good-faith, long-term goals for.

I will certainly continue to write tests for proposals I'm involved in, and I'm available to support individuals who need specific help contributing to this project, but preferably outside of this public-facing venue.

I'm doing this to give a chance to those who insist I'm wrong, or who have called me a gatekeeper. I'm tired, and my participation here to this day has remained only because of the joy I take in contributing and making an impact.

To the people who in the past asked others to remove me from this project: today you win.

rwaldron commented 4 years ago

I don't understand the argument that there's no problem to solve.

Cool, because no one said that. I said that the thing @ptomato is describing as a problem, i.e. one test per file, is intentional. If @ptomato opens a PR with all the tests in one file and states up front that they don't have the resources, time, bandwidth, whatever, to make more granular tests, I will 100% work with that. The end result would definitely be smaller tests broken up into different files, but it would be easier for @ptomato to contribute. This is the same policy we've had for years about spec steps in metadata, feature tags, etc. Any time someone says "X is a hardship", the response (from me anyway) is "give me what you have, I'll help you get it done".

Isn't it possible to have multiple tests in the same file which keep running even if an earlier test fails?

Sure, there's nothing about Test262 that would prevent someone from creating a runner that consumed concatenated test files, with each wrapped in a try/catch—but tests in Test262 should be limited to testing only one thing per file, to allow for the most flexibility for all consumers of Test262 (this is partially a response to @ljharb as well).
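
A rough sketch of what such a consumer could look like (purely illustrative, with file discovery, harness loading, and metadata parsing omitted):

```js
// Illustrative only (not an official tool): concatenate test files, wrapping
// each one in try/catch so an earlier failure doesn't hide later results.
const fs = require('fs');
const vm = require('vm');

function runConcatenated(files, harnessSource) {
  const results = [];
  const context = vm.createContext({
    __report: (file, error) => results.push({ file, status: error ? 'fail' : 'pass', error }),
  });
  const wrapped = files.map((file) => {
    const source = fs.readFileSync(file, 'utf8');
    return `try {
${source}
__report(${JSON.stringify(file)}, null);
} catch (e) {
__report(${JSON.stringify(file)}, String(e));
}`;
  });
  // Note: a file with a parse error would still break the whole concatenation,
  // which is one reason forbidden-syntax tests need their own files.
  vm.runInContext(harnessSource + '\n' + wrapped.join('\n'), context);
  return results;
}
```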

@ljharb the same argument you've just made could be applied to the specification itself, but surely you're not going to simplify that document into something less specific and intentional ;)

devsnek commented 4 years ago

Just from an implementer perspective, I think a mocha-like setup seems objectively worse in at least two regards:

Also @ptomato some of the issues in your OP (taking a long time, --prelude being weird) seem inherent to test262-harness, not test262.

ptomato commented 3 years ago

Hi! I'd like to start out by saying that one new thing I have learned from this thread is that it's not necessary to use test262-harness with test262. Thank you for clarifying that. So I will hopefully be able to solve at least one of my problems (the long execution time caused by test262-harness parsing and executing the Temporal polyfill for every test) with a custom harness for Temporal.

That out of the way, I'd like to apologize for my role in how this thread escalated. I meant no harm, but I made a mistake regarding this:

This is a bit like offering me a huge burn to adapt over 35k test files to match what some - not all - developers are used to on a daily basis. I only wish I could ever see this happen in any other project.

Hold on, I did not mean this as a personal attack in any way!

Others pointed out to me privately that I interpreted the word "burn" wrong (I had read it to mean "insult"). I read it and thought, "uh oh, I've offended someone, I'd better fix that ASAP and deal with the rest later", and inadvertently escalated an already-tense situation. I realize it was frustrating that my reply came across as if I was only reacting to that one sentence and not to the technical constraints that Leo helpfully summarized.

I also wasn't clear about something in my original message: I'm offering to contribute and make any improvements we agree upon, myself. I regret that I didn't take more care to say this, because I came across as criticizing and pushing burden on the maintainers without offering any help.

I do appreciate having the background on the technical constraints. I only want to make improvements that meet the requirements of the project, its maintainers, and its consumers.

Another clarification: My original message over-emphasized Mocha, which made it look like I was a Mocha fan who just wanted to change everything to my preferred style. The reason I mentioned it was not because I thought we should rewrite the tests in Mocha style, but rather as an example of an environment in which I was quickly able to be productive when I first used it. Going forward, I'll make sure to focus more on specific improvements without confusing the issue.

So, now that some time has passed, below I'd like to try to briefly engage with the technical constraints that Leo summarized.


Tests need to be atomic, and never dependent on each other.

I appreciate this clarification. A new environment is constructed every time for each file to ensure that there is no environment leakage, and that matches my understanding. Other harnesses (like Mocha) set up and tear down things on the same global object which means that there can be interdependencies if someone sticks data onto the global object, and that wouldn't be appropriate for test262.

Today we do have harness helpers that can be included in tests to avoid repetition. This is good for built-ins but not really possible for many files testing syntax, and it becomes even more complicated for forbidden syntax.

I would guess this is true for some, but not all, files? In any case I fully agree with the constraint that tests need to be able to test wrong syntax without blocking other tests from being executed or their results from being reported.
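
(For reference, my understanding is that the built-in helpers are pulled in through the includes key in a test's frontmatter, roughly like this; the Temporal call is only for illustration.)

```js
/*---
esid: sec-temporal.plaindate.prototype.add
description: Illustration of reusing a shared harness helper instead of duplicating it
includes: [compareArray.js]
features: [Temporal]
---*/
// compareArray.js lives in harness/ and provides assert.compareArray,
// so the comparison logic doesn't have to be copied into every test file.
const result = new Temporal.PlainDate(2020, 9, 8).add({ days: 3 });
assert.compareArray([result.year, result.month, result.day], [2020, 9, 11]);
```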

Extending the API requires immediate compliance from a good majority of different test runners. They are pretty different from each other and some are not easy, not to mention the diversity of languages they are implemented in. You can find some implementations using Python, one using Perl, others with Web versions, and it's an ever-growing list. Doing so requires making sure everything remains backwards compatible and that the new API is well documented and well implemented. Browser implementers, to start, won't be happy to have many new tests breaking because there is a new API requiring a change in their interpreters. I've seen some breaking changes and it's exhausting. My personal goal remains for implementations to be able to frequently pull new tests and report execution.

Given this goal, it seems to me that one way to make an infrastructure change work is to ensure any changes to the format are supported in the implementations' runners first, before merging those improvements into test262. As someone intending to contribute to the test262 infrastructure for the first time, I don't know the full extent of the work required, but this explanation has been helpful for me to learn more. I understand it wouldn't be easy, but maybe there's a tradeoff to be made here with developer ergonomics, where some features would be worth the effort?

The current documentation and test harness do not offer tools to prevent an assertion failure from taking down the other tests.

I see that there's not currently a way to do it, and I agree that it's a top priority to keep the current granularity of reporting! I think it ought to be possible, if we add support in all the consumers of test262, to have a file format with multiple tests that are each run independently of each other, but with shared setup and teardown code. But I understand that it isn't as simple as concatenating all the files together and hoping they will run the same :smile: I've done something similar in another project using a DSL (link); happy to discuss this further if it could be interesting for test262.
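
Purely as a hypothetical sketch of the kind of format I have in mind (none of this exists in test262 today, and the subtest() helper is invented):

```js
/*---
description: Hypothetical multi-test file; the subtest() helper below is invented
features: [Temporal]
---*/
// Shared setup; a runner that understood this format could re-evaluate it in a
// fresh environment for each subtest, keeping the tests atomic.
const date = new Temporal.PlainDate(2020, 9, 8);

// Each subtest would be run and reported independently, so a failure in one
// doesn't hide the results of the others.
subtest('add days', () => {
  assert.sameValue(date.add({ days: 1 }).day, 9);
});
subtest('add months', () => {
  assert.sameValue(date.add({ months: 1 }).month, 10);
});
```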

Test262 is meant to be fully run in CIs and other automated processes, and for that I believe correct and non-flaky tests are much more valuable than execution time.

I agree that correct, non-flaky tests are a top priority as well. I do consider execution time important even on CIs, as it's been a bottleneck in Temporal, and I believe that there must be ways to reduce the execution time without compromising correctness. That said, I think my specific concern here is probably obsolete, because it can be solved with a custom test harness for Temporal.