tc39 / test262

Official ECMAScript Conformance Test Suite

Coding standards for test262 #854

Open littledan opened 7 years ago

littledan commented 7 years ago

I'm trying to get more of the V8 team involved in writing test262 tests (see https://docs.google.com/document/d/16bj7AIDgZLv4WOsUEzQ5NzcEN9_xo095e88Pz8FC5rA/edit# for more about the vision). One disadvantage that I've heard from multiple V8 developers is that it's much harder to write test262 tests that meet the patterns in recent test262 additions, compared to the conventions of V8's "mjsunit" tests. These conventions of test262 include:

- the structured naming scheme for test files
- breaking tests up into many small files, often one case per file
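
For readers who haven't seen the format, here's a minimal sketch of the shape a test262 file takes (the spec section and assertions are illustrative, not an actual file from the suite); each file also carries a copyright/license header above the YAML metadata block:

```js
/*---
esid: sec-array.prototype.includes
description: >
  Array.prototype.includes returns true when the search element is present
features: [Array.prototype.includes]
---*/

// One narrow behavior per file, checked with the harness's assert helpers
// rather than an engine-specific assertEquals.
assert.sameValue([1, 2, 3].includes(2), true);
assert.sameValue([1, 2, 3].includes(4), false);
```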

Both of these come up in the review for #846. I appreciate the work that's gone into making the current test suite meet a high standard of consistency and completeness, but I wonder if it might be worth considering relaxing some of these requirements to be friendlier to contributors.

For me, as a test user (doing test262 rolls in V8 and filing/triaging bugs that are found), these patterns don't really affect the benefit I get from test262 much. But they add a lot of overhead and fear for people who might be porting tests to test262. As a user, I don't have a great way of traversing the naming scheme to find whether there's a test for a particular thing (is the naming scheme documented somewhere to the point where I should be able to do this?), for example. Breaking up tests into lots of little files sometimes helps to narrow down failures more easily, but it's generally possible to find the issue by executing parts of the test individually with local modifications, and it's often necessary to do so when debugging anyway, since even a single small test often has multiple small assertions and could fail in multiple ways.

Maybe the conventions have more value for maintainability of tests, though. I just want to open this question as a cost-benefit equation.

cc @gsathya

caitp commented 7 years ago

I think it's awkward to write multiple async tests in the same file. There are some examples, like the async generator runtime tests that have already landed, which do assertions about order of evaluation. In those cases there isn't really anything that can be done about it, but for #846 it makes a lot of sense to do one test at a time, due to the lack of a RunMicrotasks() helper to keep the test flow looking synchronous / easier to read.
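
For reference, a minimal sketch of the single-async-test-per-file pattern (the promise behavior being checked here is just illustrative): async tests declare `flags: [async]` and signal completion through `$DONE`.

```js
/*---
description: >
  A rejected promise passes its rejection reason to the second reaction handler
flags: [async]
---*/

// With flags: [async], the harness waits for $DONE to be called;
// the trailing .then($DONE, $DONE) reports either success or the
// assertion error thrown inside the handler.
Promise.reject('reason').then(function () {
  throw new Test262Error('The promise should not be fulfilled.');
}, function (value) {
  assert.sameValue(value, 'reason');
}).then($DONE, $DONE);
```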

It's convenient that V8 or JSC tests can put many cases in the same file, but it doesn't really matter that much, imo.

jugglinmike commented 7 years ago

> For me, as a test user (doing test262 rolls in V8 and filing/triaging bugs that are found), these patterns don't really affect the benefit I get from test262 much.

I don't doubt it. From my perspective as a maintainer, I can say that structure helps very much. I think the difference comes down to goals.

Contributors who are working on VMs have their eyes set on correctness of their product, and tests help them verify that. For example, Cait's original set of tests were focused on the problem she was fixing in V8.

Maintainers have a responsibility to make sure the specification is covered consistently. Test262 is easily the largest project I've worked with personally, and it still doesn't cover everything. That's why, after identifying a missing test case, I'm not satisfied with adding one specific test and calling it a day. The omission probably reflects a larger problem, and I'd like to address it "completely" in terms of the specification language.

One of the more frustrating experiences is attempting to answer that question of completeness and being met with a directory full of arbitrarily-named files. I'm forced to build up an intuition about the current coverage state (and in many cases, it doesn't seem like this was a concern even for the tests' original author). I use long, structured file names to document the coverage area and to demonstrate its completeness (more on this below). Likewise, test meta-data forces me to explain my intentions and to prove that they are based on the specification.

> But they add a lot of overhead and fear for people who might be porting tests to test262.

They certainly add a lot of overhead, and I sympathize. I first got involved with this project by importing MJSUnit tests, and my initial contribution required a fair amount of back-and-forth. But because I see the value in the work, I think it's a practice worth keeping. "Fear" is more concerning but also a little vague. Would it be fair to say your teammates are intimidated by the process?

> As a user, I don't have a great way of traversing the naming scheme to find whether there's a test for a particular thing (is the naming scheme documented somewhere to the point where I should be able to do this?), for example.

I wish I had some easily repeatable set of instructions for assigning test file names, one that could also be used to locate files based on the goals of the contributor. But that may be a pipe dream, as it seems equivalent to formally determining test coverage over the entire specification, including all possible interactions.

I can try to describe my process in general terms, though I don't know how useful this will be. For a given test:

  1. identify each detail that influences the behavior under test
  2. enumerate all the alternative expressions of that detail
  3. assign a short identifier to each
  4. (finally) use that scheme to name files that exhaust all possible combinations (a rough sketch of this step follows below)
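
A minimal sketch of what step 4 can look like in practice (the dimensions and identifiers below are hypothetical, not taken from the actual suite):

```js
// Hypothetical dimensions for some test area; each entry maps a short
// identifier (used in file names) to the detail it stands for.
const position = { 'dflt': 'default parameter', 'rest': 'rest parameter' };
const binding = { 'ary': 'array binding pattern', 'obj': 'object binding pattern' };
const context = { 'func': 'function declaration', 'gen': 'generator', 'meth': 'method' };

// Exhaust all combinations to produce one file name per case,
// e.g. "dflt-ary-func.js", "dflt-ary-gen.js", ...
const names = [];
for (const p of Object.keys(position)) {
  for (const b of Object.keys(binding)) {
    for (const c of Object.keys(context)) {
      names.push(`${p}-${b}-${c}.js`);
    }
  }
}
console.log(names.join('\n'));
```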

> Breaking up tests into lots of little files sometimes helps to narrow down failures more easily, but it's generally possible to find the issue by executing parts of the test individually with local modifications, and it's often necessary to do so when debugging anyway, since even a single small test often has multiple small assertions and could fail in multiple ways.

I would be open to this. When you're being very thorough, the distinction between individual tests gets pretty minute, and it can be cumbersome to describe it accurately. That's where you get such classic descriptions as, "Resolving with a resolved Promise instance whose then method has been overridden from a pending promise that is later fulfilled". The more pragmatic approach is to use a more generic description for all your test files, but that limits the usefulness of having multiple files in the first place. I wonder what @leobalter, @tcare, and @bterlson have to say...

leobalter commented 7 years ago

The file structure is still not the best way to confirm coverage.

If we tackle improvements to the conventions, my motivation is mostly a better coverage check. This is my subjective point of view, considering that the time I spend reviewing code is what I really want to improve.

We have tests that require isolated files, but I'm up for experimenting with more test cases in a single file. We would need to establish a good convention for grouping different aspects, and pay special attention to keeping tests atomic/independent, but I'm willing to try it.

One thing we would run into with the current model of assertions: a failure causes an abrupt completion. If we group tests into a single file, an early failure would prevent everything after that point from running. This is one of the reasons I love QUnit, a test framework that doesn't throw on failed assertions.
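
A minimal sketch of the problem (the checks are illustrative): with the throwing assert helpers from the harness, grouping independent checks in one file means the first failure hides the rest.

```js
// Both checks are independent, but assert.sameValue throws a
// Test262Error on failure, so if the first one fails the second
// never runs and its result is never reported.
assert.sameValue(Array.prototype.includes.length, 1);
assert.sameValue(Array.prototype.includes.name, 'includes');
```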

Unlike other test frameworks, we don't have any form of test functions to group assertions, which would also prevent proper reporting. Creating one from scratch is not really a good idea, considering we would also need to account for async.
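
For what it's worth, a hypothetical sketch of what such a grouping helper might look like, just to make the reporting problem concrete (the `group`/`check` names are made up and not part of the harness; async is not handled):

```js
// Hypothetical, not part of the test262 harness: collect failures
// instead of throwing immediately, so every check in the group runs
// and can be reported together.
// (Assumes the harness's Test262Error is available.)
function group(name, fn) {
  var failures = [];
  fn(function check(actual, expected, message) {
    if (actual !== expected) {
      failures.push(message + ': expected ' + expected + ' but got ' + actual);
    }
  });
  if (failures.length > 0) {
    throw new Test262Error(name + '\n' + failures.join('\n'));
  }
}

// Usage (illustrative):
group('Array.prototype.includes basics', function (check) {
  check([1, 2, 3].includes(2), true, 'present element');
  check([1, 2, 3].includes(4), false, 'absent element');
});
```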

domenic commented 6 years ago

I think coverage is best served by actually instrumenting engines (or perhaps anba's reference implementation). That seems way better than trying to do some kind of match-up between spec and tests.