How do we improve contributions?

leobalter commented 7 years ago

As a maintainer of this project, I've been informed of feedback from implementors that Test262 is too complex to contribute for.

This is something I want to ask on what should be improved and how.

My vision on this project is certainly biased as I'm already used to it, but I believe we should improve the project's documentation, specially for the frontmatter and the project structure, where we most need to address fixes in the PRs processes.

The complexity of this project might also come to the fact Test262 is a test suite consumed by many players, from browser engines to transpilers and parsers, we need to maintain a functional communication from out interface to all of them.

This also means that every moment we relax on other contributions, we get feedback from other consumers - if not from the same engine or library, that now something is wrong.

Some work might seem too much, when they are actually providing tools to enhance the use of this suite at large.

Beyond the external feedback, I also have seen issues in PRs, e.g.: the contributor do not engage in addressing review feedback, most of the time abandoning their own work, other times just rejecting everything. For most of these cases maintainers volunteer to land each work as it is to fix it later, usually in our own open source time (aka non work time).

The rigor is still not even close to what we have at the specs themselves.

How do we improve Test262 so it becomes easier to contribute to, but not as a trade off for the quality?

ljharb commented 7 years ago

Remove the redundant boilerplate (i think that's "frontmatter"?) from every test file. Picking some at random: https://github.com/tc39/test262/blob/master/test/built-ins/ArrayIteratorPrototype/next/Float64Array.js#L1-L2 https://github.com/tc39/test262/blob/master/test/built-ins/Array/15.4.5.1-5-1.js#L1-L2
automated tests currently only run a linter; it'd be nice if there was some way to assign a reference implementation to test files, so they could actually be exercised against it.

azz commented 7 years ago

As someone has contributed once to this project, I found it OK to get started with, but I found referring to existing tests a little too much. Perhaps adding single test generator that prompts you through the kind of questions you need to think about when writing a test.

esid
description
features
org/company/contributor name
negative test case?
etc, etc.

Maybe a yeoman generator or just a Inquirer.js prompt?

leobalter commented 7 years ago

@ljharb I really want to remove this copyright header from every test file. That's not part of the frontmatter, but maybe you're also commenting on more items in the header?

leobalter commented 7 years ago

Maybe a yeoman generator or just a Inquirer.js prompt?

by many times I thought of this and it might be solved by an independent npm package just as test262-harness and linked from Test262 as a helper tool. Thanks for this idea!

domenic commented 7 years ago

I think this is a matter of priorities. To me, the most important thing for a testing project is getting tests. Those tests don't have to be "nice" or "pretty", but they have to be "good". All that means is that they test something that is of interest to consumers of the test suite.

test262 has a lot of rules that make it reject "good" tests because they are not "pretty" enough. These prevent contributions. I personally do not feel comfortable contributing to test262, whereas I am one of the most prolific contributors to web-platform-tests, a similar effort for web specifications.

The difference is that all that is required in web platform tests is that your tests are good and they are reviewed by one other person familiar with the subject matter for correctness. This bears emphasizing: there are no gatekeepers; any reviewer suffices. You just need a quick check that the test is reasonable. You don't need to follow any other rules.

Concretely for test262, this would mean to me:

Removing any frontmatter or other metadata requirement. This is crucial. The burden should be on consumers that need this metadata to curate it, not on test writers who don't need it and are just trying to contribute tests.
Very course-grained folder structure where it is obvious where to put tests. No bikeshedding on file names. This seems OK today from a quick browse; I don't see many folders more than two levels deep.
Allowing anyone to be a reviewer, and not a blessed set of gatekeepers.
Reviewer feedback should be focused on test correctness. Any stylistic or clarification feedback should be optional and not block merging. The only time a reviewer should not say "LGTM" is when the test is incorrect with respect to the spec. Otherwise they should say "LGTM" or "LGTM with some suggestions". The test author can then accept these suggestions, or ignore them. Either way, we now have new test coverage!!!

Tests are not specs. They can be useful even if they are not pretty. Being rigorous in individual tests should not be a goal. "Quality" should be measured by correct tests, not pretty tests, or tests with metadata that some subset of the consumer population cares about.

syg commented 7 years ago

The only frontmatter I have found truly useful is a feature list. It's useful because it's actionable, so that implementations can quickly skip groups of tests across various subdirs. The rest is rather onerous, and I agree with Domenic here. The bit about finding the right section of spec text to copy/paste is especially off-putting to contribution.

gsathya commented 7 years ago

rwaldron commented 7 years ago

@gsathya yes, related: https://github.com/tc39/test262/issues/854#issuecomment-280880594

leobalter commented 7 years ago

@domenic Removing any frontmatter or other metadata requirement. This is crucial. The burden should be on consumers that need this metadata to curate it, not on test writers who don't need it and are just trying to contribute tests.

I'm afraid this doesn't match the feedback from V8 we got on #854 for at least the features, as @syg also said here. We've been working to improve the features metadata and this is getting positive feedback from different consumers.

@syg The bit about finding the right section of spec text to copy/paste is especially off-putting to contribution.

This brings a high cost for long term maintenance. Tests are valid if they match a documentation reference (the specs), if we don't set this reference, they might be pretty confusing to even tell why they even exist.

I'm interested to find a point where Test262 is better for maintenance and consumption, as I said before, having it properly for all the consumers is not really easy.

leobalter commented 7 years ago

Please don't consider my answers as final. I'm planning to write a full report on the project, its historic decisions and evolution, addressing all the feedback from here. It's also unfair to just answer anything here without an actionable plan.

syg commented 7 years ago

This brings a high cost for long term maintenance. Tests are valid if they match a documentation reference (the specs), if we don't set this reference, they might be pretty confusing to even tell why they even exist.

Are you saying currently, someone manually audits tests to see if they're still relevant and refer to the right sections of the spec? That must be a massive amount of labor and doesn't seem scalable. That aside, it's also brittle. Spec refactorings happen all the time that would make the copied text in the tests stale.

domenic commented 7 years ago

Yes, I think if people are interested in doing that, they should not put the burden on test authors. They should maintain their own external metadata mapping which those of us who just want useful tests can be unburdened by.

bterlson commented 7 years ago

Are we agreed at least that a useful test should describe what its trying to do? I find that I need descriptions quite a bit. File names are a poor substitute. I'd argue strongly for keeping both description and features, but I think the rest are fine as optional. Also means we don't have to care at all about file names.

domenic commented 7 years ago

I think of such descriptions like I think of code comments. They are necessary if it's not clear from the code, perhaps because you're testing something very strange, or because you need to convey the "why" of what you're testing. They should never repeat the "what" of what you are testing.

See also https://blog.codinghorror.com/coding-without-comments/ , which points out how often it's better to just use good variable names, etc., instead of having to write a "what is this doing?" comment.

Even if the test is hard to understand, I don't think comments/descriptions should be mandatory. A test is still useful if you can run it and see that it fails in your engine but passes in others. Ideally a reviewer familiar with the subject matter will give suggestions asking for more comments, when appropriate, but it doesn't need to be enforced more so than any other good code review hygiene.

domenic commented 7 years ago

I'm afraid this doesn't match the feedback from V8 we got on #854 for at least the features

I don't see any such feedback from V8 team members (much less "V8" as an entity) in that thread. The word "feature" doesn't appear, and the two V8 team members in that thread are talking about how the current system makes it hard, especially with regard to the single-file-per-test structure.

littledan commented 7 years ago

I agree with many of the comments earlier on this thread. In particular,

I find it onerous to include the relevant spec sections under info when writing tests, and have not found this useful when debugging test failures. When a failure happens, I typically have to read a much broader section of spec text. I don't see how it's possible to do code reviews without looking at a broader section either.
Some sort of description/comments are a useful clue when starting to debug a failure or review a test. Since we already have a description frontmatter piece, I don't see a strong point for removing it, but if someone wants to write the description in comments in the code instead, I also don't see the harm. Maybe they aren't needed for some really obvious tests, though; definitely copy-pasted descriptions for the sake of having a description is not useful, and I don't see the value in working hard to write a description within any particular editorial constraints (such as being around one line/sentence).
I am very unhappy about the removals of certain tests for "redundancy", e.g., here. These were tests written by Google engineers as part of the original tests that went into test262. They test that the global object follows certain behavior that other objects have. Due to how the web platform works, the global object will always have some special support code in the JS engine--it is really useful to have tests like this. I can only imagine that they were failing in some browsers in the past. Removing correct tests decreases the quality of the test suite and risks regressions in implementations. In general, it would be great if we could upstream even more regression tests, as an error in one implementation is more likely to be an error in an other implementation. Duplicate tests can be removed by maintainers later, as a low-priority background task, but in general, the cost of a duplicate test is pretty low (just means it takes longer for things to run).
Generally, in contributing to test262, I find myself worrying a lot about whether something that I want to test will be deemed out of scope or not formally correct enough, as the above tests are. This is a big drain, and it makes me want to contribute tests to another place (as well as move the deleted tests there), though I haven't gotten around to that.
Frontmatter does have some meaning and utility. As noted earlier, features are broadly useful for many implementations. includes, flags and expected success/failure are important for the operation of the tests themselves. Somehow, it seems that these features are difficult for people to learn about for new test contributors; I'm not sure what could be done to make them more friendly.
Filenames which meet current guidelines are pretty hard to think of, and they have also been hard for me to understand and not useful when trying to understand what a test does. I think it's a fine thing if some test authors want to follow certain conventions when writing their file names, but it's a barrier for contributors without providing much value for users.

It seems like current test262 reviews focus on the aspects that we're discussing in this thread as potentially not so important. What is sometimes missing from reviews is a close look at correctness and identification of further areas to test. Frequently, when trying to run new test262 versions on V8, I encounter incorrect tests. At the same time, the V8 team has found issues where the test262 tests had a big blind spot not identified by review.

When trying to improve test completeness, it seems like there's disproportionate attention put on steps which cast types and check the names and lengths of functions. These are useful to have, but not enough when trying to get at completeness. The review which adds the most value would be based on a detailed understanding of the specification and its edge cases, and check that each is hit, or that they are planned to be hit elsewhere in the test plan, or otherwise comment that it would be nice to test those cases in a follow-on patch.

For most of these cases maintainers volunteer to land each work as it is to fix it later, usually in our own open source time (aka non work time).

Let's separate the two questions of resources for test262 maintenance from the policy that we'll take for coding standards. For example, I have commit permissions in test262, and frequently read incoming patches, but I've been holding back on merging patches not LGTM'd by others on the advice of other test262 maintainers, as I have a somewhat more accepting view on some of the issues under discussion here.

@ljharb I really want to remove this copyright header from every test file. That's not part of the frontmatter, but maybe you're also commenting on more items in the header?

I don't think removing copyright notices is a very productive path for test262 right now. Writing copyright headers is easy--you just copy-paste an existing one and put your name in. @leobalter has brought this issue up at TC39, and the response from @bterlson was that this would require getting laywers involved and isn't worth it. @bterlson has already spent a bunch of time with lawyers getting test262 to a state where it can accept contributions, and the copyright line is part of that. Many open source projects use a copyright header, and developers tend to be used to it. This is just not worth our time to look into further. By contrast, the other frontmatter aspects are both self-imposed and more work to write.

I don't see any such feedback from V8 team members (much less "V8" as an entity) in that thread.

Not sure where it comes up, but I've definitely asked for more feature tags, and made more use of them in V8's test262 runner. Anyway, whether we should have feature tags in tests is separate from who should write them--we could allow tests to be committed which skip the tags, and let others come along later to repair them.

automated tests currently only run a linter; it'd be nice if there was some way to assign a reference implementation to test files, so they could actually be exercised against it.

There's no JS reference implementation, so I have no idea how we can do this.

Very course-grained folder structure where it is obvious where to put tests. No bikeshedding on file names. This seems OK today from a quick browse; I don't see many folders more than two levels deep.

The folder structure is definitely more than two levels deep. For tests for functions/methods, the folder structure makes a ton of sense and works well. For tests on the grammar, it's sometimes a little less clear where to put things, and if people put tests out for review which put them in the wrong place, maybe we should not bother nit-picking too much.

Allowing anyone to be a reviewer, and not a blessed set of gatekeepers.

It's important that we maintain correctness for the tests, otherwise it's a burden/source of confusion for test users, but I agree that we could broaden the set of maintainers. A radical solution, which would be very nice to have in practice, would be to adopt two-way sync, as has been done for Web Platform Tests. This means that you allow reviewers for any implementation to review tests, and they will be automatically uploaded to test262 without any further review. I've been jealous of two-way sync for a while, but I don't see how we could square this with the current code review culture that we have in test262. Two-way sync has lead to significantly more test contributions from Chrome and Mozilla; by contrast, we have very little test262 contributions from browser vendors today. cc @foolip who has worked on two-way sync in Chrome.

rwaldron commented 7 years ago

I find it onerous to include the relevant spec sections under info when writing tests, and have not found this useful when debugging test failures. When a failure happens, I typically have to read a much broader section of spec text. I don't see how it's possible to do code reviews without looking at a broader section either.

This is and has always been optional. Edit I happened to come across this example while reviewing a PR for BigInt tests: https://github.com/tc39/test262/pull/1251/files#r141931873

Some sort of description/comments are a useful clue when starting to debug a failure or review a test. Since we already have a description frontmatter piece, I don't see a strong point for removing it, but if someone wants to write the description in comments in the code instead, I also don't see the harm.

Also, not an issue: go for it. The "description" metadata should only be used as a short explanation of what's being tested. I'm surprised that anyone thinks writing a single sentence is a burden.

definitely copy-pasted descriptions for the sake of having a description is not useful,

I agree, authors should make an effort to write better, more specific summary sentences.

I am very unhappy about the removals of certain tests for "redundancy" ... Due to how the web platform works, the global object will always have some special support code in the JS engine--it is really useful to have tests like this. I can only imagine that they were failing in some browsers in the past ...

A reasonable solution to this would be to restore the tests and add a flag "browser" (or similar) that would communicate to test runners that they should not run this test unless the host is a browser.

Generally, in contributing to test262, I find myself worrying a lot about whether something that I want to test will be deemed out of scope

I avoid this by only writing tests that are based on the normative specification.

Somehow, it seems that these features are difficult for people to learn about for new test contributors; I'm not sure what could be done to make them more friendly.

I believe that the experienced contributors should just tell the new contributors what is missing and why that thing is relevant, and then they can make the necessary changes. This is what I've always done, and will continue to do for all new contributors.

Filenames which meet current guidelines are pretty hard to think of

These "current guidelines"? "Test cases should be created in files that are named to identify the feature or API that's being tested." just means "Don't name them 'S10.1.6_A1_T3.js'"

What is sometimes missing from reviews is a close look at correctness and identification of further areas to test.

Can you give some examples of such reviews, I think that would help me to improve as a reviewer and contributor.

Frequently, when trying to run new test262 versions on V8, I encounter incorrect tests.

How frequently? Are you reporting these incorrect tests?

At the same time, the V8 team has found issues where the test262 tests had a big blind spot not identified by review.

Paradoxically, Test262 maintainers have been told that they should be more flexible in accepting contributions that do not provide complete coverage, therefore allowing "blind spots". When these are reported, are the reports ignored? If so, can you point to examples?

When trying to improve test completeness, it seems like there's disproportionate attention put on steps which cast types and check the names and lengths of functions.

Anything that's defined in the spec and is observable from user code must be tested.

The review which adds the most value would be based on a detailed understanding of the specification and its edge cases, and check that each is hit, or that they are planned to be hit elsewhere in the test plan, or otherwise comment that it would be nice to test those cases in a follow-on patch.

That sounds like an apt description of the reviews one can expect to get when contributing to Test262. I think it's also important to keep in mind that there is a substantial time and resource requirement for these reviews.

whether we should have feature tags in tests is separate from who should write them--we could allow tests to be committed which skip the tags, and let others come along later to repair them.

I fully agree with this.

For tests on the grammar, it's sometimes a little less clear where to put things, and if people put tests out for review which put them in the wrong place, maybe we should not bother nit-picking too much.

Also agree—if there is no other issue with a contribution, then moving the file shouldn't be a blocker (the reviewer/maintainer can do it after). If there are other changes to make, it's not a hardship to also move a file.

foolip commented 7 years ago

A radical solution, which would be very nice to have in practice, would be to adopt two-way sync, as has been done for Web Platform Tests. This means that you allow reviewers for any implementation to review tests, and they will be automatically uploaded to test262 without any further review. I've been jealous of two-way sync for a while, but I don't see how we could square this with the current code review culture that we have in test262.

I've never been involved with ECMAScript standardization or test262, but just based on what we've seen in wpt, I think people who do work on JS engines would benefit from two-way test sync as well. Some WIP documentation of Chromium's system by @Hexcles is available: Blink WPT Sync Workflow

Two-way sync has lead to significantly more test contributions from Chrome and Mozilla; by contrast, we have very little test262 contributions from browser vendors today. cc @foolip who has worked on two-way sync in Chrome.

That has been my impression too, and I just did some commit counting and sent Has two-way sync increased contributions? Yes! to the ecosystem-infra mailing list. Tripling the commit count is better than I would have guessed :)

syg commented 7 years ago

I've never thought about a 2-way sync before. Putting on my ex-implementer hat, the more I think about it the more I like it.

rwaldron commented 7 years ago

@foolip thanks for the suggestion!

rossberg commented 7 years ago

@foolip, before you jump to conclusions, there are many reasons why engine writers do not contribute more of their tests to test262, I doubt that the lack of automatic sync is even among the top 10. In particular, the vast majority of tests that engine writers produce are not conformance tests but engine-specific white box tests that exercise engine-specific code paths and optimisation pipelines in multitudes of engine-specific ways (and often are regression tests). In many cases (at least for V8) such tests even use non-standard magic for set-up. It would simply not work to auto-sync those.

littledan commented 7 years ago

@rossberg-chromium I'm not sure if that's true. I haven't been the most prolific V8 developer, but many of the tests I've checked into V8 are conformance tests, getting at edge cases that test262 might miss since I'm trying to exercise the corner cases of my implementation.

You'd expect regression tests to be the worst here in terms of testing implementation details, right? Just looking at V8's mjsunit tests in the regress subdirectory, 792 out of 1746 tests don't contain a % or set command-line flags, which are the main mechanisms used non-standard magic setup. I opened a few of these tests at random, and all but one seems valid in any engine (the remaining one is testing the internals of the shell). If I were developing another JS engine, I'd love to get 792 free tests of edge cases that were difficult to get right elsewhere.

For V8 developers to contribute to test262, I made a manual process , but it's gotten extremely little use. I don't think test262 code review is the limiting factor, since so few tests have even been attempted. But maybe you have other suggestions to increase use.

ajklein commented 7 years ago

@rossberg-chromium given my experience in Blink, where most developers used to write engine-specific tests and switched over time to writing web-platform-tests with two-way sync, I think there's some empirical evidence that two-way sync provides a valuable path for committing conformance tests to a central repository. I'd be interested in seeing if we can move test262 more in that direction.

rossberg commented 7 years ago

On 10 October 2017 at 17:41, Daniel Ehrenberg notifications@github.com wrote:

If I were developing another JS engine, I'd love to get 792 free tests of edge cases that were difficult to get right elsewhere.

It is not free, though, because somebody will have to start doing the extra work of labelling/sorting tests appropriately and building the infrastructure to separate them from others for the purpose of syncing. And various other mishaps can arise.

Also, you absolutely do not want your own regression tests to be sync'ed in two directions, because regression tests normally should never get modified or deleted.

littledan commented 7 years ago

This is and has always been optional. Edit I happened to come across this example while reviewing a PR for BigInt tests: https://github.com/tc39/test262/pull/1251/files#r141931873

That's a relief! Maybe this could be made more clear in code reviews, e.g., by saying "Optional: You could include a spec snippet here". (You see me commenting on the spec snippet; I was just trying to be helpful within my understanding of the policy, and in this thread we're discussing the policy.) I'll stop including spec snippets in my tests in this case.

I am very unhappy about the removals of certain tests for "redundancy" ... Due to how the web platform works, the global object will always have some special support code in the JS engine--it is really useful to have tests like this. I can only imagine that they were failing in some browsers in the past ...

A reasonable solution to this would be to restore the tests and add a flag "browser" (or similar) that would communicate to test runners that they should not run this test unless the host is a browser.

I don't quite understand. Those tests are valid regardless of embedding environment. The only maybe-contentious part was defining a global variable, but tons of tests (and the infrastructure) assume that they can do that too. Why bother with a tag?

Generally, in contributing to test262, I find myself worrying a lot about whether something that I want to test will be deemed out of scope

I avoid this by only writing tests that are based on the normative specification.

Lots of tests are deemed out of scope even though the normative specification implies them. For example, those tests that I linked above with the removal.

There are also gray areas, such as tests about Intl which test things based on particular locales. It'd be nice to share these between implementations. You can feature-test for a locale being present, but you can't normatively guarantee that, if a locale is present, it will behave in a certain way. However, without this sort of thing, it's hard to write anywhere near decent tests for Intl. It'd be great to have a place to share tests where we can use fuzzy judgements about whether it's probably useful and probably correct sense across changes to locale data over time.

I believe that the experienced contributors should just tell the new contributors what is missing and why that thing is relevant, and then they can make the necessary changes. This is what I've always done, and will continue to do for all new contributors.

Some contributors have had the experience of being surprised and confused by what came up in code reviews. It's great that we have people like you doing reviews, and I don't have a great idea in mind of what to change (apart from thinking carefully about which requirements are really important).

These "current guidelines"? "Test cases should be created in files that are named to identify the feature or API that's being tested." just means "Don't name them 'S10.1.6_A1_T3.js'"

I guess I falsely felt some kind of pressure when writing tests to choose a really good name for the file, and other contributors have expressed similar concerns to me. If the guidelines are that relaxed, then that's a relief.

How frequently? Are you reporting these incorrect tests?

Yes, I file bugs for it, and most of them get fixed promptly. You can see the bugs I've filed here, though some are coverage bugs.

Paradoxically, Test262 maintainers have been told that they should be more flexible in accepting contributions that do not provide complete coverage, therefore allowing "blind spots". When these are reported, are the reports ignored? If so, can you point to examples?

Actually, I don't think there's a contradiction between accepting partial test contributions and identifying areas of omission. You can LGTM a patch for being correct and document the omissions for future test authors at the same time.

For a mixed example, in this review, the reviewer identified some good areas to cover, which is a good model for what I was suggesting, but the PR missed computed property names, discovered only later. No one is perfect, but the ideal, magical test262 review could notice that sort of thing. (Funny story for two-way sync fans: A test for that case had been checked into V8 months earlier.)

When trying to improve test completeness, it seems like there's disproportionate attention put on steps which cast types and check the names and lengths of functions.

Anything that's defined in the spec and is observable from user code must be tested.

I agree that we should have tests that are as complete as possible (though I don't know if we can ever satisfy it--things have unbounded interactions with each other, and this is just one of those interactions). I'm not saying it shouldn't be tested, it's more that I've seen these as a main focus of some reviews, when there are also other areas where I haven't seen as much attention.

I'd be surprised if so much attention is paid to these aspects in tests for other parts of the web platform that have the benefit of WebIDL and generated APIs based on a "header file" format. But given how JavaScript is more complicated and disorganized when it comes to these casts and things, it might be justified here in a way that's not there.

That sounds like an apt description of the reviews one can expect to get when contributing to Test262. I think it's also important to keep in mind that there is a substantial time and resource requirement for these reviews.

I'm not saying that it's possible to do better than the reviews we have now with current maintainer resources, only saying that this would be the most useful thing to contribute to the extent that maintainers are available.

thejoshwolfe commented 7 years ago

I have some ideas about how to make the contributing document less intimidating. Perhaps that can be a separate discussion thread.

leobalter commented 7 years ago

Thanks for the feedback, everyone.

@rwaldron and I wrote a report that is now published on the project's wiki including a general plan for further improvements.

https://github.com/tc39/test262/wiki/Test262-Technical-Rationale-Report,-October-2017

@ajklein: ** I'd be interested in seeing if we can move test262 more in that direction.

Sure, this seems very interesting and I'd love to follow up about this! We appreciate the collaboration from all of the project targets. Although, we need to figure out how to make two way sync beneficial for all consumers of Test262.

domenic commented 7 years ago

It's a shame this ended the way it did, with a large document which seems geared toward justifying the current structure and maintainers' preferences without taking into account the desire for change from several participants on this thread. Oh well.

thejoshwolfe commented 7 years ago

It sounds like perhaps Test262 is not the place for two-way synced tests. Perhaps there could be another project which facilitates sharing javascript tests between consumers. It sounds like Test262 wants to stay close to the spec, which means it can't provide tests for implementation-defined behavior that is extremely common among implementations.

Here's an example of a test that depends on non-standard behavior: https://cs.chromium.org/chromium/src/v8/test/mjsunit/regress/regress-707066.js . The non-standard behavior is that a stack overflow is thrown as an exception. That seems like a useful test to share between implementations that share that non-standard behavior, but it doesn't fit Test262's vision of staying close to the spec.

Perhaps the vision that @domenic, @foolip, @ajklein, etc. have for a quantity-over-prettiness collection of tests is still a good one, but should be pursued separately from Test262's spec-oriented approach.

littledan commented 7 years ago

Although, we need to figure out how to make two way sync beneficial for all consumers of Test262.

If you start with two-way sync between certain browsers and test262, then all consumers can benefit. This is actually the current state of two-way sync for wpt in browsers--only some browsers (Mozilla and Chrome IIRC) have two-way sync infrastructure in place, but all benefit by having more tests.

Your document spends a while talking about test262 as a test framework for all things in the JavaScript ecosystem, not just full implementations capable of being embedded in browsers. I think that's a great goal. However, to the extent that it conflicts with as-complete-as-possible testing in browsers, I hope we can work through these issues and think about things as a cost-benefit analysis. I think native implementations are very important, and we should have really good testing somewhere. If it needs to live outside of test262, maybe that's OK, but I like the current setup where you don't have to straddle multiple repositories when writing and running tests. JS engines are set up pretty well to run test262 tests, and it would be a pain to hook up another repository as well. (These tests would run in broader Chrome tests, but not in the V8 shell, and not before each commit, the way test262 tests do.)

@rwaldron and I wrote a report that is now published on the project's wiki including a general plan for further improvements.

Of the things in this document, I still think that work on eliminating copyright headers is not the best use of time. Lots of lawyers and little benefit. I'd be interested in hearing more from @ljharb about why he suggested that as a place to change things.

There have been issues about maintainer resources raised on this thread. Maybe a good complement to the other documents would be information about what code reviewers need to keep in mind before approving and committing tests. This could lighten your load a little. If we move to two-way sync, such a document will be useful for anyone who reviews patches against anything that's doing two-way sync with test262.

Your document includes "meaningful test names". In V8, files generally have meaning too, but the meaning might be a little obscure or not so pretty. For example, there is a test whose file is called "modules-import-5.js". IMO this name is fine--it relates to module imports, and it's one of many. Would that be acceptable?

Domenic raised the issue of making description optional, deferring to a case-by-case judgement about whether the test is clear by itself. You mentioned a goal of making descriptions more clear. How does this relate to requirements for code review?

It'd be nice if your document could address test deletions as well. I still don't have a clear idea of what makes a test out of scope, the way that some tests have been deleted for that reason.

littledan commented 7 years ago

It sounds like Test262 wants to stay close to the spec, which means it can't provide tests for implementation-defined behavior that is extremely common among implementations.

I think we could have two-way sync anyway; there are tons of tests that V8 has that could be two-way sync'd without relying on implementation-defined behavior. For two-way sync in Chrome, there's one directory that's sync'd and another that's not. You'd ideally write all tests that don't rely on implementation-specific behavior in the sync'd directory, and the implementation-specific ones in the other directory.

There's another question which you could call "prettiness", which is sort of unrelated to whether tests are indicated by the spec. This has to do with the level of abstraction in tests, the description and info fields, etc, and nothing to do with which semantics the tests are covering.

And there's another question about very theoretical spec compliance issues, which relate to "any embedding environment" kind of concerns--it'd be possible, per spec, to construct a weird embedder with certain properties. I think so much is possible this way (especially if you take into account that the web has deliberate spec violations, and that an embedder could shadow any of the built-ins) that we just have to solve these issues on a practical, case-by-case basis of real embedding environments, rather than theorizing too much. And when we do encounter such a conflict, try to find a way to keep the test running on the environments which don't invalidate it.

ljharb commented 7 years ago

Eliminating copyright headers would take virtually zero of our time; certainly it'd involve lawyers (but since the majority of projects do not have in-file headers, it seems like something that they could resolve relatively quickly, given the precedent) - but that legal debating can happen in parallel to whatever we choose to do here. It just doesn't impact our time at all, until a change is approved.

I mentioned it because it's redundant, because most projects don't actually do it in my experience, because legal boilerplate scares off new contributors, and because it's yet another straw on the metaphorical camel's back of steps contributors have to take to comply with requirements.

leobalter commented 7 years ago

@ljharb:

but that legal debating can happen in parallel to whatever we choose to do here

+1. And that's already happening.

and because it's yet another straw on the metaphorical camel's back of steps contributors have to take to comply with requirements.

exactly.

littledan commented 7 years ago

While we're on the subject of improving contributions, I wanted to mention that @bakkot and the others at Shape Security have assembled a set of JS parser tests here. They are currently not integrated into test262 and are held separately. For V8, at least, there isn't automated infrastructure to pull them down and run them the way that there is for test262. It might make it easier for test users if these could be included in the same repository, but then it might make test262 messier. We discussed where the tests should go at TC39 once, with test262 maintainers arguing against including them in the main repository.

tc39 / test262

How do we improve contributions? #1258