tc39 / test262

Official ECMAScript Conformance Test Suite

Criteria for test relevancy #650

Closed · jugglinmike closed this issue 8 years ago

jugglinmike commented 8 years ago

From time to time, we receive suggestions for tests based on known bugs in existing implementations. These tests usually assert some expected behavior that is implicitly defined by the specification. In other words, one wouldn't expect them to fail unless they had some domain knowledge about specific implementation details.

This aspect concerns me somewhat because it can make it harder to set expectations for contributions. If there is a known bug in 3 major engines (e.g. gh-649), then the value of the test is clear. But if, for example, Espruino alone has some surprising behavior in a very specific case, should Test262 include tests for that?

A simple policy would be, "yes, we accept any test as long as it conforms to the ECMAScript standard." But, as I mentioned elsewhere, an "anything goes" policy like this can make reviewing "greenfield" tests more difficult: in the worst cases, the contributor would feel indentured to the arbitrary whim of their reviewer(s). I sometimes worry that @bterlson fell victim to this in the months-long review process for async functions.

I have personally submitted implementation-specific tests in the past, but I've always had mixed feelings about it.

I think what would satisfy me would be a policy like this:

We accept tests for behaviors that are explicitly defined in the ECMAScript specification. In addition, if a bug has been exhibited by two or more independent implementations, we will accept tests asserting the correct behavior.

Does this formalism seem helpful? Or is it just a case of my own pedantry run amok?

littledan commented 8 years ago

I don't think we need a strict policy like this about which tests are irrelevant. It does seem reasonable for some test262 contributors to prioritize their work this way. But I don't see any reason to reject contributed tests that end up passing in 3/4 browsers. I also don't know how one would determine whether a test covers explicitly defined behavior or bugs; after all, a bug just means that an implementation doesn't follow explicitly defined behavior, right? A vague definition sounds fine as a way for you and others to prioritize your own work, but not as a way to keep tests out.

bterlson commented 8 years ago

I think it would be good to have rough guidelines, but it seems very hard to define them rigorously enough to remove a significant amount of subjectivity. Even in the realm of clear spec-targeting tests there is an almost infinite space of tests we could require (e.g. testing with every sort of value, boundary-value analysis, etc.). Eventually these sorts of tests cross over from useful to useless, with the exception of those that are explicitly useful because of known implementation bugs.

Since the space of valid spec-targeting tests is infinite, it might actually be the case that contributors' time does an effective enough job of gating test contributions. If someone actually goes to the trouble of writing a test and submitting a PR, chances seem very good that it's a useful test for some reason or other.

In other words, documenting the minimal set of tests someone should write for a feature is very good, but I'm not convinced we should document what tests we don't want.

jugglinmike commented 8 years ago

Thank you both; your perspective brought me back from the cliff of "eliminate all subjectivity from everything" (I find myself there often).

When it comes to, "documenting the minimal set of tests someone should write for a feature," I'm interested in helping with that. I think it will take some more thought, and the discussion probably doesn't belong on this thread. I'll close this issue, and I'll follow up with another when I have some time for that additional documentation.