Request for comment: CRI Version 1

I've turned everything we have so far into a draft specification: https://github.com/js-reporters/js-reporters/blob/main/spec/cri-draft.adoc

To gain experience, it is currently used in stable releases or production versions of:

js-reporters, provides adapters for Mocha, Jasmine, and QUnit.
QUnit 2.2+ implements the draft spec natively as well.
js-reporters, provides a reusable TAP reporter.
browserstack-runner, an integration reporter by BrowserStack to collect Mocha/Jasmine/QUnit results from the cloud.

Objective

To publish Version 1 of the specification "soon".

This ticket is to track the review round during which we can answer any pending questions or past issues that we want to solve as part of Version 1. I think this area is interesting enough that there is definitely space for lots of ideas and future considerations. But, to deliver a long-term and stable Version 1, I think we need to limit our scope through agreed-upn criteria.

Proposed criteria:

The spec is capable of correctly describing the essential information from test runners.
All specified events and required properties can be meaningfully populated by each of the studied testing frameworks.
We only include optional events or optional properties in the standard, if:
- It is clear for consumers (reporters, and other integrations) what it means when it is set vs not set.
- That the optional property is safe to sometimes be set and sometimes not within a single test run. This would allow easy aggregation, and would means consumers likely don't have to track any kind of state or assumptions between events.
- That the optional property is relevant for at least two known testing frameworks.
We believe it's likely that the ideas we have left for future versions can be implemented on top of this without it constituting a breaking change. That is, they would not impose new requirements for runners or reporters, old runners can transparently feed newer reporters, new runners can transparently feed older reporters.
- If this is not feasible, then we may need to implement a "version" field into the spec, e.g. as part of the runStart event, akin to the TAP specification.

These are admitedly quite strict and narrow. Maybe they're feasible, but if not feel free to suggest changes and widening. It's only meant as a way to start and focus the conversation.

Open questions

[x] 1) What to do about the todo field on the Assertion event data? See https://github.com/js-reporters/js-reporters/issues/105.
[x] 2) Reduce persistance of assertions. Maybe change SuiteEnd per https://github.com/js-reporters/js-reporters/issues/100? Or remove entirely, per https://github.com/js-reporters/js-reporters/issues/126?
[ ] 3) Allow Error objects to be used as-is to communicate failed assertions. See https://github.com/js-reporters/js-reporters/issues/123.

Some challenges with this for node-tap:

Why are "suite" and "test" two separate concepts? Isn't a "suite" just the top-most "test"?
The specification seems to focus on a JS api with events passing between a "runner" object and "suite" or "test" objects. That's not how node-tap's tests are run. All tests are standalone programs that write TAP to stdout. Is there any plan to support a pure text stream style interface for expressing and consuming test/suite/runner events, or is that just a different paradigm altogether?

Interesting, I had not considered those as similar, but you're right. I'd support merging the "test" and "suite" concepts. It might require some compromises, but I think it's worth giving a try.

Currently:

a "test" is the lowest measured unit, typically executed after the phase where a framework will load the test suites and plan their execution. A framework might measure runtime duration around a test, and e.g. invoke hooks (such as "before" or "after" each hooks). These aren't directly of concern to reporters, but helps illustrate what they relate to.
a "suite" is used for grouping tests, and suites can be nested.
the top-most suite is referred to as a "run".

I have not previously seen testing frameworks that allow nesting within their equivalent of the "test" concept. I see that node-tap does allow this. As such, node-tap's tests are effectively a hybrid. Like a "suite" that can also directly contain assertions.

The main issue I see with merging these concepts is that it might affect support for "planning" the execution of a run. Right now, we specify a structure that reporters can read that describes how many tests will execute and in which suites they are grouped, before execution begins.

@isaacs How does this work with node-tap? The TAP spec defines a "plan" (spec), which Mocha, QUnit, and Jasmine effectively implement by waiting until all test files have been loaded, seeing what "suites" have been called, but not actually executing the "test" portions yet. It will then know this information, and then can start execution.

I'm generally in favour of removing the structured plan (e.g. the whole suite structure) since this is afaik never used in practice by reporters. A stream-based approach seems preferable there. However an overall count of planned tests seems useful to keep, allowing reporters to track and communicate a sense of progress toward completion.

ftr, tape also allows nesting tests (groups of assertions), and has no concept of "suites". In tape, you can t.plan(n), but you can also call t.end() or return a promise, because no plan is needed in a synchronous test, or when the promise can signal completion.

In other words, tape simply doesn't have a full count of tests until after the full run is completed.

Yeah, tape is sort of in the TAP family of test paradigms (along with ava, to some extent), so most of what's said about node-tap would apply to tape as well.

How does this work with node-tap? The TAP spec defines a "plan" (spec), which Mocha, QUnit, and Jasmine effectively implement by waiting until all test files have been loaded, seeing what "suites" have been called, but not actually executing the "test" portions yet. It will then know this information, and then can start execution.

The tap runner does not load all the test files ahead of time. All tests are run as subprocesses to ensure a clean environment, and their TAP stream is consumed from the child process stdout. A pool of tests are loaded in parallel, by default up to the count of CPUs on the system. As tests finish, new ones are kicked off, but the output order is always consistent (if you specify -R tap, at least).

Within a given test file (ie, in the child process), the object returned by require('tap') has a t.plan() method which can be used to specify a number of top-level tests to expect, but the default behavior is to end automatically once all tests are completed and the process ends.

tap.Test objects (including the root test object returned by require('tap')) are created by calling t.test(name, (t) => ...) and passed as an argument to the function. The function can either call t.end() explicitly at some point, use t.plan() to specify a number of test points to expect, or return a Promise. They can also create subtests, which can create subtests, etc, to any arbitrary depth.

The top-level tap test object doesn't have to create subtests. This is a perfectly valid test script in node-tap:

const t = require('node-tap')
t.pass('this is fine')

If they call t.end() synchronously, or specify a plan which is immediately resolved, tests are run in a single tick, immediately moving on to the next test. So for example:

// all console.error calls will happen in order
const t = require('tap')
console.error('one')
t.test('child test', t => {
  console.error('two')
  t.pass('okie dokie')
  console.error('three')
  t.end()
  console.error('four')
})
console.error('five')
t.test('second child', t => {
  console.error('six')
  t.pass('fine')
  console.error('seven')
  t.end()
  console.error('eight')
})
console.error('nine')

Will produce the output:

$ node t.js
one
TAP version 13
two
# Subtest: child test
    ok 1 - okie dokie
three
    1..1
four
ok 1 - child test # time=2.915ms

five
six
# Subtest: second child
    ok 1 - fine
seven
    1..1
eight
ok 2 - second child # time=0.761ms

nine
1..2
# time=9.445ms

The runner works by getting a list of all the test files that it needs to run, and then calling t.spawn() to run them all from the root tap object (the same object that test files, albeit in a different process). If a reporter is used (other than -R tap of course) then the output of that root tap object is sent to the reporter rather than dumping to stdout. (-j1 means "only one job, don't do the buffering/parallelization stuff", trying to keep it simple here.)

$ tap -Rtap t.js -j1
TAP version 13
one
# Subtest: t.js
two
three
    # Subtest: child test
        ok 1 - okie dokie
        1..1
four
five
    ok 1 - child test # time=2.849ms

six
seven
eight
    # Subtest: second child
        ok 1 - fine
nine
        1..1
    ok 2 - second child # time=0.629ms

    1..2
    # time=9.076ms
ok 1 - t.js # time=9.076ms

1..1
# time=413.414ms

So, really, there is no difference between a "suite" and a "test", we don't often know ahead of time how many tests there will be, and tests can have subtests so really either there's no such thing as a "suite" or it's suites all the way down, and the objects that this spec talks about emitting events for one another are all in different process memory spaces.

In the default reporter, I do report the number of "suites", but that's just the number of test files, which is an arbitrary distinction to draw. All the communication happens over text streams that speak and parse TAP.

@isaacs Thank you, that helps!

JS API

The specification seems to focus on a JS api with events passing between a "runner" object and "suite" or "test" objects.

The CRI spec isn't meant to imply JavaScript. The repo's name and heritage do come from that background, though, that's fair. It's important that we (at least) succeed in the JS space to unify the many non-interopable test frameworks, runners, and reporters. But, like TAP, I think CRI could be environment agnostic, specifying only a stream of information.

Join forces?

TAP focusses on the transmission protocol for the (virtual) wire, CRI focusses more on how the data gets on and off that same wire.

In TAP, the line protocol is concrete and the transmission method is left unspecified. In practice, I believe one generally expects to see separate proceses connected through a Unix pipe. When things are not inter-process, though, one could presumably agree unofifically on exposing the runner's writable stream, and exposing the reporter's consumer of the readable stream, and plug-in something other than stdin/stdout. But this is not standardised. CRI aims to close that gap.

I love the Unix philosophy, and this makes perfect sense for Node.js and other CLI contexts, especially when there's a single test run, and a single reporter. So far, I've not had a good sense of how this can be made to work outside that context in a way that is still interoperable. For example:

In a browser, with a reporter using the DOM to display progress about a test running on that same web page. The two systems need to get connected with each other, and the reporter needs to understand the runner's data.
In a more complex browser setup, like Karma or BrowserStack, where multiple browsers are launched, and one or more test runs take place. There is typically a small client injected onto the page, which acts as a reporter that will discover the test framework, capture its information, and beacon it off to a server (e.g. the Node.js process that launched the browser might also have web server for receiving this information). Then on the CLI server, the server acts as a producer replaying the information for reporters to consume. These may then pretty-print to stdout, produce artefacts on disk, forward to another service somewhere, etc.

In the CRI draft we've described what is essentially a callback mechanism or minimal event emitter, with an abstract method that should exist for connecting a reporter (consumer), which then gets called with the described structured data in a language-native fashion.

Once everything is connected, the format of the data is currently a weak point in CRI. Perhaps we should move away from tryingto describe the structure in a language-neutral way and just say that it is "text", e.g. JSON-encoded, or even TAP. This would decoupled CRI and remove any ambiguity about how structures might be represented. if there's more than one way to do so in a given langage. It'd be up to the reporter to decide how they parse the stream.

My dream for CRI is that any test reporter can be declaratively plugged into any test framework or test orchestrator (BrowserStack, Karma, ..). Thus for the many reporters that produce artefact files, upload stuff, pretty-print etc to be re-usable, and for HTML reporters to be re-usable as well.

The below is JS-specific, but as concrete outcome for the JS ecosystem, I consider it sucess if we can:

Declare to a an orchestrator in code-free configuration: Launch browser X, open test/foo.html, inject your magic and call SomeTestFramework.someMethod to get everything you need. The only thing it needs to know about the test framework of choice is how to open the stream.
Similarly, for regular browser contexts (not headless), one should be able to load a reporter module and plug it into the test framework with no more than a single function call. E.g. SomeTestFramework.someMethod(FooReporter), FooReporter.consume(SomeTestFramework), or SomeTestFramework.readable.pipeTo(foo.writable)

I haven't considered TAP for this, but I am open to it. I do think we'd most likely still need something within this spec to adress the above, but we could certainly leverage TAP. I think that's worth pursuing. We may need to solve one or two issues upstream in that case, but I wouldn't mind helping with that! E.g. regarding nesting of tests, and (optional) test duration, and details of passed assertions. We might be able to do that as an extension in an ignored comment.

@isaacs I see this that "extended TAP" is the approach node-tap has taken. Could you share a bit about how well that has gone so far, in particular whether there is work underway on potentially formalising a common way for the extended data to be formatted so that e.g. a tap parser would be adopting that and providing it as optional a key-value object to consumers of that parser. Is that a direction that would make sense?

I'm also welcome any thoughts or feedaback on the CRI challenge of connecting consumers to producers.

Misc

Regarding test plans - I didn't realize the test plan "header" could be at the end of a TAP stream. That makes a lot of sense (since TAP 12).

Regarding nested tests - Yep, we need to support this. I consider it a requirement to be able to represent the structure of node-tap and tape with little to no loss in detail/semantics. I filed https://github.com/js-reporters/js-reporters/issues/126 for this.

@ljharb I think this is now addressed as part of https://github.com/js-reporters/js-reporters/pull/129. Could you take another look?

It seems OK, but it's hard to know without some kind of programmatic verification tooling/testing.

Thanks for all the input. Continuing the conversation at https://github.com/js-reporters/js-reporters/issues/133. Next I'll see what we can do with Karma runner, BrowserStack, and in-browser reporting from frameworks the don't yet support TAP there today.

qunitjs / js-reporters