Closed · orta closed this issue 4 years ago
I really like the idea of a workbench.
I think it's sufficient for the markup language to be able to show the setup behavior without prescribing the expected output; the author can describe in prose what they don't like about it. We've found that `emitShouldContain`/etc style tests are usually too sensitive to "spelling" effects (e.g. `Array<T>` vs `T[]`).
For something like a refactoring, showing the refactorings available (perhaps filtered by a search string?) and their post-render output would be sufficient.
Error message numbers only have "best effort" stability and I'd generally avoid trying to hardcode them. This is also subject to oversensitivity; when we add a new "Here's what we think is going on" error message on top of an existing error message, the observed top-level error code changes. Error positions are also subject to some movement; this is why I generally lean toward "Let people demonstrate the behavior" rather than "Let people prescribe the behavior". Showing version-to-version changes in the demonstrated behavior is also likely to be more instructive than pass/fail.
What I had in mind is something that is more focused on how to implement this, and specifically independent of the style of tests -- anything goes, as long as it has support in the TS repo.
With this in mind (and therefore nothing about the front end that you'd actually use), the functionality that would be implemented is best left for just the tests that are in the repo -- so in the below I'm talking about fourslash and baseline tests, but if the twoslash thing is added, then that would be another option. (IOW, I see this as independent of the style of tests.)
Take the whole repo, build it, and dump the result in a docker image (almost as is, removing unnecessary stuff, and possibly with some interface script described below).
There would be a classification of test files by their "style" -- baseline or fourslash, and some files would not be classified (tests that are more involved than a one-file thing (like the docker tests), or things that are not test cases by themselves).
The resulting image has two purposes:
`gulp runtests --t="someTest"`
When someone interacts with this, we run `docker run` with that one file mounted in the right place, and with a filename and an argument that makes only that one file run and the results rendered. At this point, the test case could be attached to a PR with some additional human explanation. That explanation could be empty (a test just failed, and it shouldn't have), or it could describe what's wrong with the output in the case of a baseline test (shouldn't get errors, something in the types file should be different, etc.). The bottom line is that after this there is a test case ready to add.
When the attachment to a PR is done, also add a "test case available" label.
Better to attach to an existing PR than to have it create a PR, since I see bugs where people have interest in helping, so this accommodates X submitting a bug and Y adding a test case.
(In the world I came from there would be concern about adding a label only after verifying some signature, but this is all just simplifying what people can write, and no tests would be added automatically anyway, so there's no need for any of that.)
Cool @elibarzilay - this looks like a case of creating a setup where we get working test cases which match our own in the compiler, and it gives us the freedom to cover crashing / OOM / infinite-loop tests.
I'm open to building the front-end to something like this, but I'm wary of building a backend to support this myself. My past experience building something similar was that this will be quite a lot of maintenance and work. If that's something you're up for building and owning, then I can think about how we teach and build out a UI for it in the site.
FWIW, I still prefer my original proposal for a few reasons:
Which is basically a "worse is better" argument. It does a lot of things worse:
But it covers most of the cases and is more convenient.
I have an (unpolished) bug workbench in the staging site:
Yesterday, I got the verification steps working with twoslash code samples as a GitHub Action: https://github.com/microsoft/TypeScript-Twoslash-Repro-Action
This action verifies any code samples in an issue body, or comments against the last 6 versions of TypeScript and the latest nightly build. I just need to work on the reporting aspects of it.
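As a rough illustration of the "find code samples in an issue body" step, here is a minimal sketch in TypeScript. This is not the action's actual code; the function name `extractTSCodeBlocks` and the regex-based approach are assumptions for illustration only.

```typescript
// Hypothetical sketch (not the real action's implementation): pull
// ```ts / ```typescript fenced blocks out of a GitHub issue body so each
// one could be compiled against several TypeScript versions.
function extractTSCodeBlocks(issueBody: string): string[] {
  const blocks: string[] = [];
  // Match a ```ts or ```typescript fence and capture its contents lazily.
  const fence = /```(?:ts|typescript)\r?\n([\s\S]*?)```/g;
  let match: RegExpExecArray | null;
  while ((match = fence.exec(issueBody)) !== null) {
    blocks.push(match[1].trimEnd());
  }
  return blocks;
}

const sampleIssue = [
  "Repro below:",
  "```ts",
  'const a: number = "oops"',
  "```",
  "and a second sample:",
  "```typescript",
  "let b = 1",
  "```",
].join("\n");

console.log(extractTSCodeBlocks(sampleIssue).length); // 2 blocks found
```

The real action presumably does more (deduplication, tracking which comment a block came from), but the extraction step is conceptually this simple.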
Each run keeps track of:
Suggestion: why not have the action bisect releases (if it fails on the most recent release) or nightlies (if it passes on the most recent release but not the most recent nightly), so it can identify whether the issue is a regression and when it was introduced?
Aye, it should support this OOTB today (going from green to red on a new nightly run is just another state change it would detect) - historical results are always kept around; for example, this comment shows some different results over time with different builds of TS.
As it runs every night, you'd get the list of commits between the two nightlies in the comment, which should make it easy to track down.
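For the bisect idea itself, a minimal sketch of the search step: given an ordered list of builds and a predicate that reports whether the repro fails on a given build, binary-search for the first failing one. The names (`firstFailingBuild`, `failsOn`) and the toy version strings are hypothetical, and this assumes behavior flips exactly once (passes, then fails) across the range.

```typescript
// Hypothetical bisect sketch: find the first build on which the repro
// fails, assuming builds are ordered oldest-to-newest and the repro
// passes on older builds and fails on newer ones.
function firstFailingBuild(
  builds: string[],
  failsOn: (build: string) => boolean
): string | undefined {
  let lo = 0;
  let hi = builds.length - 1;
  if (hi < 0 || !failsOn(builds[hi])) return undefined; // never regressed
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (failsOn(builds[mid])) hi = mid; // regression is at mid or earlier
    else lo = mid + 1;                  // regression is after mid
  }
  return builds[lo];
}

// Toy data: pretend the regression landed in nightly "4.1.0-dev.3".
const nightlies = ["4.1.0-dev.1", "4.1.0-dev.2", "4.1.0-dev.3", "4.1.0-dev.4"];
const broken = new Set(["4.1.0-dev.3", "4.1.0-dev.4"]);
console.log(firstFailingBuild(nightlies, (b) => broken.has(b))); // "4.1.0-dev.3"
```

In practice `failsOn` would install that nightly and compile the repro, so the O(log n) number of probes matters.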
We want to provide better tools for providing reproductions of TypeScript bugs in issues. In an issue we really only have text to work with, so we need some kind of text representation of a TypeScript project.
Classes of issues
Looking through the bugs label, there are a large number of cases we can probably address:
There are a few we probably can't:
I'm interested in expanding either of these lists! ^
Possible solution
Two parts:
- We use something like `arcanis/sherlock` to run any issues with repros on a nightly-ish schedule with different builds of TypeScript and store the results using something like this - this runs TypeScript code with markup which is built for repros
- A page on the website which helps you generate a single TypeScript codeblock which is a reproduction of a bug
Possible Markup
We've been working on a TypeScript markup language for the website and for this, called ts-twoslash, which adds commands inside comments. I'd recommend reading the README for that before expanding some of these examples.
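To make the directive style concrete before the examples: twoslash commands are comment lines of the form `// @name: value` (or bare `// @name` for a boolean flag). A minimal sketch of splitting those directives from the code follows; this is not the real twoslash parser (which, among other differences, only honors directives at the top of a sample), and `splitDirectives` is a hypothetical name.

```typescript
// Hypothetical sketch (not the real ts-twoslash parser): separate
// `// @name: value` directive comments from the ordinary TypeScript code.
function splitDirectives(sample: string): {
  options: Record<string, string>;
  code: string;
} {
  const options: Record<string, string> = {};
  const codeLines: string[] = [];
  for (const line of sample.split("\n")) {
    const m = /^\/\/\s*@(\w+)(?::\s*(.*))?$/.exec(line.trim());
    if (m) options[m[1]] = m[2] ?? "true"; // bare `// @showEmit` means true
    else codeLines.push(line);
  }
  return { options, code: codeLines.join("\n").trim() };
}

const parsed = splitDirectives(
  [
    "// @showEmit",
    "// @showEmittedFile: index.d.ts",
    'export const myVariable = "123123"',
  ].join("\n")
);
console.log(parsed.options.showEmittedFile); // "index.d.ts"
```

The point is just that the sample stays a single valid TypeScript code block, with the test configuration riding along in comments.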
Bad JS Emit
```ts
// @showEmit
// @emitShouldContain: this string
// @emitShouldNotContain: this other string
console.log("hello world")
```

Bad DTS Emit
```ts
// @showEmit
// @showEmittedFile: index.d.ts
// @emitShouldContain: 123
// @emitShouldNotContain: 456
export const myVariable = "123123"
```

This code should be this type
```ts
// @noImplicitAny: false
const a = 123
//    ^? = const a: 123
```

or

```ts
// @noImplicitAny: false
const a = 123
//    ^? != const a: any
```

This code is slow
```ts
// @slow
const a = 123
//    ^?
```

This annotation could indicate that it should fail if it takes over 1s.

This code crashes
```ts
const a = 123
a.l
```

In theory there shouldn't need to be a crashes annotation; no code should crash. I'm open to debate on why that might need to be different though.

This code should error
```ts
// @expectErrors: 4311
const a = "hello"
a * a
```

This code should not error
```ts
const a = "hello"
a * a
```

The lack of an error annotation should indicate this. If some errors are expected, then it can use `@errors: 3211`, which is already set up.

This refactor didn't do what I expected
I'm still a bit hazy on this; in fourslash there are a lot of options here - but to try to provide a really high-level example:

```ts
const a = "hello"
a * a
// ^^^^^! refactor
```

This could create a JSON dump of all the refactoring options, maybe?

Web Page
We have been working towards having the playground support a custom set of tabs in the sidebar, it's not a far reach to imagine that an editor for twoslash could be the playground but with tabs focused on running ts-twoslash and getting useful results.
Here's a rough idea of what that could feel like:
The blockers on that: