Add line wrapping / breaking to emitter

iansan5653 commented 4 years ago

Currently, the emitter seems to have no option or support for breaking lines at a certain length. If a parameter description has a 500-character block of text, all of that text will be on the same line. It would be really useful to have some sort of parameter or setting to automatically wrap text when a certain number of characters is reached in a line.

octogonz commented 4 years ago

👍 Yes, we should definitely implement this! Doing so would enable a tool like API Extractor to normalize the comment format when it writes a .d.ts file for release.

Another idea would be to emit the DocNodeKind.SoftBreak as a newline. If I remember right, the reason this wasn't implemented is that trimSpacesInParagraphNodes() discards the soft breaks when it is normalizing the text. We would need to pass them along somehow as hints to the emitter.

octogonz commented 4 years ago

When I started working on this, I realized that the emitter should probably support three separate modes:

verbatim: Emit the AST "as-is" without any reformatting. This would be used e.g. by a refactoring tool that wants to rename a @param name without disturbing anything else. It should perhaps be the default. It's simple to support since the AST already captures all whitespace; we simply need to disable the trimSpacesInParagraphNodes() transform.
trim spaces: This is the current behavior, where unnecessary newlines are discarded, and the lines are unwrapped. The main value is that it makes it easier to emit Markdown correctly, since Markdown engines tend to misinterpret extra spaces/newlines. But when emitting .ts comments (instead of .md files), it produces ugly long lines as you pointed out.
trim spaces and word wrap: This would add some extra logic to trimSpacesInParagraphNodes() that re-wraps the paragraphs to a specified column. It could be most useful for a code prettifier, or to prettify generated output.
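To make the three modes concrete, here is a minimal sketch operating on the raw text of a single paragraph. All names (`EmitMode`, `emitParagraph`, `wrapWords`) are hypothetical and are not part of the `@microsoft/tsdoc` API; a real implementation would work on the AST and would have to skip nodes that cannot be re-wrapped.

```typescript
// Hypothetical names throughout; this is not the @microsoft/tsdoc API.
type EmitMode = "verbatim" | "trimSpaces" | "trimSpacesAndWordWrap";

/** Greedy word wrap: breaks only between words, never inside one. */
function wrapWords(words: string[], maxColumn: number): string[] {
  const lines: string[] = [];
  let current = "";
  for (const word of words) {
    if (current.length === 0) {
      current = word;
    } else if (current.length + 1 + word.length <= maxColumn) {
      current += " " + word;
    } else {
      lines.push(current);
      current = word;
    }
  }
  if (current.length > 0) {
    lines.push(current);
  }
  return lines;
}

/** Emits one paragraph of text according to the selected mode. */
function emitParagraph(text: string, mode: EmitMode, maxColumn: number = 80): string {
  if (mode === "verbatim") {
    // Keep every space and newline exactly as authored.
    return text;
  }
  // Both trimming modes collapse runs of whitespace (including newlines).
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  if (mode === "trimSpaces") {
    // Current behavior: the paragraph is unwrapped into one long line.
    return words.join(" ");
  }
  // Proposed behavior: re-wrap the normalized words to the target column.
  return wrapWords(words, maxColumn).join("\n");
}
```

For example, `emitParagraph("The  quick brown\nfox   jumps over the lazy dog", "trimSpacesAndWordWrap", 15)` yields three lines: `The quick brown`, `fox jumps over`, and `the lazy dog`. A word longer than `maxColumn` is left on its own line rather than split.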

@iansan5653 I'm wondering, would verbatim be more appropriate for your application than trim spaces and word wrap?

octogonz commented 4 years ago

Also note that word-wrap probably cannot be applied to other sections such as DocCodeSpan or (in the future) markdown headings, unless we introduce a comment-wrapping operator like proposed in RFC #166.

iansan5653 commented 4 years ago

> would verbatim be more appropriate for your application than trim spaces and word wrap

I don't think so, unless I misunderstand what you're asking - I'd like to make a tool that, no matter how the input content is formatted, always produces the same standardized output. Lines wrapped, tags in the same order, newlines where they should be and not where they shouldn't, etc.

octogonz commented 4 years ago

> I'd like to make a tool that, no matter how the input content is formatted, always produces the same standardized output. Lines wrapped, tags in the same order, newlines where they should be and not where they shouldn't, etc.

👍 Got it. I'll see if I can implement this. I started work on it over the weekend, but ran into some deeper architectural questions that I need to think about before I write too much code.

rbuckton commented 4 years ago

I don't think this comment is entirely accurate:

> [...] Markdown engines tend to misinterpret extra spaces/newlines. [...]

Markdown doesn't "misinterpret" extra spaces/newlines. Rather, Markdown is a whitespace-significant language and has very specific behavior with regard to the amount of whitespace and the number of newlines that it encounters. Here are a few common examples:

Text indented 4 (or more) spaces from the start of a line is considered a code block, without the need for ``` fences:

    this is code

Single newlines are considered insignificant when in a paragraph, but significant in bulleted lists, tables, pullquotes, etc.:

These lines
are a single line.

- But these
- Are different

> These lines
> are also a single line but the leading `>` 
> of the subsequent lines is ignored.

| Col 1 | Col 2 |
|:--|:--|
| Single-lines are important | for Markdown Tables |

Double newlines are considered a new paragraph (i.e., <p>):
```
These are

Different paragraphs
```
Two space characters followed by a single newline character is considered a hard break (i.e., <br>):
```
There is a hard break here:  
Making this a new line in the same paragraph.
```
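As a rough illustration of why an emitter cannot blindly re-wrap Markdown text, the sketch below flags individual lines whose whitespace carries meaning under the rules listed above. The names `LineKind` and `classifyLine` are made up for this example, and the check is deliberately simplified: it looks at one line in isolation, whereas CommonMark also considers context (for instance, an indented line that continues a paragraph is not a code block).

```typescript
// Hypothetical helper; a simplified, context-free approximation of the rules above.
type LineKind = "blank" | "indentedCode" | "hardBreak" | "plain";

function classifyLine(line: string): LineKind {
  if (line.trim().length === 0) {
    return "blank"; // a blank line separates paragraphs
  }
  if (/^( {4}|\t)/.test(line)) {
    return "indentedCode"; // 4+ leading spaces (or a tab): candidate code block
  }
  if (/ {2,}$/.test(line)) {
    return "hardBreak"; // two or more trailing spaces: hard line break
  }
  return "plain"; // a trailing newline here is only a soft break
}
```

Only lines classified as `plain` would be safe candidates for joining and re-wrapping under this simplified model.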

In my opinion, the best thing to do is to parse out the TSDoc specific syntax (@ tags and {} inlines, etc.) and trim the leading * from each line in a doc comment, but preserve the rest essentially verbatim.

You can find the latest specification for commonmark (the Markdown spec that Github-Flavored Markdown is based on) here: https://spec.commonmark.org/0.29/

rbuckton commented 4 years ago

Note that whitespace is also significant inside a pullquote:

> line 1
>
>     code
>
> line 2

line 1
code
line 2

octogonz commented 4 years ago

> In my opinion, the best thing to do is to parse out the TSDoc specific syntax (@ tags and {} inlines, etc.) and trim the leading * from each line in a doc comment, but preserve the rest essentially verbatim

@rbuckton We started with this idea. However, it conflicts with two of TSDoc's overarching goals:

Unambiguous syntax: Every tool should agree about how TSDoc syntax is parsed. (Semantic differences are okay, though. For example: skipping some unsupported tags, or rendering tags differently.)
Predictable rendering: Users should be able to accurately predict how their input will get parsed, without relying on a preview. Traditional Markdown is usually authored with an interactive preview window, but TSDoc may not have this luxury, e.g. when reviewing PRs on GitHub, or if the DocFX pipeline runs only after a PR is merged.

I originally thought that CommonMark would address these concerns. But it doesn't. CommonMark has many gotchas where an expression gets parsed unexpectedly. And as a unifying "standard", CommonMark turned out to be a standard that nobody actually implements: Every single Markdown engine adds its own proprietary grammar extensions that, when used, can cause an entire input to be misinterpreted by other engines.

As evidence, consider this code:

| Col1  | Col2 |
| --- | --- |
| `{@link X | Y}` |
| {@link X | Y}  |

How it gets parsed:

CommonMark sees it as plain text, with a single <code>{@link X | Y}</code> in the middle
Jekyll (which uses Kramdown) sees it as a table with <code>{@link X | Y}</code> all in one cell, and the second row is split into {@link X and Y.
GitHub sees it as a table with the non-code `{@link X in the first cell, and Y}` in the second cell.

(There are endless examples like this. If you put ``` on the first line, some engines treat the whole file as code, others treat the whole file as not code.)

TSDoc's concern: Is there a `@link` tag in this comment or not?

In the above situation, this question has no clear answer. We considered mitigating this by modeling TSDoc as a preprocessor that grabs its tags in a simple-minded way and then passes along the remaining content uninterpreted, with no attempt at consistency between tools. But even if consistency doesn't matter (I believe it does), we found that the resulting grammar was highly counterintuitive. It wasn't a pleasant authoring experience.
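A quick sketch of what such a "simple-minded" preprocessor might look like, applied to the table example above. The function name `findLinkTagsNaively` and the regular expression are illustrative assumptions, not something TSDoc ever shipped.

```typescript
// Hypothetical naive preprocessor: grabs anything shaped like {@link ...}
// without knowing about code spans, tables, or any other Markdown context.
function findLinkTagsNaively(comment: string): string[] {
  const matches = comment.match(/\{@link\s+[^}]*\}/g);
  return matches === null ? [] : matches;
}

const tableExample = [
  "| Col1  | Col2 |",
  "| --- | --- |",
  "| `{@link X | Y}` |",
  "| {@link X | Y}  |",
].join("\n");

// Reports two tags, including the one that sits between backticks.
console.log(findLinkTagsNaively(tableExample));
```

The preprocessor reports a link inside what the author most likely intended as a code span, which is one way the resulting behavior ends up counterintuitive.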

Thus, after a very long discussion we came to the opinion that TSDoc should have its own "TSDoc-flavored-Markdown" with the following properties:

A very conservative subset of Markdown constructs that is sufficient for everyday API documentation needs
Accept certain nonstandard grammar simplifications if they make the rendering more predictable
If possible, try to minimize any reliance on nesting blocks and whitespace-based rules
Recommend that people use HTML tags for any complex/nesting structures (e.g. tables)
Extensions to TSDoc must not alter the grammar; instead, extensions are restricted to custom tags or HTML elements that everyone can parse

In practice this has worked very well. I still haven't gotten around to adding basic features like boldface, headers, bullets, etc. -- which we do want to support -- but already people have written A LOT of very good documentation with relatively few complaints about missing constructs. Doc comments embedded in source code really don't need a whole lot of bells and whistles, it seems.

So, to recap: When you use API Documenter for example, your TSDoc-flavored-Markdown gets fully parsed into an AST. Later, when the MarkdownEmitter writes the .md/.yml output file, it is very thoroughly escaped to ensure the emitted Markdown correctly captures TSDoc's interpretation. We really do not want any Markdown extensions to work unless they are part of the TSDoc-flavored-Markdown grammar.
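To illustrate the escaping idea, here is a minimal sketch. The function `escapeMarkdownText` and its character set are assumptions made for this example; it is not the actual MarkdownEmitter code, which may escape a different set of characters or use a different strategy.

```typescript
// Hypothetical escaper: backslash-escapes punctuation that a Markdown engine
// could otherwise interpret as syntax, so plain text stays plain text.
function escapeMarkdownText(text: string): string {
  return text.replace(/[\\`*_{}\[\]<>|#]/g, (ch) => "\\" + ch);
}

// The pipe and backticks from the table example are neutralized.
console.log(escapeMarkdownText("`{@link X | Y}`"));
```

The design choice is to escape conservatively: an unnecessary backslash before punctuation is harmless in CommonMark, while a missing one can change how the whole line is parsed.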

microsoft / tsdoc