Move to Electron's docs-parser tooling for Node.js documentation

bnb commented 4 years ago

Over the past four months, I've gotten more and more involved in Electron as a project. Generally, they are extremely forward on automation and tooling intended to reduce maintainer burden since they are such a small team maintaining such a large project.

One of the tools they created, docs-parser, is especially interesting and I think Node.js could benefit from adopting it as a fundamental piece of our documentation tooling.

Why

I see quite significant benefits to adopting docs-parser:

More consistency in Node.js documentation. Docs parser requires you to write in according to the documentation styleguide that Electron uses. As a reader and consumer I've never had a particularly good experience using our docs, and have . Whether it's the hierarchical improvements or the relative consistency from section to section, docs-parser's enforced writing style helps ensure that our users' experience while consuming documentation is consistent.
Straightforward detection of missing context. In the example above, you can see that each of the properties on the object returned by getHeapStatistics() have an empty "description" property. This emptiness is extremely useful in finding places where additional context can be added. For example, it's trivial to write a tool that checks for empty "description" properties. This provides an excellent path forward to enriching our documentation with context that's useful to users who don't have the same understanding that we do.
Potential internal tooling to reduce maintainer burden. Electron doesn't just ship this by itself. They have two other tools, typescript-definitions and archaeologist which - when used together - provide a GitHub check that surfaces a representation of the documentation API as a .d.ts, diffing the current PR's representation with the representation of the documentation in master. This provides a way for maintainers to parse out the changes to APIs in code in addition to just personally reading the docs themselves.

How does docs-parser differ from what we currently have?

Document Structure

docs-parser requires a specific document structure that is currently documented in the Electron Styleguide API Reference section.

This structure has specific expectations about titles, descriptions, modules, methods, events, classes (undocumented here: multi-class mode, which would be needed in Node.js and is currently supported by docs-parser), static properties, instance methods, and instance properties.

I've worked on converting Node.js's querystring, v8, and worker-threads docs in a personal repo which you can find here in the docs/api directory: bnb/node-docs-parser. Please take a look if you're interested in what the differences are - the first commit on each file was the state I started with and the final commit in each is the docs-parser version. Additionally, there's a directory with the original versions that you can compare side-by-side if you'd prefer to approach it that way.

In doing this I found that - while a few things did need to be shuffled around to correctly parse - the overall updated structure was more clear and approachable with minimal additional effort.

Actual Markdown

docs-parser requires a comparatively strict structure around markdown input, since it directly parses markdown.

Here's an example from Node.js:

## `querystring.escape(str)`
<!-- YAML
added: v0.1.25
-->

* `str` {string}

The `querystring.escape()` method performs URL percent-encoding on the given
`str` in a manner that is optimized for the specific requirements of URL
query strings.

The `querystring.escape()` method is used by `querystring.stringify()` and is
generally not expected to be used directly. It is exported primarily to allow
application code to provide a replacement percent-encoding implementation if
necessary by assigning `querystring.escape` to an alternative function.

And here's the current equivalent in docs-parser:

### `querystring.escape(str)`

- `str` String

The `querystring.escape()` method performs URL percent-encoding on the given
`str` in a manner that is optimized for the specific requirements of URL
query strings.

The `querystring.escape()` method is used by `querystring.stringify()` and is
generally not expected to be used directly. It is exported primarily to allow
application code to provide a replacement percent-encoding implementation if
necessary by assigning `querystring.escape` to an alternative function.

They seem nearly identical, and indeed they basically are. This is a good example of how minor some of the necessary changes are. Here's another slightly more complicated example:

Node.js version:

### `v8.getHeapStatistics()`

Returns `Object`

* `total_heap_size` number
* `total_heap_size_executable` number
* `total_physical_size` number
* `total_available_size` number
* `used_heap_size` number
* `heap_size_limit` number
* `malloced_memory` number
* `peak_malloced_memory` number
* `does_zap_garbage` number
* `number_of_native_contexts` number
* `number_of_detached_contexts` number

`does_zap_garbage` is a 0/1 boolean, which signifies whether the
`--zap_code_space` option is enabled or not. This makes V8 overwrite heap
garbage with a bit pattern. The RSS footprint (resident memory set) gets bigger
because it continuously touches all heap pages and that makes them less likely
to get swapped out by the operating system.

`number_of_native_contexts` The value of native_context is the number of the
top-level contexts currently active. Increase of this number over time indicates
a memory leak.

`number_of_detached_contexts` The value of detached_context is the number
of contexts that were detached and not yet garbage collected. This number
being non-zero indicates a potential memory leak.

<!-- eslint-skip -->
` ` `js
{
  total_heap_size: 7326976,
  total_heap_size_executable: 4194304,
  total_physical_size: 7326976,
  total_available_size: 1152656,
  used_heap_size: 3476208,
  heap_size_limit: 1535115264,
  malloced_memory: 16384,
  peak_malloced_memory: 1127496,
  does_zap_garbage: 0,
  number_of_native_contexts: 1,
  number_of_detached_contexts: 0
}
` ` `

docs-parser version:

### `v8.getHeapStatistics()`

Returns `Object`

* `total_heap_size` number
* `total_heap_size_executable` number
* `total_physical_size` number
* `total_available_size` number
* `used_heap_size` number
* `heap_size_limit` number
* `malloced_memory` number
* `peak_malloced_memory` number
* `does_zap_garbage` number
* `number_of_native_contexts` number
* `number_of_detached_contexts` number

`does_zap_garbage` is a 0/1 boolean, which signifies whether the
`--zap_code_space` option is enabled or not. This makes V8 overwrite heap
garbage with a bit pattern. The RSS footprint (resident memory set) gets bigger
because it continuously touches all heap pages and that makes them less likely
to get swapped out by the operating system.

`number_of_native_contexts` The value of native_context is the number of the
top-level contexts currently active. Increase of this number over time indicates
a memory leak.

`number_of_detached_contexts` The value of detached_context is the number
of contexts that were detached and not yet garbage collected. This number
being non-zero indicates a potential memory leak.

<!-- eslint-skip -->
` ` `js
{
  total_heap_size: 7326976,
  total_heap_size_executable: 4194304,
  total_physical_size: 7326976,
  total_available_size: 1152656,
  used_heap_size: 3476208,
  heap_size_limit: 1535115264,
  malloced_memory: 16384,
  peak_malloced_memory: 1127496,
  does_zap_garbage: 0,
  number_of_native_contexts: 1,
  number_of_detached_contexts: 0
}
` ` `

However, docs-parser has an interesting technical benefit. Like our current setup, it outputs JSON. Compare the two following JSON outputs:

Node.js JSON output:

{
          "textRaw": "`v8.getHeapStatistics()`",
          "type": "method",
          "name": "getHeapStatistics",
          "meta": {
            "added": [
              "v1.0.0"
            ],
            "changes": [
              {
                "version": "v7.2.0",
                "pr-url": "https://github.com/nodejs/node/pull/8610",
                "description": "Added `malloced_memory`, `peak_malloced_memory`, and `does_zap_garbage`."
              },
              {
                "version": "v7.5.0",
                "pr-url": "https://github.com/nodejs/node/pull/10186",
                "description": "Support values exceeding the 32-bit unsigned integer range."
              }
            ]
          },
          "signatures": [
            {
              "return": {
                "textRaw": "Returns: {Object}",
                "name": "return",
                "type": "Object"
              },
              "params": []
            }
          ],
          "desc": "<p>Returns an object with the following properties:</p>\n<ul>\n<li><code>total_heap_size</code> <a href=\"https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Number_type\" class=\"type\">&lt;number&gt;</a></li>\n<li><code>total_heap_size_executable</code> <a href=\"https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Number_type\" class=\"type\">&lt;number&gt;</a></li>\n<li><code>total_physical_size</code> <a href=\"https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Number_type\" class=\"type\">&lt;number&gt;</a></li>\n<li><code>total_available_size</code> <a href=\"https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Number_type\" class=\"type\">&lt;number&gt;</a></li>\n<li><code>used_heap_size</code> <a href=\"https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Number_type\" class=\"type\">&lt;number&gt;</a></li>\n<li><code>heap_size_limit</code> <a href=\"https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Number_type\" class=\"type\">&lt;number&gt;</a></li>\n<li><code>malloced_memory</code> <a href=\"https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Number_type\" class=\"type\">&lt;number&gt;</a></li>\n<li><code>peak_malloced_memory</code> <a href=\"https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Number_type\" class=\"type\">&lt;number&gt;</a></li>\n<li><code>does_zap_garbage</code> <a href=\"https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Number_type\" class=\"type\">&lt;number&gt;</a></li>\n<li><code>number_of_native_contexts</code> <a href=\"https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Number_type\" class=\"type\">&lt;number&gt;</a></li>\n<li><code>number_of_detached_contexts</code> <a href=\"https://developer.mozilla.org/en-US/docs/Web/JavaScript/Data_structures#Number_type\" class=\"type\">&lt;number&gt;</a></li>\n</ul>\n<p><code>does_zap_garbage</code> is a 0/1 boolean, which signifies whether the\n<code>--zap_code_space</code> option is enabled or not. This makes V8 overwrite heap\ngarbage with a bit pattern. The RSS footprint (resident memory set) gets bigger\nbecause it continuously touches all heap pages and that makes them less likely\nto get swapped out by the operating system.</p>\n<p><code>number_of_native_contexts</code> The value of native_context is the number of the\ntop-level contexts currently active. Increase of this number over time indicates\na memory leak.</p>\n<p><code>number_of_detached_contexts</code> The value of detached_context is the number\nof contexts that were detached and not yet garbage collected. This number\nbeing non-zero indicates a potential memory leak.</p>\n<!-- eslint-skip -->\n<pre><code class=\"language-js\">{\n  total_heap_size: 7326976,\n  total_heap_size_executable: 4194304,\n  total_physical_size: 7326976,\n  total_available_size: 1152656,\n  used_heap_size: 3476208,\n  heap_size_limit: 1535115264,\n  malloced_memory: 16384,\n  peak_malloced_memory: 1127496,\n  does_zap_garbage: 0,\n  number_of_native_contexts: 1,\n  number_of_detached_contexts: 0\n}\n</code></pre>"
        },

docs-parser JSON output:

      {
        "name": "getHeapStatistics",
        "signature": "()",
        "description": "* `total_heap_size` number\n* `total_heap_size_executable` number\n* `total_physical_size` number\n* `total_available_size` number\n* `used_heap_size` number\n* `heap_size_limit` number\n* `malloced_memory` number\n* `peak_malloced_memory` number\n* `does_zap_garbage` number\n* `number_of_native_contexts` number\n* `number_of_detached_contexts` number\n\n`does_zap_garbage` is a 0/1 boolean, which signifies whether the `--zap_code_space` option is enabled or not. This makes V8 overwrite heap garbage with a bit pattern. The RSS footprint (resident memory set) gets bigger because it continuously touches all heap pages and that makes them less likely to get swapped out by the operating system.\n\n`number_of_native_contexts` The value of native_context is the number of the top-level contexts currently active. Increase of this number over time indicates a memory leak.\n\n`number_of_detached_contexts` The value of detached_context is the number of contexts that were detached and not yet garbage collected. This number being non-zero indicates a potential memory leak.\n\n<!-- eslint-skip -->",
        "parameters": [],
        "returns": {
          "collection": false,
          "type": "Object",
          "properties": [
            {
              "name": "total_heap_size",
              "description": "",
              "required": true,
              "additionalTags": [],
              "collection": false,
              "type": "number"
            },
            {
              "name": "total_heap_size_executable",
              "description": "",
              "required": true,
              "additionalTags": [],
              "collection": false,
              "type": "number"
            },
            {
              "name": "total_physical_size",
              "description": "",
              "required": true,
              "additionalTags": [],
              "collection": false,
              "type": "number"
            },
            {
              "name": "total_available_size",
              "description": "",
              "required": true,
              "additionalTags": [],
              "collection": false,
              "type": "number"
            },
            {
              "name": "used_heap_size",
              "description": "",
              "required": true,
              "additionalTags": [],
              "collection": false,
              "type": "number"
            },
            {
              "name": "heap_size_limit",
              "description": "",
              "required": true,
              "additionalTags": [],
              "collection": false,
              "type": "number"
            },
            {
              "name": "malloced_memory",
              "description": "",
              "required": true,
              "additionalTags": [],
              "collection": false,
              "type": "number"
            },
            {
              "name": "peak_malloced_memory",
              "description": "",
              "required": true,
              "additionalTags": [],
              "collection": false,
              "type": "number"
            },
            {
              "name": "does_zap_garbage",
              "description": "",
              "required": true,
              "additionalTags": [],
              "collection": false,
              "type": "number"
            },
            {
              "name": "number_of_native_contexts",
              "description": "",
              "required": true,
              "additionalTags": [],
              "collection": false,
              "type": "number"
            },
            {
              "name": "number_of_detached_contexts",
              "description": "",
              "required": true,
              "additionalTags": [],
              "collection": false,
              "type": "number"
            }
          ]
        },
        "additionalTags": []
      },

The latter output has a significantly larger amount of useful context extracted from the same Markdown.

Current Challenges and Potential Blockers

I would like to get a feeling for how folks feel about these potential technical/functional blockers.

[ ] Three elements of metadata that Node.js uses are currently missing from docs-parser: changes, introduced in, and stability.
- Potential solution: I've talked with @MarshallOfSound and it seems that there's potential for adding an extensible metadata section to docs-parser.
[ ] docs-parser does not currently output individual, per-API JSON files.
- Potential solution: this could be PR'ed or extracted in an additional step from the all-in-one file.
[ ] some elements of docs-parser are currently hardcoded to be electron-specific.
- Potential solution: electron/docs-parser#21, additional configuration by file that @MarshallOfSound said he'd be interested in shipping.
[ ] Bug in the multi-class mode which results in a parsing error in docs that have mutliple classes (@electronjs/docs-parser#27)
- Potential solution: slated to be fixed.

sam-github commented 4 years ago

I'm generally in favour of auto-generation from docs, but Node.js doc structure might be a bit further from Electron's than is workable. Perhaps a POC should look at the harder edges.

From a quick look at the style guilde, this seems a bit restrictive: https://github.com/electron/electron/blob/master/docs/styleguide.md#page-title

Not all APIs need requiring:

And then there are things that aren't "API"s in the js API sense, CLI options, C++ Addons, etc.

bnb commented 4 years ago

@sam-github did you take a look at the POC I linked? Here it is again: bnb/node-docs-parser ❤️

It is worth noting that my POC runs a multi-module mode that @MarshallOfSound implemented that addresses Node.js's needs around certain headings and multiple classes - this mode is not currently reflected in the Style Guide. As noted in the Current Challenges and Potential Blockers section, there's an open PR to decouple Electron-specific bits. In my POC, that specific require requirement you pointed to was not a problem as far as I could tell.

sam-github commented 4 years ago

did you take a look at the POC I linked?

Yes, which is why I pointed out both APIs are "nice" js APIs, with a well-defined require, but not all Node APIs are like that.

codebytere commented 4 years ago

@sam-github i see that but also i'm curious how you feel that relates to the above goal? As an native C++ API there wouldn't need to be types for it, would there? It would to my knowledge be exempted from docs-parser's style requirements, as they only apply to APIs for which type definitions would need to be created.

I also think that styleguide definitely isn't hard and fast and would accept alterations to fit different needs, so I don't think js APIs would need to be "nice" in order to fit into the end-product :)

sam-github commented 4 years ago

Not c++:

Not all APIs need requiring:

https://nodejs.org/api/globals.html

https://nodejs.org/api/errors.html

sam-github commented 4 years ago

Again, I'm not trying to discourage trying this, but I'm trying to point out that as a POC goes, choosing well-formed simple APIs like query_string, workers, and v8 isn't likely to find the stuff that's hard to work with.

APIs like the two I mentioned, globals and timers, or child_process, which has lots of interwoven examples and lots of args are more likely to stress the tooling. Or module, which does have a require("module"), but the page for it in the docs is mostly about how modules work and the APIs are not so prominent, so its not clear how that would work with the tooling. Split it into two pages, module for the API, and "module overview" for the more prose stuff?

Fwiw, I personally think the mapping of .md to .html to single .json is not necessary to maintain, if all the api json was one single file, that'd be fine (though perhaps a breaking change, to someone, somewhere).

bnb commented 4 years ago

@sam-github for Errors and Globals I have a much clearer idea of how they could be implemented in the POC - theoretically could be straightforward, but would be different than how it looks now. It would likely make use of the structures tooling provided by docs-parser that is unused in my POC (you can find Electron's usage of this feature here) plus moving some of the prose tutorial content into a tutorial directory. Will work on it in the coming days ❤️

mhdawson commented 4 years ago

@bnb how much help does the parser give in terms of what you need to update. Maybe you have it in the above but an example of the output on an existing Node.js file and what is reported by the parser might help illustrate the process of doing a conversion.

bnb commented 4 years ago

@mhdawson at present, the way I figured things out was by just peeking at the JSON to see if it output what was expected. There are a few errors it will throw if something goes really wrong, but currently AFAIK there is not direct error reporting on syntax errors. Personally, I found it relatively straightforward to convert the docs I did since the required structure is extremely consistent.

One of the things that I've had some light discussions about lately with the Electron folks is a markdown linter that enforces the style guide.

Theoretically this can be implemented using existing markdown tooling and leveraging their custom rules engines - using something like markdownlint is a possibility, and something I'd be happy to help explore as tooling that would mutually benefit both projects. I've already got a conversation going with @davidanson about the potential challenges of it, since the structure is much more... structured than the relative looseness of normal markdown ❤️

Additionally, it's worth noting that we could begin to implement the required style without implementing the tooling, and then implement the tooling.

DavidAnson commented 4 years ago

My initial thought after reading through the first time is that linting - as well as hinting in VS Code - could reasonably be implemented as one or more custom rules in markdownlint. (The header structure in particular has a lot in common with rule MD043.) This project might choose to turn off some of the other rules that aren't relevant, but may find value in enforcing maximum line lengths, etc.. It seems like there's a bigger discussion about whether the broader effort is practical and worthwhile, but the linting aspects appear to be something we could get working.

jasnell commented 3 years ago

@bnb ... where are things at with this? Should this remain open?

bnb commented 3 years ago

I'm happy to continue working on it. I've honestly not had a ton of time to continue working on it since some other commitments came up, but those literally ended yesterday. Happy to spend time next week fleshing out more examples outside of the basic ones I provided.

Also more than happy to have other folks help if they're interested - I can pair anytime on it, or help provide additional context.

bnb commented 3 years ago

As a note: I met with @Ethan-Arrowood yesterday and we discussed how to begin approaching this. We roughly agreed on the best path to adoption in Node.js would be to start with stripping out the styleguide parsing from docs-parser, making it an independent module, and introduce it as a linter to get all our docs uniform. This is based on a recommendation from this thread.

Electron's docs-parser could just consume that module (there likely won't be a barrier to consuming it from the electron side, presuming we build it in the Electron org which should be straightforward), making the linter and docs-parser uniform in their parsing.

From there, we can begin to adapt the Node.js docs to use docs-parser's JSON output both as our own source of truth for docs and to generate types if that's something we want to do.

bnb commented 3 years ago

As a note: help is appreciated if folks are interested in working on this with @Ethan-Arrowood and myself.

nodejs / node