Introduce automated documentation linting #643

Open StevenMaude opened 2 years ago

Goal :checkered_flag:

Improve consistency across the documentation for both readers and writers. Or at least highlight the inconsistencies.

This suggestion is prompted:

by @lucyb's mention of the alex linter
by my reading around "docs as code" tooling

Tooling :hammer:

There are several linting tools out there, but Vale lets you apply style rules or guides adapted from various other linters. You can also write your own rules in YAML.

Vale aside: there are lots of other potential tools around. But Vale is mentioned a lot.

How do other people work? :eyes:

For example, GitLab's technical writers use:

Text content and writing style: markdownlint, Vale

Text formatting: markdownlint, yamllint

Link validity: nanoc

File permissions and naming: lint-doc.sh

I'm more interested in the stages than in the specific tools chosen by the GitLab team. We don't have any of these right now. I also opened #642 for link checking, as I believe that's a self-contained problem.

Note that tools like markdownlint are actually focused on markup consistency, not content consistency. This might be another benefit. We — understandably — have a mixture of Markdown styles across our documentation due to different author contributions.

Considerations :thought_balloon:

Is this a good idea, in principle?
Are the checks useful or distractingly noisy?
Would this add friction for less regular contributors?
Should failed linting cause an outright failure of a build, or just alert "content managers" so they can fix things up later? If failing, what types of lint failures should fail a build?
What checks do we want (relates to #519)?
Are there issues with keeping imported content consistent? We include content in the documentation from the source of cohort-extractor and Data Builder. This content may not be subject to the same checks.

I've been using codespell on a personal project. Unlike a spell checker, it only catches common misspellings. I've given it a quick go with ehr and ons ignored (lots of false positives!), and it gives us output like this:

In my own project I'm running it as a pre-commit hook, but it could easily be a CI step too.

I've found it useful since my project has lots of non-dictionary words so standard spell checking is far too noisy.

I ran codespell against the current docs and did fix some things as a result, so I'm for adding that in this repository too sometime.

That's a nice, simpler intermediate step compared with adding a full spellcheck.
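To illustrate why this approach is less noisy than a dictionary-based spellcheck, here is a minimal sketch of the idea in Python. It is not codespell's implementation: the `COMMON_MISSPELLINGS` table, its entries, and the `find_misspellings` function are all made up for illustration. The point is that only words on a known-misspellings list are flagged, so project-specific terms pass unless they happen to collide with an entry, which is where an ignore list comes in.

```python
import re

# Hypothetical, tiny misspelling table for illustration only.
# These entries are NOT copied from codespell's real dictionary.
COMMON_MISSPELLINGS = {
    "teh": "the",
    "recieve": "receive",
    "seperate": "separate",
    "ons": "ones",  # the kind of entry that would flag the acronym "ONS"
}


def find_misspellings(text, ignore=()):
    """Return (line_number, word, suggestion) for each known misspelling."""
    ignored = {word.lower() for word in ignore}
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for word in re.findall(r"[A-Za-z]+", line):
            key = word.lower()
            # Unknown words are never flagged; only listed misspellings are.
            if key in COMMON_MISSPELLINGS and key not in ignored:
                findings.append((line_number, word, COMMON_MISSPELLINGS[key]))
    return findings


sample = "We recieve ONS data.\nRun teh cohort-extractor."
for line_number, word, suggestion in find_misspellings(sample, ignore=["ons"]):
    print(f"line {line_number}: {word} ==> {suggestion}")
```

Words such as "cohort-extractor" are never reported because they are simply absent from the table, whereas a dictionary-based checker would need each one added to an allow list.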

It may be worth considering using the GitHub Action, because of the nice annotations; example: codespell-project/actions-codespell#16.

Markdown linting tools

https://github.com/remarkjs/remark-lint (Node)
https://github.com/DavidAnson/markdownlint (Node)
https://github.com/markdownlint/markdownlint (Ruby)

Features

I think both markdownlint packages have the same rules available, so we'd probably be leaning towards one of the Node packages over Ruby. Remark does have some rules that are only available there.

Any of these tools gives you a rule set [^1] to enable or disable to configure your own style. Rules are a mixture of very debatable suggestions (for example, list bullet style) and suggestions that likely make universal sense and are mistakes by the writer (for example, heading levels shouldn't be skipped; brackets should come before parentheses in link syntax). You can also ignore warnings for false positives.
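As a rough sketch of what the two "universal sense" checks mentioned above involve, here is a simplified Python version. It is not how any of the listed linters implement these rules; the function names and regular expressions are assumptions for illustration, and only ATX-style (`#`) headings are handled.

```python
import re

# ATX heading: one to six "#" characters followed by whitespace and text.
HEADING = re.compile(r"^(#{1,6})\s+\S")
# Reversed link syntax: (text)[url] instead of [text](url).
REVERSED_LINK = re.compile(r"\(([^()\[\]]+)\)\[([^\[\]()]+)\]")


def skipped_heading_levels(markdown):
    """Return line numbers of headings that jump more than one level deeper."""
    problems = []
    previous_level = 0
    in_fence = False
    for line_number, line in enumerate(markdown.splitlines(), start=1):
        # Ignore anything inside fenced code blocks, such as shell comments.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        if previous_level and level > previous_level + 1:
            problems.append(line_number)
        previous_level = level
    return problems


def reversed_links(markdown):
    """Return line numbers that contain (text)[url] style links.

    Simplified: unlike the heading check, this does not skip fenced code.
    """
    return [
        line_number
        for line_number, line in enumerate(markdown.splitlines(), start=1)
        if REVERSED_LINK.search(line)
    ]
```

Both checks are about the markup rather than the wording, which is the scope a Markdown linter is best kept to.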

There is a little overlap between some of these rules and what Vale does. For example, you can check capitalisation of proper nouns with the Node-based markdownlint, which you can also do with Vale. I think the preference would be to keep the Markdown linter's scope to specifically checking the Markdown.

Process of trialling

Enabling one of these tools is probably a process of:

deciding on a tool to try
picking a rule
linting the current docs Markdown against that rule
fixing up any issues with that rule
optionally repeating the "pick a rule, lint, fix source" process for a few more rules
deciding whether to warn or fail in a PR. There are Actions out there which give you annotations, much like the codespell example above.

We'd probably want to warn first, and see how useful the checks are, before enforcing.
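The "warn first, enforce later" approach could be sketched as below. This is an assumption about how a wrapper script might work, not a description of any existing Action: the `Finding` type and both functions are hypothetical, and the file paths and messages in the usage example are made up. The output lines use the GitHub Actions `::warning` and `::error` workflow commands, which are what produce annotations on a PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One lint result; a hypothetical structure for this sketch."""
    rule: str
    path: str
    line: int
    message: str


def exit_code(findings, enforced_rules):
    """Return 1 only if a finding belongs to an enforced rule, else 0."""
    return 1 if any(f.rule in enforced_rules for f in findings) else 0


def report(findings, enforced_rules):
    """Format findings as GitHub Actions workflow commands."""
    lines = []
    for f in findings:
        # Enforced rules are errors; everything else is only a warning.
        level = "error" if f.rule in enforced_rules else "warning"
        lines.append(
            f"::{level} file={f.path},line={f.line}::{f.rule}: {f.message}"
        )
    return lines


findings = [
    Finding("heading-increment", "docs/index.md", 3, "Heading level skipped"),
    Finding("ul-style", "docs/index.md", 9, "Inconsistent bullet style"),
]
# Start with nothing enforced: everything is a warning and the build passes.
print("\n".join(report(findings, enforced_rules=set())))
print("exit code:", exit_code(findings, enforced_rules=set()))
```

Moving a rule from warning to failing is then a matter of adding its name to `enforced_rules` once the existing docs pass it.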

Questions

I'm not sure if/how you can enforce a style across a set of documents. Some of the rule checks check for consistent style — for example, list bullet points — within a single document.
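One possible way to check a style across documents, rather than within each one, is to require a single fixed marker everywhere instead of "whatever this file already uses". The Python below is a minimal sketch under that assumption; the function names are hypothetical and the bullet detection is deliberately crude (for instance, it would misread a `* * *` thematic break as a list item). As far as I can tell, the markdownlint list style rule can also be configured with a fixed marker instead of "consistent", which would have a similar effect, but that needs confirming against the rule documentation.

```python
import re
from collections import Counter

# An unordered list item: optional indent, a marker, whitespace, then text.
BULLET = re.compile(r"^\s*([-*+])\s+\S")


def bullet_styles(markdown):
    """Count the list markers used in one document, skipping fenced code."""
    counts = Counter()
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = BULLET.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def documents_off_style(documents, required="-"):
    """Return sorted names of documents using any marker except `required`."""
    return sorted(
        name
        for name, text in documents.items()
        if set(bullet_styles(text)) - {required}
    )


docs = {
    "a.md": "- one\n- two\n",
    "b.md": "* one\n* two\n",
}
print(documents_off_style(docs))
```

Each file in this example is internally consistent, so a per-document check would pass both; only the fixed-marker check reports `b.md`.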

Benefits

Catch subtle mistakes (for example, broken links)
Highlight the kind of fixes that currently get made only when someone happens to spot them, and prevent new issues of this kind being introduced (for example, extraneous blank lines or trailing whitespace)
Ensure more consistency between authors, making it easier to read the source, and know that the intended style is that of the surrounding text
Reduce work for reviewers slightly, who shouldn't need to catch or highlight these routine edit mistakes

[^1]: Rule sets:

Here are some ideas for suggestions we could provide with Vale.
