Archived: see https://unifiedjs.com for current docs, guides, articles.
MIT License

unified handbook

⚠️ This is a work in progress

The compiler for your content.

This handbook describes the unified ecosystem. It goes in depth about the numerous syntaxes it supports, usage, and practical guides on writing plugins. Additionally, it will attempt to define murky, computer science-y concepts that unified attempts to abstract away.

Introduction

unified enables exciting new projects: Gatsby uses it to pull in markdown, MDX to embed JSX, and Prettier to format it. It’s used in about 300k projects on GitHub and has about 10m downloads each month on npm: you’re probably using it.

It powers remarkjs, rehypejs, mdx-js, retextjs, and redotjs. It's used to build other projects like prettier, gatsbyjs, and more.

Some notable users are Node.js, ZEIT, Netlify, GitHub, Mozilla, WordPress, Adobe, Facebook, Google.

How does it work?

unified uses abstract syntax trees, or ASTs, that plugins can operate on. It can even process between different formats. This means you can parse a markdown document, transform it to HTML, and then transpile back to markdown.

unified leverages a syntax tree specification (called unist or UST) so that utilities can be shared amongst different formats. In practice, you can use unist-util-visit to visit nodes using the same library with the same API on any supported AST.

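// In mdast, images are nodes of type 'image'.
visit(markdownAST, 'image', transformImages)
// In hast, images are 'element' nodes whose tagName is 'img'.
visit(htmlAST, {tagName: 'img'}, transformImgs)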
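
To make the idea of a shared tree shape concrete, here is a minimal sketch that runs one walker over a markdown-like tree and an HTML-like tree. The `walk` helper is a hand-rolled stand-in for unist-util-visit, and both trees are hypothetical, simplified examples rather than real parser output.

```javascript
// Hand-rolled stand-in for unist-util-visit (an assumption for this sketch):
// calls `visitor` for every node that passes `test`, parents before children.
function walk(node, test, visitor) {
  if (test(node)) visitor(node)
  for (const child of node.children || []) walk(child, test, visitor)
}

// A tiny mdast-like tree (markdown).
const markdownTree = {
  type: 'root',
  children: [
    {type: 'paragraph', children: [{type: 'image', url: 'cat.png', alt: 'A cat'}]}
  ]
}

// A tiny hast-like tree (HTML).
const htmlTree = {
  type: 'root',
  children: [
    {
      type: 'element',
      tagName: 'p',
      properties: {},
      children: [
        {type: 'element', tagName: 'img', properties: {src: 'dog.png'}, children: []}
      ]
    }
  ]
}

// The same walker works on both, because both trees share the unist shape.
const found = []
walk(markdownTree, node => node.type === 'image', node => found.push(node.url))
walk(htmlTree, node => node.tagName === 'img', node => found.push(node.properties.src))

console.log(found) // [ 'cat.png', 'dog.png' ]
```

Only the test differs between the two calls. The traversal logic is shared, which is the point of having a common specification.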

Supported syntaxes

unified supports a few different syntaxes. Each has its own formal specification and is compatible with all unist utility libraries.

Each syntax has its own GitHub organization and subset of plugins and libraries.

Abstract syntax trees

An abstract syntax tree, or AST, is a representation of input. It's an abstraction that enables developers to analyze, transform and generate code.

They're the integral data structure in the unified ecosystem. Most plugins operate solely on the AST, receiving it as an argument and then returning a new AST afterwards.

Your most basic plugin looks like the following (where the tree is an AST):

module.exports = options => tree => {
  return tree
}

It accepts the AST as an argument, and then returns it. You can make it do something slightly more interesting by counting the heading nodes.

const visit = require('unist-util-visit')

module.exports = options => tree => {
  let headingsCount = 0

  // Count every heading node in the tree.
  visit(tree, 'heading', node => {
    headingsCount++
  })

  console.log('Headings found:', headingsCount)
}

Or, turn all h1s in a document into h2s:

const visit = require('unist-util-visit')

module.exports = options => tree => {
  visit(tree, 'heading', node => {
    if (node.depth === 1) {
      node.depth = 2
    }
  })
}

If you ran the plugin above with # Hello, world! and compiled it back to markdown, the output would be ## Hello, world!.

unified uses ASTs because plugins are much easier to write when operating on objects rather than the strings themselves. You could achieve the same result with a string replacement:

markdown.replace(/^#\s+/gm, '## ')

But this would be brittle and doesn't handle the thousands of edge cases with complex grammars which make up the syntax of markdown, HTML, and MDX.
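
As a small illustration of one such edge case, the sketch below applies a multiline variant of the replacement above to a document that contains a fenced code block. The sample document is made up for this example.

```javascript
// A made-up document: one heading, then a fenced shell snippet
// whose first line is a shell comment, not a heading.
const source = [
  '# Title',
  '',
  '~~~sh',
  '# install dependencies',
  'npm install',
  '~~~'
].join('\n')

// The regular expression cannot tell a heading from a comment inside code.
const replaced = source.replace(/^#\s+/gm, '## ')

console.log(replaced)
// ## Title
//
// ~~~sh
// ## install dependencies
// npm install
// ~~~
```

The heading is changed as intended, but so is the shell comment inside the code block. A plugin that visits heading nodes never sees that line as a heading, because the parser has already classified it as code.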

Constructing an AST

In order to form an AST, unified takes an input string and passes that to a tokenizer. A tokenizer breaks up the input into tokens based on the syntax. In unified the tokenizer and lexer are coupled. When syntax is found, the matching part of the string is "eaten" and given metadata such as a node type (this is the "lexer" part).

Then, the parser turns this information into an AST. All together the pipeline looks like:

[INPUT] => [TOKENIZER/LEXER] => [PARSER] => [AST]

Parse example

Consider this markdown input:

# Hello, **world**!

The tokenizer will match the "#" and create a heading node. Then it will begin searching for inline syntax where it will encounter "**" and create a strong node.

It's important to note that the parser first looks for block-level syntax which includes headings, code blocks, lists, paragraphs, and block quotes.

Once a block has been opened, inline tokenization begins which searches for syntax including bold, code, emphasis, and links.

The markdown will result in the following AST:

{
  "type": "heading",
  "depth": 1,
  "children": [
    {
      "type": "text",
      "value": "Hello, ",
      "position": {}
    },
    {
      "type": "strong",
      "children": [
        {
          "type": "text",
          "value": "world",
          "position": {}
        }
      ],
      "position": {}
    },
    {
      "type": "text",
      "value": "!",
      "position": {}
    }
  ],
  "position": {}
}
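
The two phases can be sketched with a toy parser. This is not how remark is implemented: it is a deliberately tiny illustration that only understands headings and strong text, omits positional information, and uses hypothetical function names (`parseBlock`, `parseInline`).

```javascript
// Inline phase: split the content on **...** and emit text and strong nodes.
function parseInline(value) {
  const nodes = []
  const strong = /\*\*(.+?)\*\*/g
  let last = 0
  let match

  while ((match = strong.exec(value)) !== null) {
    // Text before the strong marker, if any.
    if (match.index > last) {
      nodes.push({type: 'text', value: value.slice(last, match.index)})
    }
    nodes.push({type: 'strong', children: [{type: 'text', value: match[1]}]})
    last = match.index + match[0].length
  }

  // Text after the last strong marker, if any.
  if (last < value.length) {
    nodes.push({type: 'text', value: value.slice(last)})
  }

  return nodes
}

// Block phase: recognize a heading first, otherwise fall back to a paragraph.
// Inline parsing only starts once the block has been opened.
function parseBlock(line) {
  const heading = /^(#{1,6})\s+(.*)$/.exec(line)
  if (heading) {
    return {
      type: 'heading',
      depth: heading[1].length,
      children: parseInline(heading[2])
    }
  }
  return {type: 'paragraph', children: parseInline(line)}
}

const toyTree = parseBlock('# Hello, **world**!')
console.log(JSON.stringify(toyTree, null, 2))
```

The printed tree has the same shape as the AST above, minus the position fields.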

A compiler turns an AST into output (typically a string). It provides functions that handle each node type and compiles them to the desired end result.

For example, a compiler for markdown would encounter a link node and transform it into []() markdown syntax.

[AST] => [COMPILER] => [OUTPUT]

It would turn the AST example above back into the source markdown when compiling to markdown. It could also be compiled to HTML and would result in:

<h1>
  Hello, <strong>world</strong>!
</h1>
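
A compiler of this kind can be sketched as a table of handlers, one per node type. The handler tables and the `compile` function below are hypothetical names for this illustration, and escaping of special characters is left out to keep it short.

```javascript
// One handler per node type. `all` compiles the children of a node.
const toHtml = {
  heading: (node, all) => `<h${node.depth}>${all(node)}</h${node.depth}>`,
  strong: (node, all) => `<strong>${all(node)}</strong>`,
  text: node => node.value
}

const toMarkdown = {
  heading: (node, all) => `${'#'.repeat(node.depth)} ${all(node)}`,
  strong: (node, all) => `**${all(node)}**`,
  text: node => node.value
}

// Look up the handler for the node type and let it produce the output.
function compile(node, handlers) {
  const handler = handlers[node.type]
  if (!handler) throw new Error('No handler for node type: ' + node.type)
  const all = parent =>
    parent.children.map(child => compile(child, handlers)).join('')
  return handler(node, all)
}

// The heading AST from the parse example, without positions.
const headingTree = {
  type: 'heading',
  depth: 1,
  children: [
    {type: 'text', value: 'Hello, '},
    {type: 'strong', children: [{type: 'text', value: 'world'}]},
    {type: 'text', value: '!'}
  ]
}

const htmlOut = compile(headingTree, toHtml)
const markdownOut = compile(headingTree, toMarkdown)

console.log(htmlOut) // <h1>Hello, <strong>world</strong>!</h1>
console.log(markdownOut) // # Hello, **world**!
```

Swapping the handler table is all it takes to target a different output, which is why the same tree can be written back as markdown or as HTML.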

unist

unist is a specification for syntax trees which ensures that libraries that work with unified are as interoperable as possible. All ASTs in unified conform to this spec. It's the bread and butter of the ecosystem.

Motivation

A standard AST allows developers to use the same visitor function on all formats, whether it's markdown, HTML, natural language, or MDX. Using the same library ensures that the core functionality is as solid as possible while cutting down on cognitive overhead when trying to perform common tasks.

Visitors

When working with ASTs it's common to need to traverse the tree. This is typically referred to as "visiting". A handler for a particular type of node is called a "visitor".

unified comes with visitor utilities so you don't have to reinvent the wheel every time you want to operate on particular nodes.

unist-util-visit

unist-util-visit is a library that improves the developer experience of tree traversal for unist trees. It's a function that takes a tree, a node type, and a callback which it invokes with any matching nodes that are found.

visit(tree, 'image', node => {
  console.log(node)
})

Note: This performs a depth-first tree traversal in preorder (NLR).
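
To show what such a utility does under the hood, here is a simplified sketch of a preorder visitor. `visitSketch` is a hypothetical stand-in, not the library's source. It passes the node, its index, and its parent to the callback, which mirrors the arguments unist-util-visit gives to a visitor.

```javascript
// Simplified stand-in for unist-util-visit: preorder, matching on type only.
function visitSketch(tree, type, visitor) {
  function one(node, index, parent) {
    // N: visit the node itself first...
    if (node.type === type) visitor(node, index, parent)
    // ...then traverse its children from head to tail.
    const children = node.children || []
    for (let i = 0; i < children.length; i++) {
      one(children[i], i, node)
    }
  }
  one(tree, null, null)
}

const doc = {
  type: 'root',
  children: [
    {type: 'heading', depth: 1, children: [{type: 'text', value: 'Pets'}]},
    {
      type: 'paragraph',
      children: [
        {type: 'text', value: 'Look: '},
        {type: 'image', url: 'cat.png'}
      ]
    }
  ]
}

const seen = []
visitSketch(doc, 'image', (node, index, parent) => {
  seen.push(`${node.url} is child ${index} of ${parent.type}`)
})

console.log(seen) // [ 'cat.png is child 1 of paragraph' ]
```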

Visit nodes based on context

Something that's useful with unist utilities is that they can be used on subtrees. A subtree is any node in the tree together with its descendants (if any).

For example, if you only wanted to visit images within heading nodes, you could first visit headings, and then visit images contained within each heading node you encounter.

visit(tree, 'heading', headingNode => {
  visit(headingNode, 'image', node => {
    console.log(node)
  })
})

unist-util-remove

Advanced operations

Once you're familiar with some of the primary unist utilities, you can combine them together to address more complex needs.

Optimizing traversal

When you care about multiple node types and are operating on large documents it might be preferable to walk all nodes and add a check for each node type with unist-util-is.
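
A rough sketch of that single-pass approach follows. The `isType` and `walkAll` helpers are hand-rolled stand-ins (the former for unist-util-is), and the sample tree is made up.

```javascript
// Stand-in for unist-util-is, reduced to a type comparison.
const isType = (node, type) => node.type === type

// Walk every node exactly once, regardless of type.
function walkAll(node, visitor) {
  visitor(node)
  for (const child of node.children || []) walkAll(child, visitor)
}

const largeDoc = {
  type: 'root',
  children: [
    {type: 'heading', depth: 1, children: [{type: 'text', value: 'Links'}]},
    {
      type: 'paragraph',
      children: [
        {
          type: 'link',
          url: 'https://unifiedjs.com',
          children: [{type: 'text', value: 'unified'}]
        },
        {type: 'image', url: 'logo.png'}
      ]
    },
    {type: 'heading', depth: 2, children: [{type: 'text', value: 'More'}]}
  ]
}

// One traversal handles all three node types, instead of three traversals.
const counts = {heading: 0, image: 0, link: 0}
walkAll(largeDoc, node => {
  if (isType(node, 'heading')) counts.heading++
  else if (isType(node, 'image')) counts.image++
  else if (isType(node, 'link')) counts.link++
})

console.log(counts) // { heading: 2, image: 1, link: 1 }
```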

Removing nodes based on parent context

In some cases you might want to remove nodes based on their parent context. Consider a scenario where you want to remove all images contained within a heading.

You can achieve this by combining unist-util-visit with unist-util-remove. The idea is that you first visit the parent, which would be heading nodes, and then remove images from the subtree.

const visit = require('unist-util-visit')
const remove = require('unist-util-remove')

visit(tree, 'heading', headingNode => {
  remove(headingNode, 'image')
})
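
The effect can be checked without installing anything by using simplified stand-ins for both utilities. `eachOfType` and `removeType` below are hypothetical helpers written for this sketch and are much less capable than the real libraries.

```javascript
// Stand-in for unist-util-visit: call `visitor` on every node of `type`.
function eachOfType(node, type, visitor) {
  if (node.type === type) visitor(node)
  for (const child of node.children || []) eachOfType(child, type, visitor)
}

// Stand-in for unist-util-remove: drop all descendants of `type` in place.
function removeType(node, type) {
  if (!node.children) return
  node.children = node.children.filter(child => child.type !== type)
  node.children.forEach(child => removeType(child, type))
}

const page = {
  type: 'root',
  children: [
    {
      type: 'heading',
      depth: 1,
      children: [
        {type: 'text', value: 'Logo '},
        {type: 'image', url: 'logo.png'}
      ]
    },
    {type: 'paragraph', children: [{type: 'image', url: 'photo.png'}]}
  ]
}

// Only the heading subtrees are handed to the removal step.
eachOfType(page, 'heading', headingNode => {
  removeType(headingNode, 'image')
})

console.log(page.children[0].children.map(node => node.type)) // [ 'text' ]
console.log(page.children[1].children.map(node => node.type)) // [ 'image' ]
```

The image in the paragraph survives because the removal only ever runs on heading subtrees.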

Watch this lesson on egghead →

unist resources

unified

unified is the interface for working with syntax trees and can be used in the same way for any of the supported syntaxes.

For unified to work it requires two key pieces: a parser and a compiler.

Parser

A parser takes a string and tokenizes it based on syntax. A markdown parser would turn # Hello, world! into a heading node.

unified has a parser for each of its supported syntax trees.

Compiler

A compiler turns an AST into its "output". This is typically a string. In some cases folks want to parse a markdown document, transform it, and then write back out markdown (like Prettier). In other cases folks might want to turn markdown into HTML.

unified already supports compilers for most common outputs including markdown, HTML, text, and MDX. It even offers compilers for less common use cases including compiling markdown to CLI manual pages.

Transpiler

unified also offers transpilers. This is how one syntax tree is converted to another format. The most common transpiler is mdast-util-to-hast which converts the markdown AST (mdast) to the HTML AST (hast).
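
As a rough idea of what such a conversion involves, the sketch below maps a handful of markdown node types to HTML element nodes. It is a toy with hypothetical function names (`toHast`, `element`). The real mdast-util-to-hast handles many more node types and details.

```javascript
// Build a hast-like element and convert the children of the mdast node.
function element(tagName, node) {
  return {
    type: 'element',
    tagName,
    properties: {},
    children: node.children.map(toHast)
  }
}

// Map each supported mdast node type to its hast counterpart.
function toHast(node) {
  switch (node.type) {
    case 'root':
      return {type: 'root', children: node.children.map(toHast)}
    case 'heading':
      return element('h' + node.depth, node)
    case 'paragraph':
      return element('p', node)
    case 'strong':
      return element('strong', node)
    case 'text':
      return {type: 'text', value: node.value}
    default:
      throw new Error('Unsupported node type: ' + node.type)
  }
}

const mdast = {
  type: 'root',
  children: [
    {
      type: 'heading',
      depth: 1,
      children: [
        {type: 'text', value: 'Hello, '},
        {type: 'strong', children: [{type: 'text', value: 'world'}]},
        {type: 'text', value: '!'}
      ]
    }
  ]
}

const hast = toHast(mdast)
console.log(hast.children[0].tagName) // h1
console.log(hast.children[0].children[1].tagName) // strong
```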

Usage

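First, create a processor by calling unified:

unified()

Then give it plugins. To process a string it needs at least a parser and a compiler:

.use(remarkParse)
.use(remarkStringify)

Finally, give it a string to operate on: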

.process('# Hello, world!', (err, file) => {
  console.log(String(file))
})

A more realistic example turns a markdown document into an HTML string, which looks something like:

var unified = require('unified')
var markdown = require('remark-parse')
var remark2rehype = require('remark-rehype')
var doc = require('rehype-document')
var format = require('rehype-format')
var html = require('rehype-stringify')
var report = require('vfile-reporter')

unified()
  .use(markdown)
  .use(remark2rehype)
  .use(doc, {title: '👋🌍'})
  .use(format)
  .use(html)
  .process('# Hello world!', function(err, file) {
    console.error(report(err || file))
    console.log(String(file))
  })

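The code does the following:

  1. remark-parse parses the markdown string into a markdown AST (mdast)
  2. remark-rehype transpiles the markdown AST into an HTML AST (hast)
  3. rehype-document wraps the tree in a complete HTML document with the given title
  4. rehype-format formats (indents) the HTML tree
  5. rehype-stringify compiles the HTML AST to a string
  6. vfile-reporter reports errors and warnings, if any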

It'll result in an HTML string:

<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>👋🌍</title>
    <meta name="viewport" content="width=device-width, initial-scale=1">
  </head>
  <body>
    <h1>Hello world!</h1>
  </body>
</html>

remark

remark is a plugin-based markdown processor. It has the ability to parse markdown, transform it with plugins, and then write back to markdown or transpile it to another format like HTML.

It's highly configurable. Even plugins can customize the parser and compiler if needed.

You can use the remark library directly in your scripts:

remark()
  .processSync('# Hello, world!')

Though, it's really a shortcut for:

unified()
  .use(remarkParse)
  .use(remarkStringify)
  .processSync('# Hello, world!')

remark CLI

remark offers a CLI that can be used to automate tasks.

Inspect

A useful option with the remark CLI is inspecting the AST of a document. This can be useful when you're trying to remember the name of a node type or you want an overview of the overall structure.

❯ remark doc.md --inspect
root[13] (1:1-67:1, 0-2740)
├─ paragraph[1] (1:1-1:64, 0-63)
│  └─ text: "import TableOfContents from '../src/components/TableOfContents'" (1:1-1:64, 0-63)
├─ heading[1] (3:1-3:15, 65-79) [depth=1]
│  └─ text: "Fecunda illa" (3:3-3:15, 67-79)
├─ html: "<TableOfContents headings={props.headings} />" (5:1-5:46, 81-126)
├─ heading[1] (7:1-7:18, 128-145) [depth=2]
│  └─ text: "Sorore extulit" (7:4-7:18, 131-145)
├─ paragraph[1] (9:1-12:75, 147-454)
│  └─ text: "Lorem markdownum sorore extulit, non suo putant tritumque amplexa silvis: in,\nlascivaque femineam ara etiam! Oppida clipeus formidine, germanae in filia\netiamnunc demisso visa misce, praedaeque protinus communis paverunt dedit, suo.\nSertaque Hyperborea eatque, sed valles novercam tellure exhortantur coegi." (9:1-12:75, 147-454)
├─ list[3] (14:1-16:58, 456-573) [ordered=true][start=1][spread=false]
│  ├─ listItem[1] (14:1-14:22, 456-477) [spread=false]
│  │  └─ paragraph[1] (14:4-14:22, 459-477)
│  │     └─ text: "Cunctosque plusque" (14:4-14:22, 459-477)
│  ├─ listItem[1] (15:1-15:38, 478-515) [spread=false]
│  │  └─ paragraph[1] (15:4-15:38, 481-515)
│  │     └─ text: "Cum ego vacuas fata nolet At dedit" (15:4-15:38, 481-515)
│  └─ listItem[1] (16:1-16:58, 516-573) [spread=false]
│     └─ paragraph[1] (16:4-16:58, 519-573)
│        └─ text: "Nec legerat ostendisse ponat sulcis vincirem cinctaque" (16:4-16:58, 519-573)

Use a plugin

You can use plugins with the CLI:

remark doc.md --use toc

This will output a markdown string with a table of contents added. If you'd like, you can overwrite the document with the generated table of contents:

remark doc.md -o --use toc

Lint

You can use a lint preset to ensure your markdown style guide is adhered to:

❯ remark doc.md --use preset-lint-markdown-style-guide

  15:1-15:38  warning  Marker should be `1`, was `2`  ordered-list-marker-value  remark-lint
  16:1-16:58  warning  Marker should be `1`, was `3`  ordered-list-marker-value  remark-lint
   34:1-60:6  warning  Code blocks should be fenced   code-block-style           remark-lint

⚠ 4 warnings

If you want to exit with a failure code (1) when the lint doesn't pass you can use the --frail option:

❯ remark doc.md --frail --use preset-lint-markdown-style-guide || echo '!!!failed'

  15:1-15:38  warning  Marker should be `1`, was `2`  ordered-list-marker-value  remark-lint
  16:1-16:58  warning  Marker should be `1`, was `3`  ordered-list-marker-value  remark-lint
   34:1-60:6  warning  Code blocks should be fenced   code-block-style           remark-lint

⚠ 4 warnings
!!!failed

Watch a video introduction to the CLI →

remark guides

Writing a plugin to modify headings

unist-util-visit is useful for visiting nodes in an AST based on a particular type. To visit all headings you can use it like so:

const visit = require('unist-util-visit')

module.exports = () => tree => {
  visit(tree, 'heading', node => {
    console.log(node)
  })
}

The above will log all heading nodes. Heading nodes also have a depth field, a number from 1 to 6 that indicates whether the heading is an h1 through h6. You can use that to narrow down which heading nodes you want to operate on.

Below is a plugin that prefixes "BREAKING" to all h1s in a markdown document.

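const visit = require('unist-util-visit')

module.exports = () => tree => {
  visit(tree, 'heading', node => {
    // Only operate on h1s.
    if (node.depth !== 1) {
      return
    }

    visit(node, 'text', textNode => {
      textNode.value = 'BREAKING ' + textNode.value
      // Stop after the first text node so the prefix is added once per heading.
      return visit.EXIT
    })
  })
}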

Watch the lesson on egghead →

rehype

rehype is an HTML processor in the same way that remark is for markdown.

rehype()
  .processSync('<title>Hi</title><h2>Hello world!')

retext

MDX

MDX is a syntax and language for embedding JSX in markdown. It allows you to embed components in your documents for writing immersive and interactive content.

An example MDX document looks like:

import SnowfallChart from '../components/snowfall-chart'

# Last year's snowfall

In the winter of 2018, the snowfall was above average. It was followed by
a warm spring which caused flood conditions in many of the nearby rivers.

<SnowfallChart year="2018" />

The MDX core library extends the remark parser with the remark-mdx plugin in order to define its own JSX-enabled syntax.

MDX transpilation pipeline

MDX uses remark and rehype internally. The flow of MDX consists of the following six steps:

  1. Parse: MDX text => MDAST
  2. Transpile: MDAST => MDXAST (remark-mdx)
  3. Transform: remark plugins applied to AST
  4. Transpile: MDXAST => MDXHAST
  5. Transform: rehype plugins applied to AST
  6. Generate: MDXHAST => JSX text

The final result is JSX that can be used in React/Preact/Vue/etc.

MDX allows you to hook into this flow at steps 3 and 5, where you can use remark and rehype plugins (respectively) to benefit from their ecosystems.

Tree traversal

Tree traversal is a common task when working with a tree to search it. Tree traversal is typically either breadth-first or depth-first.

In the following examples, we’ll work with this tree:

                 +---+
                 | A |
                 +-+-+
                   |
             +-----+-----+
             |           |
           +-+-+       +-+-+
           | B |       | F |
           +-+-+       +-+-+
             |           |
    +-----+--+--+        |
    |     |     |        |
  +-+-+ +-+-+ +-+-+    +-+-+
  | C | | D | | E |    | G |
  +---+ +---+ +---+    +---+

Breadth-first traversal

Breadth-first traversal is visiting a node and all its siblings to broaden the search at that level, before traversing children.

For the syntax tree defined in the diagram, a breadth-first traversal first searches the root of the tree (A), then its children (B and F), then their children (C, D, E, and G).
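
A breadth-first traversal of the diagram can be sketched with a queue. The `node` helper and the `name` field are assumptions made for this example. Real unist nodes carry a `type` instead.

```javascript
// Helper to build the tree from the diagram (hypothetical shape).
const node = (name, children = []) => ({name, children})

const diagram = node('A', [
  node('B', [node('C'), node('D'), node('E')]),
  node('F', [node('G')])
])

// Visit nodes level by level: take from the front of the queue,
// add the children of the visited node to the back.
function breadthFirst(root) {
  const order = []
  const queue = [root]
  while (queue.length > 0) {
    const current = queue.shift()
    order.push(current.name)
    queue.push(...current.children)
  }
  return order
}

const breadthOrder = breadthFirst(diagram)
console.log(breadthOrder.join(' ')) // A B F C D E G
```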

Depth-first traversal

Alternatively, and more commonly, depth-first traversal is used. The search is first deepened, by traversing children, before traversing siblings.

For the syntax tree defined in the diagram, a depth-first traversal first searches the root of the tree (A), then one of its children (B or F), then their children (C, D, and E, or G).

For a given node N with children, a depth-first traversal performs three steps, simplified to only binary trees (every node has head and tail, but no other children):

  N: visit N itself
  L: traverse head
  R: traverse tail

These steps can be done in any order, but for non-binary trees, L and R occur together. If L is done before R, the traversal is called left-to-right traversal, otherwise it is called right-to-left traversal. In the case of non-binary trees, the other children between head and tail are processed in that order as well, so for left-to-right traversal, first head is traversed (L), then its next sibling is traversed, etcetera, until finally tail (R) is traversed.

Because L and R occur together for non-binary trees, we can produce four types of orders: NLR, NRL, LRN, RLN.

NLR and LRN (the two left-to-right traversal options) are most commonly used and respectively named preorder and postorder.

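For the syntax tree defined in the diagram, preorder and postorder traversal both walk the tree in the same left-to-right order: from the root of the tree (A) to its head (B), then B's children from left to right (C, D, and then E). After all descendants of B are traversed, its next sibling (F) is traversed and then finally its only child (G). The two differ in when each node is visited: preorder visits a node before its children (A, B, C, D, E, F, G), while postorder visits a node after its children (C, D, E, B, G, F, A).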
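
The two left-to-right orders can be compared with a pair of short recursive functions. As before, the `node` helper and the `name` field are assumptions made for this sketch.

```javascript
// Helper to build the tree from the diagram (hypothetical shape).
const node = (name, children = []) => ({name, children})

const diagram = node('A', [
  node('B', [node('C'), node('D'), node('E')]),
  node('F', [node('G')])
])

// NLR: visit the node, then traverse its children from head to tail.
function preorder(current, order = []) {
  order.push(current.name)
  current.children.forEach(child => preorder(child, order))
  return order
}

// LRN: traverse the children from head to tail, then visit the node.
function postorder(current, order = []) {
  current.children.forEach(child => postorder(child, order))
  order.push(current.name)
  return order
}

const preorderResult = preorder(diagram)
const postorderResult = postorder(diagram)

console.log(preorderResult.join(' ')) // A B C D E F G
console.log(postorderResult.join(' ')) // C D E B G F A
```

Both functions walk the tree along the same path. The only difference is whether the node is recorded before or after its children.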

Glossary

Tree

A tree is a node and all of its descendants (if any).

Child

Node X is a child of node Y, if Y’s children include X.

Parent

Node X is the parent of node Y, if Y is a child of X.

Index

The index of a child is its number of preceding siblings, or 0 if it has none.

Sibling

Node X is a sibling of node Y, if X and Y have the same parent (if any).

The previous sibling of a child is its sibling at its index minus 1.

The next sibling of a child is its sibling at its index plus 1.

Root

The root of a node is itself, if without parent, or the root of its parent.

The root of a tree is any node in that tree without parent.

Descendant

Node X is a descendant of node Y, if X is a child of Y, or if X is a child of node Z that is a descendant of Y.

An inclusive descendant is a node or one of its descendants.

Ancestor

Node X is an ancestor of node Y, if Y is a descendant of X.

An inclusive ancestor is a node or one of its ancestors.

Head

The head of a node is its first child (if any).

Tail

The tail of a node is its last child (if any).

Leaf

A leaf is a node with no children.

Branch

A branch is a node with one or more children.

Generated

A node is generated if it does not have positional information.

Type

The type of a node is the value of its type field.

Positional information

The positional information of a node is the value of its position field.

File

A file is a source document that represents the original file that was parsed to produce the syntax tree. Positional information represents the place of a node in this file. Files are provided by the host environment and not defined by unist.

For example, see projects such as vfile.

Preorder

Preorder (NLR) is a depth-first tree traversal that performs the following steps for each node N:

  1. N: visit N itself
  2. L: traverse head (then its next sibling, recursively moving forward until reaching tail)
  3. R: traverse tail

Postorder

Postorder (LRN) is a depth-first tree traversal that performs the following steps for each node N:

  1. L: traverse head (then its next sibling, recursively moving forward until reaching tail)
  2. R: traverse tail
  3. N: visit N itself

Enter

Enter is a step right before other steps performed on a given node N when traversing a tree.

For example, when performing preorder traversal, enter is the first step taken, right before visiting N itself.

Exit

Exit is a step right after other steps performed on a given node N when traversing a tree.

For example, when performing preorder traversal, exit is the last step taken, right after traversing the tail of N.
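
Enter and exit can be pictured as two hooks around the handling of each node. The `traverse` function, its `hooks` argument, and the small sample tree below are hypothetical and exist only to illustrate the two terms.

```javascript
// Helper to build a small tree (hypothetical shape).
const node = (name, children = []) => ({name, children})

const small = node('A', [node('B', [node('C')]), node('D')])

// Call `enter` before anything else happens for a node,
// and `exit` after all of its children have been traversed.
function traverse(current, hooks) {
  hooks.enter(current)
  current.children.forEach(child => traverse(child, hooks))
  hooks.exit(current)
}

const steps = []
traverse(small, {
  enter: current => steps.push('enter ' + current.name),
  exit: current => steps.push('exit ' + current.name)
})

console.log(steps.join(', '))
// enter A, enter B, enter C, exit C, exit B, enter D, exit D, exit A
```

Reading only the enter steps gives the preorder sequence (A, B, C, D), and reading only the exit steps gives the postorder sequence (C, B, D, A).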

Collective

unified was originally created by Titus Wormer. It's now governed by a collective which handles the many GitHub organizations, repositories, and packages that are part of the greater unified ecosystem.

The collective and its governance won't be addressed in this handbook. If you're interested, you can read more about the collective on GitHub.

Authors

Additional resources

Acknowledgements

This handbook is inspired by the babel-handbook written by James Kyle.

License

MIT

Notes