jgm / cheapskate

Experimental markdown processor in Haskell
BSD 3-Clause "New" or "Revised" License
107 stars 18 forks source link

Cheapskate

Note: This library is unmaintained (by me anyway). I recommend using commonmark.

This is an experimental Markdown processor in pure Haskell. (A cheapskate is always in search of the best markdown.) It aims to process Markdown efficiently and in the most forgiving possible way. It is about seven times faster than pandoc and uses a fifth the memory. It is also faster, and considerably more accurate, than the markdown package on Hackage.

There is no such thing as an invalid Markdown document. Any string of characters is valid Markdown. So the processor should finish efficiently no matter what input it gets. Garbage in should not cause an error or exponential slowdowns. This processor has been tested on many large inputs consisting of random strings of characters, with performance that is consistently linear with the input size. (Try make fuzztest.)

Installing

To build, get the Haskell Platform, then:

cabal update && cabal install

This will install both the cheapskate executable and the Haskell library. A man page can be found in man/man1 in the source.

Usage

As an executable:

cheapskate [FILE*]

As a library:

import Cheapskate
import Text.Blaze.Html

toMarkdown :: Text -> Html
toMarkdown = toHtml . markdown def

If the markdown input you are converting comes from an untrusted source (e.g. a web form), you should always set sanitize to True. This causes the generated HTML to be filtered through xss-sanitize's sanitizeBalance function. Otherwise you risk a XSS attack from raw HTML or a markdown link or image attribute attribute.

You may also wish to disallow users from entering raw HTML for aesthetic, rather than security reasons. In that case, set allowRawHtml to False, but let sanitize stay True, since it still affects attributes coming from markdown links and images.

Manipulating the parsed document

You can manipulate the parsed document before rendering using the walk and walkM functions. For example, you might want to highlight code blocks using highlighting-kate:

import Data.Text as T
import Data.Text.Lazy as TL
import Cheapskate
import Text.Blaze.Html
import Text.Blaze.Html.Renderer.Text
import Text.Highlighting.Kate

markdownWithHighlighting :: Text -> Html
markdownWithHighlighting = toHtml . walk addHighlighting . markdown def

addHighlighting :: Block -> Block
addHighlighting (CodeBlock (CodeAttr lang _) t) =
  HtmlBlock (T.concat $ TL.toChunks
             $ renderHtml $ toHtml
             $ formatHtmlBlock defaultFormatOpts
             $ highlightAs (T.unpack lang) (T.unpack t))
addHighlighting x = x

Extensions

This processor adds the following Markdown extensions:

Hyperlinked URLs

All absolute URLs are automatically made into hyperlinks, where inside <> or not.

Fenced code blocks

Fenced code blocks with attributes are allowed. These begin with a line of three or more backticks or tildes, followed by an optional language name and possibly other metadata. They end with a line of backticks or tildes (the same character as started the code block) of at least the length of the starting line.

Explicit hard line breaks

A hard line break can be indicated with a backslash before a newline. The standard method of two spaces before a newline also works, but this gives a more "visible" alternative.

Backslash escapes

All ASCII symbols and punctuation marks can be backslash-escaped, not just those with a use in Markdown.

Revisions

In departs from the markdown syntax document in the following ways:

Intraword emphasis

Underscores cannot be used for word-internal emphasis. This prevents common mistakes with filenames, usernames, and indentifiers. Asterisks can still be used if word internal emphasis is needed.

The exact rule is this: an underscore that appears directly after an alphanumeric character does not begin an emphasized span. (However, an underscore directly before an alphanumeric can end an emphasized span.)

Ordered lists

The starting number of an ordered list is now significant. Other numbers are ignored, so you can still use 1. for each list item.

In addition to the 1. form, you can use 1) in your ordered lists. A new list starts if you change the form of the delimiter. So, the following is two lists:

1. one
2. two
1) one
2) two

Bullet lists

A new bullet lists starts if you change the bullet marker. So, the following is two consecutive bullet lists:

+ one
+ two
- one
- two

List separation

Two blank lines breaks out of a list. This allows you to have consecutive lists:

- one

- two

- one (new list)

The blank lines break out of a list no matter how deeply it is nested:

- one
  - two
    - three

  - new top-level list

Indentation of list continuations

Block elements inside list items need not be indented four spaces. If they are indented beyond the bullet or numerical list marker, they will be considered additional blocks inside the list item. So, the following is a list item with two paragraphs:

- one

 two

The amount of indentation required for an indented code block inside a list item depends on the first line of the list item. Generally speaking, code must be indented four spaces past the first non-space character after the list marker. Thus:

 -   My code

         {code here}

 - My code

       {code here}

The following diagram shows how the first line of a list item divides the following lines into three regions:

 -   My code
  |     |
  +-----+

Content to the left of the marked region will not be part of the list item. Content to the right of the marked region will be indented code under the list item. Regular blocks that belong under the list item should start inside the marked region.

When the first line itself contains indented code, this code and subsequent indented code blocks should be indented five spaces past the list marker:

 -     { code }

       { more code }

Raw HTML blocks

Raw HTML blocks work a bit differently than in Markdown.pl. A raw HTML block starts with a block-level HTML tag (opening or closing), or a comment start <!-- or end -->, and goes until the next blank line. The whole block is included as raw HTML. No attempt is made to parse balanced tags. This means that in the following, the asterisks are literal asterisks:

<div>
*hello*
</div>

while in the following, the asterisks are interpreted as markdown emphasis:

<div>

*hello*

</div>

In the first example, we have a single raw HTML block; in the second, we have two raw HTML blocks with an intervening paragraph. This system provides flexibility to authors to use enclose markdown sections in html block-level tags if they wish, while also allowing them to include verbatim HTML blocks (taking care that the don't include any blank lines).

As a consequence of this rule, HTML blocks may not contain blank lines.

Clarifications

This implementation resolves the following issues left vague in the markdown syntax document:

Tight vs. loose lists

A list is considered "tight" if (a) it has only one item or there is no blank space between any two consecutive items, and (b) no item has blank lines as its immediate children. If a list is "tight," then list items consisting of a single paragraph or a paragraph followed by a sublist will be rendered without <p> tags.

Sublists

Sublists work like other block elements inside list items; they must be indented past the bullet or numerical list marker (but no more than three spaces past, or they will be interpreted as indented code).

ATX headers

ATX headers must have a space after the initial ###s.

Separation of block quotes

A blank line will end a blockquote. So, the following is a single blockquote:

> hi
>
> there

But this is two blockquotes:

> hi

> there

Blank lines are not required before horizontal rules, blockquotes, lists, code blocks, or headers. They are not required after, either, though in many cases "laziness" will effectively require a blank line after. For example, in

Hello there.
> A quote.
Still a quote.

the "Still a quote." is part of the block quote, because of laziness (the ability to leave off the > from the beginning of subsequent lines). Laziness also affects lists. However, we can have a code block, ATX header, or horizontal rule between two paragraphs without any blank lines.

Link references

Link references may occur anywhere in the document, even in nested list contexts. They need not be at the outer level.

Tests

The tests subdirectory contains an extensive suite of tests, including all of John Gruber's original Markdown tests, plus many of the tests from Michel Fortin's mdtest suite. Each test consists in two files with the same basename, a markdown source and an expected HTML output.

To run the test suite, do

make test

To run only tests that match a regex pattern, do

PATT=Orig make test

Setting the environment variable TIDY=1 will run the expected and actual output through tidy before comparing them. You can run this test suite on another markdown processor by doing

PROG=myothermarkdown make test

Benchmarks

To run a crude benchmark comparing cheapskate to pandoc, do make bench. Set the BENCHPROGS environment variable to compare to other implementations.

License

Copyright © 2012, 2013, 2014 John MacFarlane.

The library is released under the BSD license; see LICENSE for terms.

Some of the test cases are borrowed from Michel Fortin's mdtest suite and John Gruber's original markdown test suite.