mity / md4c

C Markdown parser. Fast. SAX-like interface. Compliant to CommonMark specification.
MIT License
755 stars 138 forks source link
c commonmark markdown markdown-parser mit-license parser

MD4C Readme

MD4C stands for "Markdown for C" and that's exactly what this project is about.

What is Markdown

In short, Markdown is the markup language this README.md file is written in.

The following resources can explain more if you are unfamiliar with it:

What is MD4C

MD4C is Markdown parser implementation in C, with the following features:

Using MD4C

Parsing Markdown

If you need just to parse a Markdown document, you need to include md4c.h and link against MD4C library (-lmd4c); or alternatively add md4c.[hc] directly to your code base as the parser is only implemented in the single C source file.

The main provided function is md_parse(). It takes a text in the Markdown syntax and a pointer to a structure which provides pointers to several callback functions.

As md_parse() processes the input, it calls the callbacks (when entering or leaving any Markdown block or span; and when outputting any textual content of the document), allowing application to convert it into another format or render it onto the screen.

Converting to HTML

If you need to convert Markdown to HTML, include md4c-html.h and link against MD4C-HTML library (-lmd4c-html); or alternatively add the sources md4c.[hc], md4c-html.[hc] and entity.[hc] into your code base.

To convert a Markdown input, call md_html() function. It takes the Markdown input and calls the provided callback function. The callback is fed with chunks of the HTML output. Typical callback implementation just appends the chunks into a buffer or writes them to a file.

Markdown Extensions

The default behavior is to recognize only Markdown syntax defined by the CommonMark specification.

However, with appropriate flags, the behavior can be tuned to enable some extensions:

Few features of CommonMark (those some people see as mis-features) may be disabled with the following flags:

Input/Output Encoding

The CommonMark specification declares that any sequence of Unicode code points is a valid CommonMark document.

But, under a closer inspection, Unicode plays any role in few very specific situations when parsing Markdown documents:

  1. For detection of word boundaries when processing emphasis and strong emphasis, some classification of Unicode characters (whether it is a whitespace or a punctuation) is needed.

  2. For (case-insensitive) matching of a link reference label with the corresponding link reference definition, Unicode case folding is used.

  3. For translating HTML entities (e.g. &) and numeric character references (e.g. # or ಫ) into their Unicode equivalents.

    However note MD4C leaves this translation on the renderer/application; as the renderer is supposed to really know output encoding and whether it really needs to perform this kind of translation. (For example, when the renderer outputs HTML, it may leave the entities untranslated and defer the work to a web browser.)

MD4C relies on this property of the CommonMark and the implementation is, to a large degree, encoding-agnostic. Most of MD4C code only assumes that the encoding of your choice is compatible with ASCII. I.e. that the codepoints below 128 have the same numeric values as ASCII.

Any input MD4C does not understand is simply seen as part of the document text and sent to the renderer's callback functions unchanged.

The two situations (word boundary detection and link reference matching) where MD4C has to understand Unicode are handled as specified by the following preprocessor macros (as specified at the time MD4C is being built):

Documentation

The API of the parser is quite well documented in the comments in the md4c.h. Similarly, the markdown-to-html API is described in its header md4c-html.h.

There is also project wiki which provides some more comprehensive documentation. However note it is incomplete and some details may be somewhat outdated.

FAQ

Q: How does MD4C compare to other Markdown parsers?

A: Some other implementations combine Markdown parser and HTML generator into a single entangled code hidden behind an interface which just allows the conversion from Markdown to HTML. They are often unusable if you want to process the input in any other way.

Second, most parsers (if not all of them; at least within the scope of C/C++ language) are full DOM-like parsers: They construct abstract syntax tree (AST) representation of the whole Markdown document. That takes time and it leads to bigger memory footprint.

Building AST is completely fine as long as you need it. If you don't, there is a very high chance that using MD4C will be substantially faster and less hungry in terms of memory consumption.

Last but not least, some Markdown parsers are implemented in a naive way. When fed with a smartly crafted input pattern, they may exhibit quadratic (or even worse) parsing times. What MD4C can still parse in a fraction of second may turn into long minutes or possibly hours with them. Hence, when such a naive parser is used to process an input from an untrusted source, the possibility of denial-of-service attacks becomes a real danger.

A lot of our effort went into providing linear parsing times no matter what kind of crazy input MD4C parser is fed with. (If you encounter an input pattern which leads to a sub-linear parsing times, please do not hesitate and report it as a bug.)

Q: Does MD4C perform any input validation?

A: No. And we are proud of it. :-)

CommonMark specification states that any sequence of Unicode characters is a valid Markdown document. (In practice, this more or less always means UTF-8 encoding.)

In other words, according to the specification, it does not matter whether some Markdown syntax construction is in some way broken or not. If it's broken, it won't be recognized and the parser should see it just as a verbatim text.

MD4C takes this a step further: It sees any sequence of bytes as a valid input, following completely the GIGO philosophy (garbage in, garbage out). I.e. any ill-formed UTF-8 byte sequence will propagate to the respective callback as a part of the text.

If you need to validate that the input is, say, a well-formed UTF-8 document, you have to do it on your own. The easiest way how to do this is to simply validate the whole document before passing it to the MD4C parser.

License

MD4C is covered with MIT license, see the file LICENSE.md.

Links to Related Projects

Ports and bindings to other languages:

Software using MD4C: