This patch adds a number of functions that can be called on the hoedown document from the callbacks of a hoedown renderer, to get additional context. The motivation for most of these is to provide enough information to reconstruct the essence of the original Markdown, so that Markdown to Markdown renderers can be written to do things like simple transformations or formatting.
Some of this context would have been better provided as arguments to the relevant callbacks, but given that renderers out of my control have been written against the callback API, I decided it was easiest to add these as functions on the hoedown document that is available to the callbacks.
This patch was created while working for Google. I'm required by Google policy to add a Google copyright line in order to contribute, reflecting its contributions. Hope this isn't an issue.
Added functions
hoedown_document_link_id
This function is used in the context of link, image, and footnote references (important: NOT reference definitions). It provides a hoedown_buffer containing the id of the reference. For example, for "[foo][bar]\n\n[bar]: /url/" this would provide "bar" while parsing "[foo][bar]". This is important so that we can reproduce links as they were originally written. The existing link callback only provides enough information to produce inline links.
hoedown_document_footnote_id
This function is used in the context of footnote definitions (i.e. from any callback originating from text inside a footnote definition). It provides a hoedown_buffer containing the id of the defined footnote. This is important in order to reproduce the original footnote definition (since the footnote_def callback is only passed a number, not an id), as well as to allow callbacks within the footnote to know how large the footnote prefix is (for example, to allow wrapping of paragraphs, taking into account the length of the id).
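To illustrate the prefix-length point, here is a minimal sketch of what a renderer might do with the footnote id once it has it as a C string. The helper names are hypothetical and not part of this patch; the sketch only assumes the usual "[^id]: " definition syntax.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: width in columns of the "[^id]: " prefix that opens
 * a footnote definition. A wrapper can subtract this from the line width. */
size_t footnote_prefix_width(const char *id)
{
    assert(id != NULL);
    return strlen("[^") + strlen(id) + strlen("]: ");
}

/* Hypothetical helper: writes "[^id]: " into out.
 * Returns 0 on success, -1 if out is too small. */
int render_footnote_prefix(char *out, size_t cap, const char *id)
{
    assert(out != NULL && id != NULL);
    int n = snprintf(out, cap, "[^%s]: ", id);
    return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}
```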
hoedown_document_is_escaped
This function is used in the context of normal text. It returns 1 if the currently processed buffer is escaped (i.e. was preceded by "\" in the original document), and 0 otherwise. To understand why this is necessary, consider this Markdown snippet: "\==a=b==". The initial backslash is necessary because it prevents highlighting and renders literal equals signs instead. However, none of the other "="s need to be escaped. Here's what's passed to NormalText (which is called 7 times):
""
"="
""
"=a"
"=b"
"="
"="
It's not immediately clear from this which "=" needs to be escaped in order to render the original. You could probably just escape them all and have it render to the same thing (although this assumption may not hold in all cases), but this would create the godawful "\=\=a\=b\=\=" string. The original is much nicer.
In theory, you could implement a minimalist context-aware escaper at some higher level (such as block-level elements) that understands enough about escaping to know that only the first "=" needs to be escaped, but this is hard and error-prone in general.
I decided the best signal of what needs to be escaped is simply what the original document had escaped.
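As a rough sketch of how a Markdown-to-Markdown renderer could use this signal: the helpers below are hypothetical, and the per-chunk flags are my assumption of what the function would report for the seven chunks in the example (escaped only for the second one).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: appends one normal-text chunk to a NUL-terminated
 * output string, putting the backslash back when the chunk was escaped.
 * Returns 0 on success, -1 if out is too small. */
int append_text(char *out, size_t cap, const char *chunk, int is_escaped)
{
    assert(out != NULL && chunk != NULL);
    size_t used = strlen(out);
    size_t need = strlen(chunk) + (is_escaped ? 1 : 0);
    if (used + need + 1 > cap)
        return -1;
    if (is_escaped)
        out[used++] = '\\';
    strcpy(out + used, chunk);
    return 0;
}

/* Rebuilds the example from its seven chunks. The flags are an assumption:
 * only the second chunk is taken to be reported as escaped. */
int rebuild_example(char *out, size_t cap)
{
    static const char *chunks[] = { "", "=", "", "=a", "=b", "=", "=" };
    static const int escaped[]  = { 0,  1,   0,  0,    0,    0,   0  };
    assert(out != NULL && cap > 0);
    out[0] = '\0';
    for (size_t i = 0; i < sizeof chunks / sizeof chunks[0]; i++)
        if (append_text(out, cap, chunks[i], escaped[i]) != 0)
            return -1;
    return 0;
}
```

Under that assumption the output is the original snippet again, with no extra backslashes.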
hoedown_document_header_type
This function is used in the context of headers. This returns an enum value indicating which type of header (setext or atx) this is. This is necessary to reproduce the style used in the original document.
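A minimal sketch of a header callback body that honors the reported style. The enum and function names here are placeholders rather than the ones added by the patch, and falling back to atx for levels deeper than 2 is my own choice, since setext only covers levels 1 and 2.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Placeholder for the enum returned by the header-type query. */
enum header_style { HEADER_ATX, HEADER_SETEXT };

/* Hypothetical helper: writes a header in the requested style.
 * Returns 0 on success, -1 if out is too small. */
int render_header(char *out, size_t cap, enum header_style style,
                  int level, const char *text)
{
    assert(out != NULL && text != NULL);
    assert(level >= 1 && level <= 6);
    size_t len = strlen(text);

    if (style == HEADER_SETEXT && level <= 2) {
        /* text, newline, underline as long as the text, newline, NUL */
        if (2 * len + 3 > cap)
            return -1;
        memcpy(out, text, len);
        out[len] = '\n';
        memset(out + len + 1, level == 1 ? '=' : '-', len);
        out[2 * len + 1] = '\n';
        out[2 * len + 2] = '\0';
        return 0;
    }

    /* atx: one '#' per level, space, text, newline, NUL */
    if ((size_t)level + len + 3 > cap)
        return -1;
    memset(out, '#', (size_t)level);
    out[level] = ' ';
    memcpy(out + level + 1, text, len);
    out[level + 1 + len] = '\n';
    out[level + 2 + len] = '\0';
    return 0;
}
```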
hoedown_document_link_type
This function is used in the context of links. This returns an enum value indicating the type of link this is. The options are inline (e.g. "[foo](/url/)"), reference (e.g. "[foo][bar]"), shortcut (e.g. "[foo]"), and empty reference (e.g. "[foo][]"). This is necessary to easily reproduce a semantically equivalent document to the original, because link is not passed enough information to do this, even with the aid of hoedown_document_link_id, since [foo] and [foo][] have the same id ("foo") but are not semantically equivalent (they have different behavior when followed by "(bar)", for example).
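To make the four cases concrete, here is a sketch of how a link callback might combine the link type with the link id to write the link back out in its original form. The enum constants and the function are illustrative names I made up for this description, not the identifiers in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Illustrative stand-in for the enum returned by the link-type query. */
enum link_kind {
    LINK_INLINE,          /* [foo](/url/) */
    LINK_REFERENCE,       /* [foo][bar]   */
    LINK_SHORTCUT,        /* [foo]        */
    LINK_EMPTY_REFERENCE  /* [foo][]      */
};

/* Hypothetical helper: writes the link in the form the source used.
 * url is only read for inline links, id only for reference links.
 * Returns 0 on success, -1 if out is too small or kind is unknown. */
int render_link(char *out, size_t cap, enum link_kind kind,
                const char *text, const char *url, const char *id)
{
    int n = -1;
    assert(out != NULL && text != NULL);
    switch (kind) {
    case LINK_INLINE:
        assert(url != NULL);
        n = snprintf(out, cap, "[%s](%s)", text, url);
        break;
    case LINK_REFERENCE:
        assert(id != NULL);
        n = snprintf(out, cap, "[%s][%s]", text, id);
        break;
    case LINK_SHORTCUT:
        n = snprintf(out, cap, "[%s]", text);
        break;
    case LINK_EMPTY_REFERENCE:
        n = snprintf(out, cap, "[%s][]", text);
        break;
    }
    return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}
```

Without the type, the shortcut and empty-reference cases would be indistinguishable here, since both carry the id "foo".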
hoedown_document_list_depth
This function is used in the context of any callback. It returns the current depth of the list containing the element that triggered the callback. For example, in this Markdown:
a
* b
  * c
“a” would have list depth 0, “b” would have list depth 1, and “c” would have list depth 2.
This is important because it allows depth-aware wrapping of e.g. paragraphs, and makes possible some implementations of depth-aware indentation of list items (although it’s not strictly required for this).
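One possible use of the depth, sketched with hypothetical helpers: indent each list item by its depth and shrink the wrap width accordingly. The indent width per level is a parameter because it is an assumption about the output style, not something the patch dictates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: columns left for text at the given list depth,
 * never less than 1. */
size_t list_wrap_width(size_t line_width, unsigned depth,
                       unsigned indent_per_level)
{
    size_t used = (size_t)depth * indent_per_level;
    return used < line_width ? line_width - used : 1;
}

/* Hypothetical helper: writes one list item line. depth is 1 for a
 * top-level item, matching the "b"/"c" example above.
 * Returns 0 on success, -1 if out is too small. */
int render_list_item(char *out, size_t cap, unsigned depth, char marker,
                     const char *text, unsigned indent_per_level)
{
    assert(out != NULL && text != NULL);
    assert(depth >= 1);
    assert(marker == '*' || marker == '+' || marker == '-');
    int indent = (int)((depth - 1) * indent_per_level);
    /* "%*s" with an empty string emits exactly `indent` spaces. */
    int n = snprintf(out, cap, "%*s%c %s\n", indent, "", marker, text);
    return (n >= 0 && (size_t)n < cap) ? 0 : -1;
}
```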
hoedown_document_blockquote_depth
This function is used in the context of any callback. It returns the current depth of the blockquote containing the element that triggered the callback. For example, in this Markdown:
a
> b
>
> > c
“a” would have blockquote depth 0, “b” would have blockquote depth 1, and “c” would have blockquote depth 2.
This is important because it allows depth-aware wrapping of e.g. paragraphs.
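For instance, a paragraph callback could derive both the line prefix and the remaining width from the depth. This is only a sketch with made-up helper names, and it assumes the output uses "> " (two columns) per blockquote level.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: writes `depth` copies of "> " into out.
 * Returns 0 on success, -1 if out is too small. */
int blockquote_prefix(char *out, size_t cap, unsigned depth)
{
    assert(out != NULL);
    size_t need = (size_t)depth * 2;
    if (need + 1 > cap)
        return -1;
    for (size_t i = 0; i < depth; i++) {
        out[2 * i] = '>';
        out[2 * i + 1] = ' ';
    }
    out[need] = '\0';
    return 0;
}

/* Hypothetical helper: columns left for text once the prefix is accounted
 * for, never less than 1. */
size_t blockquote_wrap_width(size_t line_width, unsigned depth)
{
    size_t used = (size_t)depth * 2;
    return used < line_width ? line_width - used : 1;
}
```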
hoedown_document_ul_item_char
This function is used in the context of a list item callback. It returns the character used for the unordered list (“+”, “*”, or “-”). This is important because it allows reproducing the original document more faithfully.
hoedown_document_hrule_char
This function is used in the context of an hrule callback. It returns the character used for the hrule (“-”, “*”, or “_”). This is important because it allows reproducing the original document more faithfully.
hoedown_document_fencedcode_char
This function is used in the context of a code block callback. It returns the character used for the fenced code block (“`”, “~”, or 0 if it’s not a fenced code block). This is important because it allows reproducing the original document more faithfully, and makes it much, much easier to avoid producing invalid output (e.g. you can nest “```” inside a “~” codeblock, so it’s important to know that the outer codeblock uses “~”).
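Here is a sketch of how the fence character could be used when writing a code block back out. The helpers are hypothetical. Lengthening the fence when the code contains a run of the same character is my own addition, and it assumes the consuming parser accepts fences longer than three characters (CommonMark does). The code text is assumed to end with a newline.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: a fence length that cannot be closed early by the
 * code itself: at least 3, and longer than any run of fence_char in code
 * that is itself 3 or more characters long. */
size_t fence_length(const char *code, char fence_char)
{
    assert(code != NULL);
    assert(fence_char == '`' || fence_char == '~');
    size_t longest = 0, run = 0;
    for (const char *p = code; *p != '\0'; p++) {
        run = (*p == fence_char) ? run + 1 : 0;
        if (run > longest)
            longest = run;
    }
    return longest >= 3 ? longest + 1 : 3;
}

/* Hypothetical helper: writes fence, code, fence.
 * Returns 0 on success, -1 if out is too small. */
int render_fenced_block(char *out, size_t cap, const char *code,
                        char fence_char)
{
    assert(out != NULL && code != NULL);
    size_t n = fence_length(code, fence_char);
    size_t len = strlen(code);
    /* fence, newline, code, fence, newline, NUL */
    if (2 * n + len + 3 > cap)
        return -1;
    char *p = out;
    memset(p, fence_char, n); p += n; *p++ = '\n';
    memcpy(p, code, len);     p += len;
    memset(p, fence_char, n); p += n; *p++ = '\n';
    *p = '\0';
    return 0;
}
```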
Added callbacks
ref
This callback is called when a link reference definition is parsed. It’s passed a hoedown_buffer containing the original text, along with an offset and size into that hoedown_buffer. This is important because there is no other way for a hoedown_renderer to see link reference definitions (the renderer basically sees all links as inline links). This allows these definitions to be reproduced in the output.
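In the simplest case, a renderer could just copy that slice of the original text into its output. The sketch below uses a plain character pointer and length as a stand-in for the hoedown_buffer, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the `size` bytes starting at `offset` of the
 * original text into out and NUL-terminates them.
 * Returns 0 on success, -1 if the slice is out of range or out is too small. */
int copy_definition(char *out, size_t cap, const char *doc, size_t doc_len,
                    size_t offset, size_t size)
{
    assert(out != NULL && doc != NULL);
    if (offset > doc_len || size > doc_len - offset)
        return -1;
    if (size + 1 > cap)
        return -1;
    memcpy(out, doc + offset, size);
    out[size] = '\0';
    return 0;
}
```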
footnote_ref_def
This callback is called when a footnote reference definition is parsed. It’s passed a hoedown_buffer containing the original text, along with an offset and size into that hoedown_buffer. This is important because there is no other way for a hoedown_renderer to see duplicate footnote reference definitions (the footnote_def callback is only called for the first footnote definition for each id). Duplicates are usually a user error, but it’s nice to be able to reproduce them anyway.
Testing
These additions do not make a difference to the ordinary HTML renderer, so they are tested slightly differently. I've implemented a "context_test" renderer that renders to text information from the functions and callbacks above. Tests corresponding to this reside in test/Tests/context.