mity / md4c

C Markdown parser. Fast. SAX-like interface. Compliant to CommonMark specification.
MIT License

Feature suggestion: Wiki style links #90

Closed niblo closed 4 years ago

niblo commented 5 years ago

Wiki links look like [[a link]] or [[1234|some text]]. The link target of the resulting link is dependent on the context (see Wikipedia for example).

The first is the form [[abc]] with just the link destination. The second form [[abc|def]] has a text def. The content of that text could be defined as the content of a link text as per the CommonMark specification. A possible exception could be that it may contain single square brackets (but obviously not two consecutive square brackets, which would close the wiki link). If the wiki link contains multiple |, only the first is interpreted as the delimiter between the link destination and the text. The rest are interpreted as literal |.

A wiki link destination is a sequence of one or more (limits?) characters appearing inside a wiki link, before the first |. If the destination is one character, that character may not be whitespace. The destination may contain one line break.
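The splitting rule described above (only the first | separates destination and text, later | are literal) could be sketched roughly as follows. This is an illustration only: `wikilink_parts` and `wikilink_split` are hypothetical names, not part of MD4C, and the whitespace check is an assumed reading of "may not be whitespace".

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper type, not part of MD4C: the two parts of a wiki
 * link body, i.e. of the text between "[[" and "]]". */
typedef struct {
    const char *dest;   /* start of the link destination */
    size_t dest_len;    /* length of the destination */
    const char *text;   /* start of the link text, or NULL if there is no '|' */
    size_t text_len;    /* length of the link text */
} wikilink_parts;

/* Split `body` (of length `len`) at the first '|'. Any later '|' stays in
 * the text as a literal character. Returns 0 on success, or -1 if the
 * destination is empty or consists of a single whitespace character. */
static int wikilink_split(const char *body, size_t len, wikilink_parts *out)
{
    const char *bar = memchr(body, '|', len);
    size_t dest_len = bar ? (size_t)(bar - body) : len;

    if (dest_len == 0)
        return -1;
    if (dest_len == 1 && (body[0] == ' ' || body[0] == '\t' || body[0] == '\n'))
        return -1;

    out->dest = body;
    out->dest_len = dest_len;
    out->text = bar ? bar + 1 : NULL;
    out->text_len = bar ? len - dest_len - 1 : 0;
    return 0;
}
```

For the body `1234|some|text` this yields the destination `1234` and the text `some|text`.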

With the feature enabled, the wiki link syntax takes precedence over the regular link syntax. The following is example 555 from the CommonMark spec:

[[*foo* bar]]

[*foo* bar]: /url "title"

The above should render as <p>[<a href="/url" title="title"><em>foo</em> bar</a>]</p>. With the wiki link feature enabled, the (regular) link would not be recognized at all, and would instead be recognized as a wiki link with the destination *foo* bar.

@mity posed some good questions in response to an e-mail I sent him regarding this suggestion. After revising it I hope that most of the questions have been answered. What remains is the possibility of unintentional interaction with the GitHub style table syntax that makes use of |.

mity commented 5 years ago

Let me place here some of my concerns, which IMHO need some clarification or deeper consideration.

Nested inline elements

Consider this

[[**foo** *bar*]]

Should the wiki link ID be the verbatim **foo** *bar* or foo bar? (I tend to vote for foo bar.)

Should the wiki link text be the verbatim **foo** *bar* or should it be translated into <b>foo</b> <em>bar</em>? (I tend to vote for <b>foo</b> <em>bar</em>)

This may have a strong impact on the API of the new feature. E.g. should the text be provided just as a member of a new wikilink detail structure, or should it be treated as "enter span/leave span", which can contain arbitrary nested span and text callback calls, as e.g. the inline link allows?

Priority

Current priority list for the inline stuff goes as follows:

  1. backslash escapes (e.g. \*), code spans (`foo`), autolink (<http://example.com>), inline raw HTML, NULL-character handling (this very tight stuff is handled on the fly directly in md_collect_marks(); the others below later in md_analyze_inlines())
  2. Entities (&nbsp;)
  3. Tables (breaking table rows into cells on |)
  4. Inline links (e.g. [foo](http://example.com) ) and reference links ([id], [id][], [foo][id]).
  5. Emphasis (*foo*, _foo_), strong emphasis (**foo**, __foo__), strikethrough (~foo~), LaTeX math ($foo$, $$foo$$), permissive autolinks (http://example.com, foo@example.com).

(Lower priority stuff never "crosses" any stuff already resolved in higher priority steps, so in something with the logical structure of <a><b></a></b> only the <a>...</a> or <b>...</b> is recognized as valid Markdown syntax, depending on which has the higher priority. If the priority is the same, then <a>...</a> is used, as it is seen earlier in the left-to-right scan. The low priority stuff is then treated as verbatim text unless it can still be resolved later without the "crossing", so the low priority can still encircle the high priority as in <low><high></invalidlow></high></low>.)
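The "no crossing" rule can be illustrated with a small sketch. It assumes candidate spans are already sorted by priority level and then by start offset. The `span` type and both functions are hypothetical and much simpler than what md_analyze_inlines() really does.

```c
#include <stddef.h>

/* Hypothetical candidate span over the half-open byte range [beg, end). */
typedef struct {
    size_t beg, end;
    int level;      /* priority level, 1 = highest */
    int accepted;   /* output: 1 if the span survives resolution */
} span;

/* Two spans cross if they overlap but neither contains the other. */
static int spans_cross(const span *a, const span *b)
{
    int overlap = a->beg < b->end && b->beg < a->end;
    int a_in_b = b->beg <= a->beg && a->end <= b->end;
    int b_in_a = a->beg <= b->beg && b->end <= a->end;
    return overlap && !a_in_b && !b_in_a;
}

/* `s` must be sorted by (level, beg). A span is accepted unless it crosses
 * a span accepted earlier, i.e. one with higher priority or, at the same
 * priority, one that starts further to the left. */
static void resolve_spans(span *s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        s[i].accepted = 1;
        for (size_t j = 0; j < i; j++) {
            if (s[j].accepted && spans_cross(&s[i], &s[j])) {
                s[i].accepted = 0;
                break;
            }
        }
    }
}
```

With a high priority span at [5, 20), a low priority span at [0, 15) is rejected because it crosses it, while a low priority span at [0, 30) is accepted because it merely encircles it.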

My gut feeling says wiki links should have the same priority as the other links (right now, 4), so ordinary links and wikilinks are resolved in the left-to-right fashion. But the problem would then be that tables could not contain a wikilink with a pipe ([[foo|bar]]), as the pipe would rather be seen as a table cell delimiter.

If we want to allow that, we would have to add the wikilinks to level (2) instead.

Or maybe we might consider some priority shuffling. That can generally be tricky due to the limitations imposed by the CommonMark standard, but swapping levels 3 and 4 might be relatively safe, as tables are an extension. At least, I can foresee a change of behavior only in inputs similar to this:

A | B
---|---
[xxx|yyy](http://url)   .... an inline link or two table cells?

Hopefully, that can be seen as a corner case.

mity commented 5 years ago

Conflict

Consider

[[foo](url)

Should it be translated into [<a href="url">foo</a> or should we rather say that [[ and ] do not match and leave it as it is?

(No idea what is better here.)

niblo commented 5 years ago

[[**foo** *bar*]]

Should the wiki link ID be the verbatim **foo** *bar* or foo bar? (I tend to vote for foo bar.)

It's a good point. I guess there is no issue for a caller of md4c as long as md4c can provide the link ID, stripped of any markup. On the other hand, for external tools that need to integrate and interact with the raw Markdown text, it makes things simpler not having to parse the contents of that ID; it makes it unambiguous what the ID is. My opinion is, if you want markup on the link, use the link text/label for that.

Should the wiki link text be the verbatim **foo** *bar* or should it be translated into <b>foo</b> <em>bar</em>? (I tend to vote for <b>foo</b> <em>bar</em>)

This may have a strong impact on the API of the new feature. E.g. should the text be provided just as a member of a new wikilink detail structure, or should it be treated as "enter span/leave span", which can contain arbitrary nested span and text callback calls, as e.g. the inline link allows?

Yes, just like the CommonMark specification says about link text: A link text consists of a sequence of zero or more inline elements [...]. But there is some limit, surely?

Consider [[foo](url). Should it be translated into [<a href="url">foo</a> or should we rather say that [[ and ] do not match and leave it as it is?

I think the former; a regular link with a [ before it.

Your writeup on the priority for inlines was interesting. I will use it as a guiding map when I look through the code.

mity commented 5 years ago

On the other hand, for external tools that need to integrate and interact with the raw Markdown text, it makes things simpler not having to parse the contents of that ID;

Do you mean something like a simple grep or sed scanning the document for the list of IDs? That cannot work reliably no matter what we do. You have to parse the Markdown as Markdown to make it reliable. Consider that there may be a code block or code span which may contain anything, including what may otherwise look like a wikilink or any other snippet of Markdown syntax.

But there is some limit, surely?

No limit in terms of length or count. The only limit is that the link body may only contain balanced pairs of (unescaped) [ ... ]. To avoid non-linear runtime behavior, if that's what you are afraid of here, the parser maintains a stack of the unresolved openers [ and tries to resolve them when it reaches the corresponding ], so all the links are handled in one forward scan over the [ and ] marks collected in md_collect_marks().

See https://spec.commonmark.org/0.29/#link-text and https://spec.commonmark.org/0.29/#phase-2-inline-structure (the section An algorithm for parsing nested emphasis and links)
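The single forward scan with a stack of unresolved openers can be sketched as below. This is a simplified illustration over raw characters, not MD4C's actual code: `match_brackets` and `MAX_OPENERS` are hypothetical, the stack has a fixed size, and any character after a backslash is treated as escaped (CommonMark only allows escaping ASCII punctuation).

```c
#include <stddef.h>

#define MAX_OPENERS 64   /* arbitrary limit for this sketch */

/* Scan `s` once, left to right, and record the offsets of matching
 * [ ... ] pairs in `pairs` (at most `max_pairs` of them). Returns the
 * number of pairs recorded. Inner pairs are recorded before outer ones. */
static size_t match_brackets(const char *s, size_t len,
                             size_t (*pairs)[2], size_t max_pairs)
{
    size_t stack[MAX_OPENERS];   /* offsets of unresolved '[' */
    size_t depth = 0, n = 0;

    for (size_t i = 0; i < len; i++) {
        if (s[i] == '\\') {      /* skip the escaped character */
            i++;
            continue;
        }
        if (s[i] == '[') {
            if (depth == MAX_OPENERS)
                break;           /* give up on very deep nesting */
            stack[depth++] = i;
        } else if (s[i] == ']' && depth > 0) {
            size_t open = stack[--depth];
            if (n < max_pairs) {
                pairs[n][0] = open;
                pairs[n][1] = i;
                n++;
            }
        }
    }
    return n;
}
```

Each bracket is pushed or popped at most once, so the work stays linear in the input length regardless of nesting depth.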

Your writeup on the priority for inlines was interesting. I will use it as a guiding map when I look through the code.

See also https://talk.commonmark.org/t/why-is-md4c-so-fast-c/2520/2?u=mity (especially point 4). It is a little bit outdated (the simple high-priority stuff was moved directly into md_collect_marks(), so the description remains valid only for the priority levels >= 2).

niblo commented 5 years ago

On the other hand, for external tools that need to integrate and interact with the raw Markdown text, it makes things simpler not having to parse the contents of that ID;

Do you mean something like a simple grep or sed scanning the document for the list of IDs? That cannot work reliably no matter what we do. You have to parse the Markdown as Markdown to make it reliable. Consider that there may be a code block or code span which may contain anything, including what may otherwise look like a wikilink or any other snippet of Markdown syntax.

I was thinking about tighter integration, like putting the cursor on a wiki link in a text editor and retrieving information on-the-fly about the page that is linked to. Not having to implement Markdown inline parsing would make getting that ID much simpler. Something similar to that would be my use case at least. And for what it's worth, I don't know of any wiki that does not treat the link ID verbatim.

mity commented 5 years ago

I was thinking about tighter integration, like putting the cursor on a wiki link in a text editor and retrieving information on-the-fly about the page that is linked to. Not having to implement Markdown inline parsing would make getting that ID much simpler. Something similar to that would be my use case at least. And for what it's worth, I don't know of any wiki that does not treat the link ID verbatim.

I've got the message that keeping the ID verbatim may be a good idea, at least from some POV.

But how can your editor know at least whether it is inside a wikilink syntax and it should scan for it from the cursor both to left and right to reach the [[ and ]]? Again, consider your cursor may be inside a code span and that can contain some [[ and ]], yet it is not a wikilink. Or that your cursor may be in a list item, ]] follows, but preceding [[ may be in a previous list item on the previous line. Or they may be in different table cells. Or that some of the brackets may be escaped. Or preceded with \, yet not being escaped because the backslash itself may be escaped.... Parsing Markdown is complicated, that's why MD4C has thousands of lines of code.
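The backslash case alone already shows why a naive scan is fragile. A minimal sketch of just that one check follows (`is_escaped` is a hypothetical helper; it deliberately ignores code spans, block structure and everything else mentioned above):

```c
#include <stddef.h>

/* Returns 1 if the character at s[pos] is escaped, i.e. preceded by an
 * odd number of consecutive backslashes, and 0 otherwise. An even count
 * means the backslashes escape each other and s[pos] is left alone. */
static int is_escaped(const char *s, size_t pos)
{
    size_t backslashes = 0;

    while (pos > backslashes && s[pos - backslashes - 1] == '\\')
        backslashes++;
    return (backslashes % 2) == 1;
}
```

So in `\[` the bracket is escaped, while in `\\[` it is not, because the second backslash is itself escaped.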

On the other hand, there were some other feature requests in the past that would allow using MD4C for e.g. syntax highlighting of the Markdown format in a text editor. So far it has not been implemented, but there is a dummy never-called callback (MD_PARSER::syntax()). (It does not even have a clear function prototype yet.)

My preliminary idea was that when (in the future) an app sets the callback non-NULL, the callback would be called during the parsing in order to inform the app about things like "Here at offset 1234, an inline link starts. Here at offset 2000, an inline URL starts. Here at offset 2010 it ends, etc."; or maybe it could rather work in terms of ranges rather than begin/end events.

The primary motivation was to allow an app to do syntax highlighting of a Markdown source, but I mention it because it (when eventually implemented) might provide the info you need.

(But I see that as orthogonal to the wikilinks extension, so if interested in that, we should open yet another issue for it and not discuss it here.)
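To make the range-based variant of that idea a bit more concrete, here is a purely hypothetical sketch of what an app-side consumer could look like. None of these names exist in MD4C, and MD_PARSER::syntax() has no defined prototype. The `demo_` types are assumptions made up for this illustration.

```c
#include <stddef.h>

/* Hypothetical kinds of source ranges a parser could report. */
typedef enum {
    DEMO_RANGE_LINK,
    DEMO_RANGE_URL,
    DEMO_RANGE_WIKILINK
} demo_range_kind;

/* Hypothetical range report: bytes [beg, end) of the Markdown source. */
typedef struct {
    demo_range_kind kind;
    size_t beg;
    size_t end;
} demo_range;

/* Hypothetical callback shape: called once per reported range. */
typedef int (*demo_syntax_cb)(const demo_range *range, void *userdata);

/* Editor-side state: which wikilink, if any, contains the cursor? */
typedef struct {
    size_t cursor;
    int found;
    demo_range hit;
} cursor_query;

/* Callback that remembers the wikilink range containing the cursor. */
static int find_wikilink_at_cursor(const demo_range *r, void *userdata)
{
    cursor_query *q = userdata;

    if (r->kind == DEMO_RANGE_WIKILINK &&
        r->beg <= q->cursor && q->cursor < r->end) {
        q->found = 1;
        q->hit = *r;
    }
    return 0;   /* 0 = continue parsing, by assumption */
}
```

With something of this shape, the editor would not need to scan left and right from the cursor on its own. It would only compare the cursor offset against ranges the parser has already validated.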

niblo commented 5 years ago

But how can your editor know at least whether it is inside a wikilink syntax and it should scan for it from the cursor both to left and right to reach the [[ and ]]? Again, consider your cursor may be inside a code span and that can contain some [[ and ]], yet it is not a wikilink. Or that your cursor may be in a list item, ]] follows, but preceding [[ may be in a previous list item on the previous line.

Or they may be in different table cells. Or that some of the brackets may be escaped. Or preceded with \, yet not being escaped because the backslash itself may be escaped.... Parsing Markdown is complicated, that's why MD4C has thousands of lines of code.

You are right, and it won't work perfectly, but making such an error is fairly innocent.

On the other hand, there were some other feature requests in the past that would allow using MD4C for e.g. syntax highlighting of the Markdown format in a text editor. So far it has not been implemented, but there is a dummy never-called callback (MD_PARSER::syntax()). (It does not even have a clear function prototype yet.)

My preliminary idea was that when (in the future) an app sets the callback non-NULL, the callback would be called during the parsing in order to inform the app about things like "Here at offset 1234, an inline link starts. Here at offset 2000, an inline URL starts. Here at offset 2010 it ends, etc."; or maybe it could rather work in terms of ranges rather than begin/end events.

The primary motivation was to allow an app to do syntax highlighting of a Markdown source, but I mention it because it (when eventually implemented) might provide the info you need.

That would be very, very useful, and that would make a good interface too.

mity commented 5 years ago

@niblo I think we are more or less in agreement on how it could work.

You asked for some hints for implementation in the personal e-mail, so here are my two cents on how I would do it.

I would start with only [[foo]] (no pipe) first.

  1. Add a new MD_SPANTYPE for the wikilinks, some new corresponding structure to provide the wikilink ID to the caller, and a new flag enabling it, into the public header <md4c.h>.

  2. I would likely try to change the priority so that wiki links and ordinary inline links have the same priority, and are above breaking the table cells, so | is first tried for the wiki links rather than as a table cell delimiter. (This should be an easy change in md_analyze_inlines().)

  3. The core of it should be possible to implement in md_resolve_links(). If the wikilinks extension is enabled, then instead of unconditional "resolving" of a normal link, the code should check whether the opening [ is preceded by another unresolved [ mark and the closing ] is followed by another unresolved ], and if so resolve it as a potential wikilink.

    That has to involve killing the outer MD_MARK structures from further processing (e.g. by setting their MD_MARK::ch to 'D') and expanding the ones you are analyzing in the current md_resolve_links() call to cover both characters so the [ and ] are excluded from the normal text flow and do not appear in the text() callback.

    (Study how MD_MARK structs work; also note the function is called in some specific order which guarantees that for [... [....]...] the inner is analyzed 1st. See also how ordinary links store some richer info (destination, link title) into the MD_MARK structure and do it for the wikilink ID.)

  4. The code in md_process_inlines() which fires the events for the link has to be expanded to check whether the opener/closer MD_MARK is two characters long, and in such a case fire the events for the wikilink instead.
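The check in step 3 and the mark expansion can be sketched on a raw string as follows. This is only an illustration under simplifying assumptions: the `mark` type and `try_expand_to_wikilink` are hypothetical, and real code would have to look at unresolved MD_MARK entries (not raw characters) and also kill the outer marks.

```c
#include <stddef.h>

/* Hypothetical stand-in for a mark: the half-open byte range [beg, end). */
typedef struct {
    size_t beg, end;
} mark;

/* `opener` and `closer` describe an already matched inner [ ... ] pair in
 * `s`. If the pair is directly wrapped in another '[' and ']', expand both
 * marks to cover two characters each and return 1. Otherwise leave them
 * untouched and return 0, so the pair can be handled as an ordinary link. */
static int try_expand_to_wikilink(const char *s, size_t len,
                                  mark *opener, mark *closer)
{
    if (opener->beg == 0 || s[opener->beg - 1] != '[')
        return 0;
    if (closer->end >= len || s[closer->end] != ']')
        return 0;

    opener->beg--;   /* opener now covers "[[" */
    closer->end++;   /* closer now covers "]]" */
    return 1;
}
```

For `x [[foo]] y` the marks grow to cover `[[` and `]]`, and the ID is the text between them. For `[[foo](url)` the check fails because the closer is followed by `(`, which matches the suggestion above to treat that input as a regular link with a literal [ before it.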

Only when the above works would I start thinking about [[foo|bar]]. It should be possible like this:

  5. Make sure md_build_mark_char_map() enables seeing | when the wikilinks extension is enabled (it already happens for the tables extension).

  6. Then md_resolve_links() would have to additionally check whether there is an unresolved | MD_MARK between the wikilink [[ and ]] marks. If so, it should resolve that MD_MARK too (so that it becomes unusable if a table handler later examines that mark as well) and additionally expand the closer ]] MD_MARK so it hides the ID from the normal text flow.

As part of the PR, please also add a new file into the test dir which serves both as a simple reference manual and as a test suite for the feature (see e.g. tables.txt). Also make it run from scripts/run-tests.sh.

mity commented 5 years ago

Wait. The pipe case (steps 5 and 6) would likely lead to O(n^2) behavior and potential DoS attacks for input like

[[[[[[id]]]]]] ...

because we would likely have to scan for the pipe mark for each pair of [[ and corresponding ]] over a bigger and bigger range of MD_MARK structures (as md_resolve_links() gets called from the innermost to the outermost pairs of the opener/closer marks).

So that part needs some more thinking.
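The quadratic growth can be made visible with a small cost model. The sketch below is a simplification: it works on raw characters and single brackets instead of MD_MARK structures, and `naive_pipe_scan_cost` is a hypothetical name. It counts how many characters a naive search for | inspects when it is repeated for every matched bracket pair, innermost first.

```c
#include <stddef.h>

#define MAX_DEPTH 1024   /* arbitrary nesting limit for this sketch */

/* For every matched [ ... ] pair in `s`, search the enclosed range for
 * the first '|' and count each inspected character. Returns the total. */
static size_t naive_pipe_scan_cost(const char *s, size_t len)
{
    size_t stack[MAX_DEPTH];   /* offsets of unresolved '[' */
    size_t depth = 0, cost = 0;

    for (size_t i = 0; i < len; i++) {
        if (s[i] == '[') {
            if (depth == MAX_DEPTH)
                break;
            stack[depth++] = i;
        } else if (s[i] == ']' && depth > 0) {
            size_t open = stack[--depth];
            /* The enclosed range grows with every outer pair. */
            for (size_t j = open + 1; j < i; j++) {
                cost++;
                if (s[j] == '|')
                    break;
            }
        }
    }
    return cost;
}
```

In this model, n nested pairs around `id` cost n(n+1) inspections: 42 for the 6-deep input above and 156 for a 12-deep one, so doubling the depth roughly quadruples the work.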

niblo commented 4 years ago

Added in #92.