write a markdown parser

lotabout commented 8 years ago

Well, this is a also a MUST DO item for me. But I am always fail to really start the project. I've read quite a lot of article about it and cannot find a tutorial. Which is actually not necessary.

Well, the above are just some complains about myself. I should finish it in the end.

Resources

Common Mark -- a markdown that tries to standardize markdown's syntax

lotabout commented 8 years ago

Well, after so many effort of researching, I feel that the most difficult part of implementing a markdown is what to implement! There are so many variants of markdown, they all add their own features and they are useful in real cases of writing.

Why there are so many variants of Markdown

So my conclusion is:

Markdown is popular because its simplicity.
The simplicity causes the lack of features, and that why there are so many variants around.
By adding so many features, markdown would no longer be simple and that makes it no longer markdown.

I've loved markdown the first time I use it. What I love most about it is that it is human readable even before the conversion. I've used reStructuredText, as a whole it is more expressive than markdown for allowing a lot of features, but I think that is why others don't like it, too complex.

What Markdown should be like

The first time that I heard John Gruber(The original author of markdown) is against commonMark which trys to standardize markdown I was confused. But now I an somehow with him.

Markdown should be simple, intuitive and out of box

CommonMark did a great job of specify a lot of corner cases and how we should deal with it. But somehow it make things not so intuitive any more. For example, lists in commonmark should be indented, for example

1. ---
    code here
    ---

However in real case, we might past some code, and result in no indentation:

1. ---
code here
---

commonMark will not treat it as code any more.

What should we developer do?

There are so many corner case of usage that Markdown is unable to and should not cover. Then how should developer choose which to implement?

We all like perfect things, but that is not how the world work. 8/2 rule may suggest that 20% of current features will suite 80% of use case. Take GFM(Github Flavored Markdown) as an example. It is definitely not as conscious and powerful as commonMark(at least I think so), but for me, it is completely OK.

Well, I know I complaint a lot, I am just trying to convince myself that I don't need all of them.

lotabout commented 8 years ago

Obstacle 1: Block Parsing

The first obstacle of parsing markdown is how to parse nested list. To be more specific, is how to parse containing blocks.

First Brainstorm

A first thinking would be to save current indent, and cut out the same indent when following lines come:

1. | a
   | b
   | ... <- to cut out contents before '|'
unindented content should break the list.

In this way, sub blocks won't have to recognize the existence of indentation, all they have to do is parse like there is no indent.

Lazy paragraph stops us to be lazy

However, now we need to support lazy line. What is a lazy line?

1. A very long paragraph can
starts without indent, this is useful when we don't want to indent every line
of a paragraph in the list

------ is the same to -------

1. A very long paragraph can starts without indent, this is useful when we don't want to indent every line of a paragraph in the list

Now the unindented line should not directly stop the list item, we will need information about the contents of the item. For example:

1.
should break the list

1. paragraph
should NOT break the list

So the first brainstorm now fails. The handle of indentation cannot be separated with the contents of its item. Take CommonMark's example: that two blank lines should break out all list, but when its content is a code block, do not break.

So, actually not only lazy paragraph, but also code blocks drive us from separating the parsing into irrelevant pieces.

How to parse then?

The problem now is how to modularize these parts. As far as I know for now, there are 2 methods

Outer-Inner

Keep a reference to current inner-most block. When we got a new line, parse it from outer list to inner list, strip the needed indent. Since we know the type of inner-most block, we are able to determine when to continue and when to break.

if (line.indented):
    strip the indentation
elif (parser.last-child is paragraph):
    continue
else:
    break

This way, if we add a new block type that will consume unindented contents, we'll have to change the way we handle lists.

CommonMark uses this method.

Save and Read

Now we keep the indentation information of each layer of nested lists. Now handle the parsing control to inner contents. For example when we encounter a paragraph inside a list item, it checks the indentation information and continue or break accordingly.

def paragraph():
    if current.indentation < block.indentation:
        continue to process

Bad thing about this method is that all block parser should know and handle the indentation, which seems should not be handled by them.

markdown-it uses this method.

Which one to choose

Since the outer lists and inner block should share information, we will have to choose which end should know about the other. Either outer list or inner block. There is no silver bullet.

But since the syntax will not be changed very often, I think the outer-inner way is better.

lotabout commented 8 years ago

Confused by HTML Block

The main problem are:

It would be too much work to fully parse HTML blocks, need a total HTML parser.
It would be hard and ambiguous to use regex to parse HTML. For example:
1. how to define a HTML block? start/end tag? what if nested?
2. is it possible to use line-based parsing strategy instead of block based?

Block based parsers are more flexible with HTML parsing because they are able to fetch all texts at the same time. Also, nesting can be supported as well.

The parsing strategy would be find the opening-tag and find the corresponding closing-tag in the following texts. Nested tags can be supported if wanted by introducing tag stack.

For line-based parsers(those CommonMark-based) are much more limited in this. Because it should parse the text line by line, and if a opening-tag is split into two lines, the parser would not know. So CommonMark had one rule that the line after a HTML block belongs to this block. Also, a blank line will break the HTML block. Which I strongly don't like.

I think the line-based parsers are more powerful, but it did increase the difficulty to parse HTML blocks. I don't like the CommonMark's way, it seems to be really hacky and not intuitive for both users and developers.

lotabout commented 8 years ago

Inline Parser

After dive into the code for inline parsing, I found that allowing nested structures had LARGELY increased the complexity of the parser. So why would that exists?

I myself only uses Markdown's simple rules such as:

*em* **strong**
[label](link)

Don't even use the link title. So as a developer, such strange use case did bothers me, but I do understand that it is important to include them in the specification.

One the of the use case for nested label when parsing links is use an image as the label, so it would be:

[![img alt](img_link)](Actual Link)

And I am convinced that nested links are needed. Well, I just figured out the commonMark do not actually support this.

lotabout commented 8 years ago

The structure of the output

I always think that tree would be the best for representing the parsing output, however in Markdown's case, maybe tree structure is not needed. And maybe that is why markdown-it chooses token stream as the output.

Also walking through the AST may increase the difficulty of implementing one, because we should add structure to walk through the tree.

lotabout commented 8 years ago

The end of the journey

Finally, I implemented my own markdown parser! Of course based on so many other implementations. And here it is: pymkd. I never thought it would be that hard.

lotabout / ideas