ocaml / omd

extensible Markdown library and tool in "pure OCaml"
ISC License

[WIP] add a CST structure #306

Open tatchi opened 1 year ago

tatchi commented 1 year ago

This is a WIP to add a concrete syntax tree (CST) structure that preserves some details of the original markdown syntax. It is a draft PR and still needs a lot of work. I'm opening it now to start a discussion and get early feedback.

The first two commits are from @patricoferris' work on https://github.com/patricoferris/omd/tree/omd-print. A few commits after those are not very relevant.

https://github.com/ocaml/omd/commit/a3a8faebaaaf22931db09a2f492fdfd6f166dd5d is where the CST structure is added. This is a very basic implementation: I copied a lot of the current AST code and implemented functions that go from AST to CST. I'm not sure I understood the discussion in https://github.com/ocaml/omd/issues/223 regarding the CST structure and how it should be implemented. I'm pretty sure there's a much better solution than what I have here.

Subsequent commits add details to the CST so that information is not lost when trying to print the structure back to string.

What I realized is that there is some information that we need to keep in order not to change the "meaning" of the markdown.

This is the case with:

\## hello

On master, when parsing the above markdown into an AST, we correctly parse it as regular text and not a heading because of the escape character \, but the escape character is not preserved in the AST:

# Omd.of_string "\\## hello";;
- : Omd.doc = [Omd.Paragraph ([], Omd__.Ast_inline.Text ([], "## hello"))]

So when we print the AST back to a string and parse it again, it becomes a heading, which is obviously incorrect and needs to be fixed.
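
To make the failure mode concrete, here is a minimal sketch of a printer-side guard that re-adds the backslash when paragraph text would otherwise be read as an ATX heading. It is not code from this PR and not part of omd's API: the function names are hypothetical, and it assumes the text starts at column 0 (it ignores the up to three spaces of indentation that CommonMark allows before the `#`).

```ocaml
(* Hypothetical helper, not part of omd: returns true when [s], printed at
   the start of a line, would be parsed as an ATX heading, i.e. it starts
   with 1 to 6 '#' characters followed by a space, a tab, or end of line. *)
let looks_like_atx_heading s =
  let n = String.length s in
  let rec count i = if i < n && s.[i] = '#' then count (i + 1) else i in
  let hashes = count 0 in
  hashes >= 1 && hashes <= 6
  && (hashes = n || s.[hashes] = ' ' || s.[hashes] = '\t')

(* Escape the first '#' so that the text remains a paragraph when the
   printed string is parsed again. *)
let escape_paragraph_start s =
  if looks_like_atx_heading s then "\\" ^ s else s

let () =
  print_endline (escape_paragraph_start "## hello"); (* prints \## hello *)
  print_endline (escape_paragraph_start "#hello")    (* prints #hello *)
```

A guard like this keeps the "meaning" intact without storing anything extra, but it cannot tell whether the author originally wrote an escape that was not strictly needed; only a CST that records the backslash can reproduce that.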

Besides that, there are other missing pieces of information that make the string we generate different from the original, but don't change the "meaning" of the markdown. That's the case with the emphasis character, for example.

# Omd.of_string "__hello__";;
- : Omd.doc =
[Omd.Paragraph ([],
  Omd__.Ast_inline.Strong ([], Omd__.Ast_inline.Text ([], "hello")))]

The AST doesn't record whether the emphasis character is _ or *. But in the end, we can choose whichever we want when we print the AST back to a string: it won't change the "meaning" and the HTML will be the same. Actually, Pandoc doesn't keep this information either:

printf "__hello__" | pandoc --from commonmark --to json | pandoc --from json --to commonmark
**hello**
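
As a rough illustration of the difference, the sketch below shows what keeping the delimiter in a CST could look like, next to a normalising step that mimics what the AST (and Pandoc) effectively does. The types and function names are hypothetical and much simpler than omd's real inline type; they are only meant to show the trade-off.

```ocaml
(* Hypothetical types for illustration; these are not omd's actual CST. *)
type emph_delim = Star | Underscore

type inline =
  | Text of string
  | Emph of emph_delim * inline
  | Strong of emph_delim * inline
  | Concat of inline list

let delim_char = function Star -> "*" | Underscore -> "_"

(* Print the tree back using the delimiter that was recorded. *)
let rec to_string = function
  | Text s -> s
  | Emph (d, i) -> delim_char d ^ to_string i ^ delim_char d
  | Strong (d, i) ->
    let marker = delim_char d ^ delim_char d in
    marker ^ to_string i ^ marker
  | Concat is -> String.concat "" (List.map to_string is)

(* Forget the recorded delimiter and always use '*', which models
   printing from a tree that never stored it. *)
let rec normalise = function
  | Text _ as t -> t
  | Emph (_, i) -> Emph (Star, normalise i)
  | Strong (_, i) -> Strong (Star, normalise i)
  | Concat is -> Concat (List.map normalise is)

let () =
  let doc = Strong (Underscore, Text "hello") in
  print_endline (to_string doc);            (* prints __hello__ *)
  print_endline (to_string (normalise doc)) (* prints **hello** *)
```

With the delimiter stored, the printed string matches the input exactly; without it, the output differs textually but renders to the same HTML.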

I'm wondering what we're aiming for in our case? Do we strictly want to print back the exact same string we parsed, or is it fine as long as the markdown result/HTML output is the same?