seth-brown / formd

A Markdown formatting tool
MIT License
147 stars, 19 forks

Adjacent square brackets inside code are mistakenly matched as links #15

Closed: anko closed this issue 4 years ago

anko commented 9 years ago

Inside backticks

Input file:

`[x][y]`

Command: formd -r

Output:

`[x][1]`

[1]: y

Expected output:

`[x][y]`

Inside code blocks

Input file:

    [x][y]

(note the 4-space indent)

Command: formd -r

Output:

[x][1]

[1]: y

Expected output:

    [x][y]

The same happens for code indented by 1 tab instead of 4 spaces (the alternative code block syntax).


Maybe it would be better to internally use a dedicated markdown parser like mdast rather than reimplement something that can catch all these fiddly details?
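To illustrate how fiddly the regex route gets, here is a minimal Python sketch of a workaround that skips code before matching reference links. It is not formd's actual implementation: the `REF_LINK` and `CODE_SPAN` patterns and the `find_ref_links` helper are hypothetical names made up for this example, and the rules are simplified (single-backtick spans only, no fenced blocks, and any line indented by 4 spaces or a tab is treated as code even though it could be a list continuation).

```python
import re

# Hypothetical pattern for reference-style links such as [text][ref];
# formd's real pattern may differ.
REF_LINK = re.compile(r"\[([^\]]+)\]\[([^\]]+)\]")
# Inline code spans delimited by single backticks (a simplification).
CODE_SPAN = re.compile(r"`[^`\n]*`")


def is_indented_code(line):
    """True for lines indented by 4 spaces or a tab (simplified rule)."""
    return line.startswith("    ") or line.startswith("\t")


def find_ref_links(text):
    """Return (text, ref) pairs for links that sit outside of code."""
    links = []
    for line in text.splitlines():
        if is_indented_code(line):
            continue  # skip indented code block lines entirely
        # Blank out code spans so brackets inside them cannot match.
        masked = CODE_SPAN.sub(lambda m: " " * len(m.group()), line)
        links.extend(REF_LINK.findall(masked))
    return links
```

With this, `find_ref_links` ignores both inputs from this report, while still picking up a real link that shares a line with a code span. The cases it gets wrong are the argument for using a real parser.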

anko commented 9 years ago

Answering my own question regarding mdast with a brief review:

mdast's CLI app competes with formd: It converts links to inline-style links by default, but --setting "reference-links" makes it do reference-style.

It handles this edge case well:

$ mdast --setting "reference-links" <<< "    [x][y]"
[x][y]
$ mdast <<< "    [x][y]"
[x][y]

Formd still has the advantage that it operates directly on text. Mdast's AST is abstract by definition, so it doesn't encode formatting details and might hence change them.

Formd is also marginally faster (presumably due to having no AST), but even a big MD doc ran acceptably fast in both:

$ time formd < syntax.text > /dev/null
formd < syntax.text > /dev/null  0.02s user 0.00s system 94% cpu 0.024 total
$ time mdast < syntax.text > /dev/null
mdast < syntax.text > /dev/null  0.10s user 0.01s system 105% cpu 0.095 total

To differentiate the projects, it might be best for formd to continue operating directly on markdown source.

seth-brown commented 9 years ago

Thanks for the detailed information, An. Yes, using a real Markdown parser is definitely the way to go. When I first wrote formd, there were no Markdown parsers I could use, so I was forced to use a hacky regex. Most parsers convert Markdown to HTML, but formd needs a way to extract Markdown objects and format them.

One potential solution I've considered is to convert the Markdown to HTML first, parse out the URLs from the HTML, convert the HTML back to Markdown, and then format the links as reference or inline Markdown. I actually have an unpublished code branch that basically does this. The only remaining issue is that I don't know of a parser that can convert HTML back to Markdown. Do you know of any parser that can do this?
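For reference, the middle step of that round trip (pulling URLs out of rendered HTML) could be sketched with Python's standard-library `html.parser`. This is only an illustration and not the unpublished branch: `LinkExtractor` and `extract_links` are hypothetical names, and the sketch assumes the Markdown has already been rendered to HTML by some other tool.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def extract_links(html):
    """Return a list of (url, text) pairs found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

Because code renders as `<code>` rather than `<a>`, bracket pairs inside code never show up as links here. The hard part remains converting the HTML back to Markdown.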

mdast looks like a good solution too, but I'm not sure if using a Node.js AST tool is the right approach. As you point out, it's slower and would create dependencies on mdast and Node on top of the current Python requirements. Do you know of any comparable tools written in Python? Again, thanks for the help!

anko commented 9 years ago

Converting MD→HTML→MD would be like refactoring C++ by turning it into C and back again, by which I mean pretty crazy! :smile: I can't find a Python-based HTML-to-Markdown converter, and I imagine it would be super slow.

The closest thing I could find to a module exposing a Markdown syntax tree: Mistune has an open issue about it, and a related patch looks to have been merged; this test shows how to access the tree. Ideally that functionality should be extracted to a separate module…

Musings of caution: Turning this project into a syntax tree transformer would essentially mean reimplementing mdast in Python, with all the duplication of work that involves.

seth-brown commented 9 years ago

Indeed, it is quite crazy :smile:!

My other idea is to use Pandoc's internal markdown parser with formd. I'm hesitant to do this as it would turn formd into a Haskell project instead of a Python project. Perhaps creating a new formd2 project would be the right way to go.

I'm rather busy with a few projects right now, but I'll try and come back to this issue of integrating a parser utility into formd in the next few weeks. Please feel free to take a stab at integrating Mistune into formd, if you'd like to do so.