Add support for Pandoc style markdown

lukesmurray / markdown-anki-decks

Tool for converting markdown files into anki decks

MIT License

Add support for Pandoc style markdown #22

Open kamalsacranie opened 2 years ago

kamalsacranie commented 2 years ago

I write all my notes using Pandoc, a powerful document converter. Using Pandoc's Python package would let us solve the backslash-escaping problem that makes math syntax clunky, and would let us use single and double dollar signs for math.

It would also give us the option of using Pandoc-style divs:

::: {data-question=}
## Multi-line front

With markdown preserved within the div that is created
:::

Turns into:

<div data-question="">
    <h2>Multi-line front</h2>
    <p>With markdown ....</p>
</div>
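
As a rough sketch of how a later step could pick such divs out of the generated HTML: the snippet below collects the text inside every div carrying a `data-question` attribute. It uses the standard library's `html.parser` only to stay self-contained; the class and function names are hypothetical and this is not how the project itself does it.

```python
from html.parser import HTMLParser


class QuestionDivParser(HTMLParser):
    """Collect the text inside every <div data-question="..."> element."""

    def __init__(self):
        super().__init__()
        self.questions = []  # one string per question div found
        self._depth = 0      # div nesting depth inside the current question div
        self._chunks = []    # text fragments of the current question div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth > 0:
            self._depth += 1  # a nested div inside a question div
        elif any(name == "data-question" for name, _ in attrs):
            self._depth = 1   # entering a new question div
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:  # the question div itself just closed
                self.questions.append(" ".join(self._chunks))

    def handle_data(self, data):
        if self._depth > 0 and data.strip():
            self._chunks.append(data.strip())


def extract_questions(html):
    """Return the text content of each data-question div in `html`."""
    parser = QuestionDivParser()
    parser.feed(html)
    parser.close()
    return parser.questions
```

Feeding it the HTML above would yield a single entry containing the heading text followed by the paragraph text.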

I've implemented this change locally and haven't had any problems. There are some cons:

The project would either have to depend on Pandoc, or
have Pandoc as an optional dependency, enabled by passing a boolean flag on the CLI.
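
The optional-dependency route could look roughly like this. It is a minimal sketch: `pandoc_available` and `choose_parser` are hypothetical helper names, and the only assumption about the package is that it is importable as `pandoc`.

```python
import importlib.util


def pandoc_available() -> bool:
    """Return True if the optional `pandoc` Python package can be imported."""
    return importlib.util.find_spec("pandoc") is not None


def choose_parser(use_pandoc: bool) -> str:
    """Pick a parser name, failing early if pandoc was requested but is missing."""
    if not use_pandoc:
        return "python-markdown"
    if not pandoc_available():
        raise RuntimeError(
            "Pandoc support was requested but the 'pandoc' package is not installed."
        )
    return "pandoc"
```

Checking up front means users who never pass the flag are unaffected by whether Pandoc is installed.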

There are many pros, however. Markdown files that weren't written for Anki conversion would need fewer changes. In fact, we would be writing pure Pandoc markdown, which gets converted to HTML via an abstract syntax tree.

Just a thought

lukesmurray commented 2 years ago

Definitely an interesting idea. I'm a huge fan of pandoc so I understand why it would be so enticing. Would you be open to sharing how you implemented it locally so I can check it out? If it doesn't complicate the project too much I'm open to discussing how we could integrate alternative parsers.

kamalsacranie commented 2 years ago

It was quite simple to implement.

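import pandoc
from bs4 import BeautifulSoup, Tag

# frontmatter, read_file and Deck are assumed to come from the existing script.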

def is_math_class(tag: Tag) -> bool:
    """Check if an HTML tag is a math oriented tag generated by pandoc"""
    try:
        return "math" in tag["class"]
    except KeyError:
        return False

def parse_markdown(
    file: str, deck_title_prefix: str, generate_cloze_model: bool
) -> Deck:
    """Parse a markdown string to an anki deck."""
    metadata, markdown_string = frontmatter.parse(read_file(file))
    doc = pandoc.read(markdown_string)
    html = pandoc.write(doc, format="html", options=["--mathjax"])

    soup = BeautifulSoup(html, "html.parser")

    # Find all the math tags using filter
    math_tags = soup.find_all(is_math_class)
    for tag in math_tags:
        tag.unwrap()  # Done for cleaner html in Anki

    ...

The rest of the script is identical

lukesmurray commented 2 years ago

Interesting. On the one hand, we could make this a command-line flag. I would probably call it md-parser and have it accept either python-markdown or pandoc as its value. However, I want to think about the default options we pass to pandoc. I also want to make sure that users' decks don't break if they switch parsers.

As an example, we support multiline questions, which I believe use python-markdown-specific syntax.

So while I love how simple this is, it requires a little bit of thought and care before we can go ahead and add it.
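
For reference, the flag described above might be sketched like this. `argparse` is used only to keep the example self-contained, the project's actual CLI may be built differently, and `build_arg_parser` and the program name are hypothetical.

```python
import argparse

# The two parser names mentioned in this thread.
PARSERS = ("python-markdown", "pandoc")


def build_arg_parser() -> argparse.ArgumentParser:
    """Build a CLI parser with an --md-parser option."""
    parser = argparse.ArgumentParser(prog="md-parser-sketch")
    parser.add_argument(
        "--md-parser",
        choices=PARSERS,
        default="python-markdown",  # existing decks keep using the current parser
        help="Markdown parser used to render cards.",
    )
    return parser
```

Defaulting to python-markdown means nothing changes for existing decks unless a user opts in with `--md-parser pandoc`.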

wrvsrx commented 2 years ago

Maybe we could add an option to receive the pandoc AST (output in JSON format) from stdin and operate on it. That would let us convert any input format that pandoc supports, and it would also let us add custom pandoc filters. I'm working on such a change.
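
A minimal sketch of operating on that JSON, assuming the usual pandoc AST layout in which a Div block is `{"t": "Div", "c": [[id, classes, key_values], blocks]}`. The function names are hypothetical, and only divs nested inside other divs are followed, not divs inside lists or block quotes.

```python
import json


def load_ast(stream):
    """Parse a pandoc JSON document from a text stream such as sys.stdin."""
    return json.load(stream)


def iter_divs(blocks, attr):
    """Yield every Div block, including Divs nested in Divs, that has attribute `attr`."""
    for block in blocks:
        if block.get("t") != "Div":
            continue
        (_ident, _classes, keyvals), children = block["c"]
        if any(key == attr for key, _value in keyvals):
            yield block
        # Look for further matching Divs inside this Div.
        yield from iter_divs(children, attr)
```

The input would come from something like `pandoc notes.md -t json`, piped to a script that calls `load_ast(sys.stdin)` and then `iter_divs(ast["blocks"], "data-question")`.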

lukesmurray commented 2 years ago

Given that we have multiple people interested in this, I'll try to add support for it fairly soon.