sergiocorreia / panflute

A Pythonic alternative to John MacFarlane's pandocfilters, with extra helper functions
http://scorreia.com/software/panflute/
BSD 3-Clause "New" or "Revised" License

Using panflute as a writer rather than a filter #217

Closed by amine-aboufirass 2 years ago

amine-aboufirass commented 2 years ago

Is it possible to trick panflute into acting more like a pandoc writer than a pandoc filter? In particular, I am interested in parsing the AST to go from something like this in Markdown:

- item
  - subitem
- item

To something like this in LaTeX:

\begin{itemize}
    \item item
    \begin{itemize}
        \item subitem
    \end{itemize}
\end{itemize}

I know there are already built-in writers for this in pandoc, but I'm very much interested in building my own.

How do I go about doing this and where can I start?

amine-aboufirass commented 2 years ago

Ok, so I built a simple script to try to do this. Here's ref.md:

- item
    - subitem
- item

Here's write.py:

import panflute as pf

def action(elem, doc):
    if isinstance(elem, pf.elements.BulletList):
        print(elem)

if __name__ == "__main__":
    with open("ref.md") as fs:
        markdown = fs.read()

    doc = pf.convert_text(markdown, standalone=True)
    doc.walk(action)

If I do python write.py I get the following output:

BulletList(ListItem(Plain(Str(subitem))))
BulletList(ListItem(Plain(Str(item)) BulletList(ListItem(Plain(Str(subitem))))) ListItem(Plain(Str(item))))

Basically, doc.walk goes through the deepest items in the list before reaching the shallowest items. So I'm not sure how I can use panflute to achieve what I want.
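The order printed above is what a depth-first traversal produces when it visits children before their parent. Here is a minimal sketch of that visiting order. It uses a hypothetical `Node` class as a stand-in for panflute elements and is not panflute's actual implementation:

```python
class Node:
    """Hypothetical stand-in for a panflute element: a label plus children."""

    def __init__(self, label, *children):
        self.label = label
        self.children = list(children)

    def walk(self, action):
        # Visit every child first, then the node itself (post-order).
        for child in self.children:
            child.walk(action)
        action(self)


# Mirrors ref.md: an outer list whose first item holds a nested list.
inner = Node("inner BulletList", Node("subitem"))
outer = Node("outer BulletList", Node("item", inner), Node("item"))

visited = []
outer.walk(lambda node: visited.append(node.label))
print(visited)
```

With this order, the nested list is always handed to the action before the list that contains it, which matches the two lines of output from write.py.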

I've come across this with "classic style" writers in Lua and there is a thread in the pandoc mailing list about this issue. So I end up with the following (more specific) questions:

amine-aboufirass commented 2 years ago

Looks like this has also been discussed here https://github.com/sergiocorreia/panflute/issues/84

Though rather inconclusively, in my opinion....

sergiocorreia commented 2 years ago

Something like this should work:

import panflute as pf

def action(elem, doc):

    if isinstance(elem, pf.BulletList):
        # Instead of pf.stringify(item) you can also do item.content[0].content[0].text
        text = '\n'.join(pf.stringify(item) for item in elem.content)
        text = text.split('\n')
        text = ''.join('\n    ' + row for row in text)
        text = '\n\\begin{itemize}' + text + '\n\\end{itemize}'
        return pf.CodeBlock(text)

    elif isinstance(elem, pf.ListItem) and isinstance(elem.parent, pf.BulletList):
        text = r'\item ' + pf.stringify(elem)
        return pf.ListItem(pf.Plain(pf.Str(text)))

if __name__ == "__main__":
    with open("ref.md") as fs:
        markdown = fs.read()

    doc = pf.convert_text(markdown, standalone=True)
    doc.walk(action)
    print(doc.content[0].text)

Output:

\begin{itemize}
    \item item1
    \begin{itemize}
        \item subitem1
    \end{itemize}
    \item item2
\end{itemize}

Basically, this filter creates a code block (it can be anything, really) that stores the formatted text. It's a bit more cumbersome than I would have liked, but if you don't care much about maintaining indentation as you go deeper into the nesting, you can simplify the join/split lines.
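To see why the join/split step preserves indentation, here is a string-only sketch of the same idea. The helper name `wrap_itemize` is made up for illustration, and plain strings stand in for the stringified list items:

```python
def wrap_itemize(rows):
    """Wrap already-formatted rows in an itemize environment.

    A row may span several lines (a nested environment), so the rows are
    joined, re-split per line, and every line is indented one level.
    """
    lines = "\n".join(rows).split("\n")
    body = "".join("\n    " + line for line in lines)
    return "\\begin{itemize}" + body + "\n\\end{itemize}"


# The inner list is formatted first, as in a depth-first walk.
inner = wrap_itemize([r"\item subitem"])
# The first outer item already carries the formatted inner list.
outer = wrap_itemize([r"\item item" + "\n" + inner, r"\item item"])
print(outer)
```

Because every enclosing list indents all the lines it receives, the nested environment gains one extra level each time it is wrapped.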

Also, note that here we exploit the fact that we go depth-first, as we first format the more nested items. You could also create more customized walkers that just go shallow-first and thus simplify the filter code.
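As a rough illustration of the shallow-first alternative, the sketch below renders a nested list top-down with plain recursion. It does not use panflute: the `(text, children)` pairs are a hypothetical stand-in for list items, and `render` is a made-up name.

```python
def render(items, depth=0):
    """Render a nested list top-down into indented LaTeX lines.

    `items` is a hypothetical stand-in for a bullet list: a list of
    (text, children) pairs, where `children` has the same shape.
    """
    pad = "    " * depth
    lines = [pad + r"\begin{itemize}"]
    for text, children in items:
        lines.append(pad + "    " + r"\item " + text)
        if children:
            # The depth is known on the way down, so no re-indenting is needed.
            lines.extend(render(children, depth + 1))
    lines.append(pad + r"\end{itemize}")
    return lines


tree = [("item", [("subitem", [])]), ("item", [])]
latex = "\n".join(render(tree))
print(latex)
```

Since the recursion starts at the outermost list, each level knows its own depth, which is what removes the need for the join/split re-indentation.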

amine-aboufirass commented 2 years ago

@sergiocorreia thanks for your response. That works.

The end result, however, is still wrapped in a CodeBlock. This is a problem for me because I'd like to use panflute in conjunction with a template which I already have defined:

\documentclass[a4paper]{article}

\usepackage{cite}
\usepackage[nonumberlist]{glossaries}
\usepackage{hyperref}
\usepackage[margin=2cm]{geometry}
\usepackage{graphicx}
\usepackage{array}
\usepackage{mfirstuc}
\usepackage[official]{eurosym}

\makeglossaries

\graphicspath{{./images/}}

\newglossaryentry{LabView}
{
    name={LabView},
    description={
        system-design platorm and development environment for associated visual 
        programming language%
    }
}

\newglossaryentry{VI}
{
    name={VI},
    description={Virtual Instrument}
}

\begin{document}

    \tableofcontents
    \clearpage

    $body$

    \clearpage
    \bibliography{bibliography}
    \bibliographystyle{abbrv}
    \clearpage
    \printglossaries

\end{document}

So the result of whatever gets processed by panflute is dumped into the placeholder $body$. I rewrote your code in the panflute filter format:

import panflute as pf

def action(elem, doc):
    if isinstance(elem, pf.elements.BulletList):
        text = '\n'.join(pf.stringify(item) for item in elem.content)
        text = text.split('\n')
        text = ''.join('\n    ' + row for row in text)
        text = '\n\\begin{itemize}' + text + '\n\\end{itemize}'

        return pf.CodeBlock(text)

    elif isinstance(elem, pf.ListItem) and isinstance(elem.parent, pf.BulletList):
        text = r'\item ' + pf.stringify(elem)
        return pf.ListItem(pf.Plain(pf.Str(text)))

def main(doc=None):
    return pf.run_filter(action, doc = doc)

if __name__ == "__main__":
    main()

Using the above, I ran the following command:

pandoc -F write.py --template custom_template.latex ref.md

Which yielded the following output:

\documentclass[a4paper]{article}

\usepackage{luacode}
\usepackage{cite}
\usepackage[nonumberlist]{glossaries}
\usepackage{hyperref}
\usepackage[margin=2cm]{geometry}
\usepackage{graphicx}
\usepackage{array}
\usepackage{mfirstuc}
\usepackage[official]{eurosym}
\usepackage{luacode}

\makeglossaries

\graphicspath{{./images/}}

\newglossaryentry{LabView}
{
    name={LabView},
    description={
        system-design platorm and development environment for associated visual 
        programming language%
    }
}

\newglossaryentry{VI}
{
    name={VI},
    description={Virtual Instrument}
}

\begin{document}

    \tableofcontents
    \clearpage

    <pre><code>
\begin{itemize}
    \item item
    \begin{itemize}
        \item subitem
    \end{itemize}
    \item item
\end{itemize}</code></pre>

    \clearpage
    \bibliography{bibliography}
    \bibliographystyle{abbrv}
    \clearpage
    \printglossaries

\end{document}

As you can see, the content is added where it needs to be (i.e. at $body$), but the <pre> tag is still there, which makes sense because the script you proposed wraps everything in pf.CodeBlock.

So technically, pandoc is still writing HTML (the default), because panflute acts as a filter and not a writer. I'd like to circumvent the pandoc writer and dump straight to my template. Is there some sort of workaround?

sergiocorreia commented 2 years ago

Maybe replacing CodeBlock with something else would work? E.g., using "Plain"?

amine-aboufirass commented 2 years ago

Yes, thanks! But you do have to wrap the text in a Str object first. This is what worked for me:

import panflute as pf

def action(elem, doc):
    if isinstance(elem, pf.elements.BulletList):
        text = '\n'.join(pf.stringify(item) for item in elem.content)
        text = text.split('\n')
        text = ''.join('\n    ' + row for row in text)
        text = '\n\\begin{itemize}' + text + '\n\\end{itemize}'

        return pf.Plain(pf.Str(text))

    elif isinstance(elem, pf.ListItem) and isinstance(elem.parent, pf.BulletList):
        text = r'\item ' + pf.stringify(elem)
        return pf.ListItem(pf.Plain(pf.Str(text)))

def main(doc=None):
    return pf.run_filter(action, doc = doc)

if __name__ == "__main__":
    main()

amine-aboufirass commented 2 years ago

Let me just add that the pandoc command must be adjusted as well. Otherwise pandoc's LaTeX writer kicks in (it is inferred from the .tex output extension), which we don't want in this case. Passing -t plain overrides that behavior:

pandoc -F write.py -t plain --template custom_template.latex -o test.tex test.md

badumont commented 2 years ago

Just to explain what happened in your first try above: the <code> and <pre> tags appear because pandoc -F write.py --template custom_template.latex ref.md does not specify an output format (so it defaults to HTML) and you wrap your LaTeX code in a CodeBlock, which is like a listing in LaTeX. You should get the desired output by replacing return pf.CodeBlock(text) with return pf.RawBlock(text, 'latex') (passing format='latex' as a keyword argument also works) and by specifying on the command line either -t latex or an output file ending with .tex.