sergiocorreia / panflute

A Pythonic alternative to John MacFarlane's pandocfilters, with extra helper functions
http://scorreia.com/software/panflute/
BSD 3-Clause "New" or "Revised" License

Unable to get Panflute filter comments.py to work correctly #212

Closed by rgval 2 years ago

rgval commented 2 years ago

Background: I am trying to convert a large number of HTML documents to Word (docx) format. The HTML documents contain configuration files from programs such as apache2, postfix, etc. These configuration files use tag syntax that confuses pandoc.

Here is a short part of the file I want to convert (apache2 config):

<VirtualHost *:80> ServerAdmin webmaster@localhost DocumentRoot /var/www/html ErrorLog ${APACHE_LOG_DIR}/error.log CustomLog ${APACHE_LOG_DIR}/access.log combined

Ideally, it would show up in the Word document exactly as seen above. However, pandoc interprets the VirtualHost tags as HTML and drops both the opening and the closing tag, leaving only the content between them.

Problem: I am trying to use the panflute comments.py filter to force pandoc to ignore the tag syntax in the files. However, no matter how I lay out the section, the filter does not seem to work.

Here is how I have tried it:

<!-- BEGIN COMMENT -->

<VirtualHost *:80>
    #ServerName www.example.com
    ServerAdmin webmaster@localhost
    DocumentRoot /var/www/html
    #LogLevel info ssl:warn
    ErrorLog ${APACHE_LOG_DIR}/error.log
    CustomLog ${APACHE_LOG_DIR}/access.log combined
    #Include conf-available/serve-cgi-bin.conf
</VirtualHost>

<!-- END COMMENT -->

and


<!-- BEGIN COMMENT -->
<xmp>
<VirtualHost *:80>
    #ServerName www.example.com
    ServerAdmin webmaster@localhost
    DocumentRoot /var/www/html
    #LogLevel info ssl:warn
    ErrorLog ${APACHE_LOG_DIR}/error.log
    CustomLog ${APACHE_LOG_DIR}/access.log combined
    #Include conf-available/serve-cgi-bin.conf
</VirtualHost>
</xmp>
<!-- END COMMENT -->

In both configurations there is a blank line before the BEGIN COMMENT line and after the END COMMENT line. In the first configuration the BEGIN COMMENT and END COMMENT lines are also bracketed by blank lines.

In either case the filter does not appear to work, as pandoc continues to remove the VirtualHost tags.

I am using:

pandoc 2.18 (compiled with pandoc-types 1.22.2, texmath 0.12.5, skylighting 0.12.3, citeproc 0.7, ipynb 0.2, hslua 2.2.0)
panflute 2.1.3
python 3.10
Ubuntu Mate 21.10

CLI: pandoc -s --filter comments.py -f html -t docx input/10.97.123.40.html -o output/10.97.123.40.docx

Any assistance would be greatly appreciated.

jacobwhall commented 2 years ago

Hey @rgval,

I think I can see why Pandoc is getting confused by the syntax of your Apache config files. While they might look like HTML, they are actually a different type of file that is not designed for marking up a web page. When Pandoc tries to convert them with HTML in mind, the output comes out all weird.

In an old Pandoc issue about reading plaintext, someone suggests converting it to markdown by adding three backticks (```) on new lines at the beginning and end of the document and using the .md extension. Unless three backticks appear somewhere else in your document (in which case, see the above link for a workaround), Pandoc won't mess with the content of the file because it will interpret it as a block of code. In other words, this should allow you to convert to .docx as you describe.

Here is a command that should work in Ubuntu to make this conversion (see here for how this works):

sed -e '1 s/^/```\n/' -e '$ s/$/\n```/' FILE > temp.md
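If you would rather do the wrapping step in Python than in sed, here is a minimal sketch. The function name wrap_in_fence is my own and is not part of pandoc or panflute. It also picks a fence longer than any run of backticks already in the text, which is one way to handle files that contain three backticks, relying on pandoc's rule that a longer fence can enclose shorter backtick runs.

```python
import re


def wrap_in_fence(text: str) -> str:
    """Wrap text in a Markdown code fence so pandoc reads it as a code block.

    Hypothetical helper for illustration; not part of pandoc or panflute.
    """
    # Find the longest run of backticks already present in the text.
    runs = re.findall(r"`+", text)
    longest = max((len(run) for run in runs), default=0)
    # A fence needs at least three backticks and must be longer than any run inside.
    fence = "`" * max(3, longest + 1)
    # Make sure the closing fence starts on its own line.
    body = text if text.endswith("\n") else text + "\n"
    return f"{fence}\n{body}{fence}\n"


if __name__ == "__main__":
    sample = "<VirtualHost *:80>\n    ServerAdmin webmaster@localhost\n</VirtualHost>\n"
    print(wrap_in_fence(sample))
```

You could write the returned string to temp.md and pass that file to pandoc the same way as the sed output.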

I know it isn't ideal to do that conversion before we send it through Pandoc / panflute, but it's the best I could come up with. Now, we need a panflute filter to take out all those comments!

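import re
import panflute as pf

def remove_comments(text):
    # Strip every <!-- ... --> comment plus one trailing newline, if present.
    # ".*?" is non-greedy, so each comment is matched on its own;
    # re.DOTALL lets "." cross line breaks for multi-line comments.
    return re.sub(r"(<!--.*?-->\n?)", "", text, flags=re.DOTALL)

def action(e, doc):
    # Only code blocks are changed; every other element passes through as-is.
    if isinstance(e, pf.CodeBlock):
        e.text = remove_comments(e.text)

def main():
    # Read the pandoc JSON AST from stdin, apply action, write it to stdout.
    pf.toJSONFilter(action=action)

if __name__ == "__main__":
    main()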

I modified the comments.py script you mentioned, adding a new remove_comments() function that uses regex (a regular expression) to remove things that look like comments, i.e. anything wrapped in <!-- -->. I should note that regular expressions can be pretty unwieldy, both because they are difficult for humans to interpret and because they are not very smart about finding things in context. However, for a project like this where you just need to get something done, I think it will work well enough. If you end up wanting to parse the Apache config files in a more complex way, I'd recommend using a package like this one to get started.

Running our temporary markdown file through Pandoc using this filter successfully removes the comments. First, save the above code into a file named comments.py in your current directory. Here is the command I used to test it:
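To make the regex's behaviour concrete, here is a small standalone check of remove_comments() that does not need pandoc or panflute. The sample text is made up for illustration. It shows that each comment is removed separately, including one that spans two lines, while the config lines are left alone.

```python
import re


def remove_comments(text):
    # Same pattern as in the filter above.
    return re.sub(r"(<!--.*?-->\n?)", "", text, flags=re.DOTALL)


# Made-up sample: a config snippet surrounded by comments.
sample = (
    "<!-- BEGIN COMMENT -->\n"
    "<VirtualHost *:80>\n"
    "    DocumentRoot /var/www/html\n"
    "</VirtualHost>\n"
    "<!-- multi\n"
    "line -->\n"
    "<!-- END COMMENT -->\n"
)

cleaned = remove_comments(sample)
print(cleaned)
```

One limitation to keep in mind: the pattern has no notion of context, so a config file that legitimately contains the text <!-- ... --> would lose that text as well.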

pandoc --filter comments.py --output out.docx temp.md

If you're like me and you just want one command to do everything:

sed -e '1 s/^/```\n/' -e '$ s/$/\n```/' PATH/TO/INPUTFILE | pandoc --from markdown --filter comments.py --output finaloutput.docx

I hope this helps, and good luck on your project!