Unable to get Panflute filter comments.py to work correctly

sergiocorreia / panflute

A Pythonic alternative to John MacFarlane's pandocfilters, with extra helper functions

BSD 3-Clause "New" or "Revised" License


<VirtualHost *:80>
    #ServerName www.example.com

    ServerAdmin webmaster@localhost
    DocumentRoot /var/www/html

    #LogLevel info ssl:warn

    ErrorLog ${APACHE_LOG_DIR}/error.log
    CustomLog ${APACHE_LOG_DIR}/access.log combined

    #Include conf-available/serve-cgi-bin.conf
</VirtualHost>

<xmp>
<VirtualHost *:80>
    #ServerName www.example.com

    ServerAdmin webmaster@localhost
    DocumentRoot /var/www/html

    #LogLevel info ssl:warn

    ErrorLog ${APACHE_LOG_DIR}/error.log
    CustomLog ${APACHE_LOG_DIR}/access.log combined

    #Include conf-available/serve-cgi-bin.conf
</VirtualHost>
</xmp>

Hey @rgval,

I think I can see why the syntax of your Apache config files is confusing Pandoc. Although they look like HTML, they are a different type of file that is not designed for marking up a web page. When Pandoc tries to convert them as HTML, the output comes out garbled.

In an old Pandoc issue about reading plaintext, someone suggests converting the file to Markdown by adding three backticks (```) on new lines at the beginning and end of the document and giving it the .md extension. Unless three backticks appear somewhere else in your document (in which case, see the above link for a workaround), Pandoc won't alter the content of the file, because it will interpret it as a block of code. In other words, this should allow you to convert to .docx as you describe.

Here is a command that should work in Ubuntu to make this conversion (see here for how this works):

sed -e '1 s/^/```\n/' -e '$ s/$/\n```/' FILE > temp.md
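If sed is not available, or the file may itself contain backticks, the same wrapping can be sketched in Python. This is only an illustration and not part of the original suggestion: wrap_in_fence is a hypothetical helper name, and it relies on Pandoc's rule that a fenced code block may use a longer row of backticks than any row appearing inside it.

```python
import re


def wrap_in_fence(text: str) -> str:
    """Wrap text in a backtick fence longer than any backtick run it contains.

    Hypothetical helper: mirrors the sed command, but picks a longer fence
    when the text already contains three or more backticks in a row.
    """
    runs = re.findall(r"`+", text)
    longest = max((len(run) for run in runs), default=0)
    fence = "`" * max(3, longest + 1)
    # Make sure the closing fence starts on its own line.
    body = text if text.endswith("\n") else text + "\n"
    return f"{fence}\n{body}{fence}\n"
```

Writing the result of wrap_in_fence to temp.md should give Pandoc the same kind of input as the sed command does.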

I know it isn't ideal to do that conversion before we send it through Pandoc / panflute, but it's the best I could come up with. Now, we need a panflute filter to take out all those comments!

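import re
import panflute as pf

def remove_comments(text):
    """Strip anything that looks like an HTML-style comment from text."""
    # Non-greedy match with DOTALL so a comment may span several lines;
    # the optional trailing newline avoids leaving an empty line behind.
    return re.sub("(<!--.*?-->\n?)", "", text, flags=re.DOTALL)

def action(e, doc):
    # The whole config file arrives as a single code block,
    # so only CodeBlock elements need to be cleaned.
    if isinstance(e, pf.CodeBlock):
        e.text = remove_comments(e.text)

def main():
    pf.toJSONFilter(action=action)

if __name__ == "__main__":
    main()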

I modified the comments.py script you mentioned, adding a new remove_comments() function that uses a regular expression (regex) to remove anything that looks like a comment, i.e. anything wrapped in <!-- and -->. I should note that regular expressions can be pretty unwieldy, both because they are difficult for humans to interpret and because they are not very smart about finding things in context. However, for a project like this where you just need to get something done, I think it will work well enough. If you end up wanting to parse the Apache config files in a more complex way, I'd recommend using a package like this one to get started.
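To make the behaviour of that regular expression concrete, here is a small standalone check that needs only the standard library. The sample strings are made up for illustration; the function body is copied from the filter.

```python
import re


def remove_comments(text):
    # Same expression as in the filter: non-greedy, and DOTALL lets "."
    # match newlines so multi-line comments are removed too.
    return re.sub("(<!--.*?-->\n?)", "", text, flags=re.DOTALL)


# Made-up sample input with a one-line and a two-line comment.
sample = (
    "<!-- one-line comment -->\n"
    "ServerAdmin webmaster@localhost\n"
    "<!-- spans\ntwo lines -->\n"
    "DocumentRoot /var/www/html\n"
)
cleaned = remove_comments(sample)

# The regex has no notion of context: a comment-like sequence inside a
# quoted value is removed as well.
quoted = 'Header set X-Note "<!-- not a comment -->"\n'
stripped = remove_comments(quoted)
```

Here cleaned keeps only the two directive lines, while stripped loses the quoted value, which is the kind of context blindness mentioned above. This snippet is only a demonstration; the filter itself is the script shown earlier.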

Running our temporary Markdown file through Pandoc with this filter successfully removes the comments. First, save the filter script above as comments.py in your current directory. Here is the command I used to test it:

pandoc --filter comments.py --output out.docx temp.md

If you're like me and you just want one command to do everything:

sed -e '1 s/^/```\n/' -e '$ s/$/\n```/' PATH/TO/INPUTFILE | pandoc --from markdown --filter comments.py --output finaloutput.docx
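For anyone who would rather drive the whole thing from Python, the one-liner above could be sketched roughly as follows. This is an assumption-laden sketch, not something tested in this thread: pandoc_command and convert are hypothetical names, pandoc is assumed to be on the PATH, and comments.py is assumed to be in the current directory.

```python
import subprocess
from pathlib import Path


def pandoc_command(output_path, filter_path="comments.py"):
    """Build the same pandoc invocation as the one-liner (Markdown on stdin)."""
    return [
        "pandoc",
        "--from", "markdown",
        "--filter", filter_path,
        "--output", output_path,
    ]


def convert(input_path, output_path, filter_path="comments.py"):
    """Fence the file as a code block and pipe it through pandoc."""
    text = Path(input_path).read_text()
    body = text if text.endswith("\n") else text + "\n"
    markdown = "```\n" + body + "```\n"
    # check=True raises CalledProcessError if pandoc exits with an error.
    subprocess.run(
        pandoc_command(output_path, filter_path),
        input=markdown,
        text=True,
        check=True,
    )
```

Calling convert("PATH/TO/INPUTFILE", "finaloutput.docx") is meant to correspond to the shell pipeline above.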

I hope this helps, and good luck on your project!


Unable to get Panflute filter comments.py to work correctly #212